Abstract

In this work, an optical-flow-based pose tracking method with long short-term memory for known uncooperative spacecraft is proposed. In combination with a segmentation network, we constrain the optical flow area of the target to cope with harsh lighting conditions and highly textured backgrounds. With the introduction of a long short-term memory structure, the proposed method maintains robust and accurate tracking performance even over long image sequences. In our experiments, pose tracking is evaluated on both synthetic images and images from the SwissCube dataset. By comparing with state-of-the-art pose tracking frameworks, we demonstrate the performance of our method and, in particular, its improvements under complex environments.

1. Introduction

Known uncooperative spacecraft 6D pose tracking is crucial for on-orbit operations, e.g., docking, rendezvous, servicing, and space debris removal [1]. These operations rely on precise and robust estimation of the relative pose under harsh lighting conditions and against highly textured backgrounds [2]. Considering size, mass, and power constraints, monocular sensors enable rapid pose determination for an uncooperative target with lower power consumption, hardware complexity, cost, and mass requirements.

In traditional but still state-of-the-art monocular pose determination methods for spaceborne applications, features (e.g., edges, corners, and lines) are first extracted from the images. By matching features between images, these methods establish correspondences between the 2D pixels and the initial 3D model and then calculate the relative pose of the target. To maintain high overlap, matching is often performed on adjacent images, which yields more accurate correspondences. However, this pose tracking process cannot avoid the accumulation of errors, which may even lead to failure of the tracking mission.

As objects are more than just a collection of edges and geometric primitives, traditional vision methods cannot effectively exploit the model information of satellites whose models are known or have been roughly reconstructed, while convolutional neural networks (CNNs) can learn features that are more complex and meaningful to the task at hand while ignoring background features (e.g., clouds) based on context. Over the past decade, nearly all computer vision tasks have become increasingly dominated by CNNs [3–6]. Compared to prior techniques, CNNs have been shown to be more resilient to noise and better able to generalize to previously unseen scenarios. The 2019 Satellite Pose Estimation Challenge (SPEC) [7], hosted by Stanford University and the European Space Agency (ESA), saw all of its top-performing submissions employ CNN-based deep learning models.

Moreover, unlike imagery captured for terrestrial applications, space imagery is characterized by high contrast, low signal-to-noise ratio, and low sensor resolution. When the model is known or a relatively complete point cloud has been obtained in advance, CNNs have been verified to produce more accurate correspondences [8]. After establishing correspondences between the object's 3D model and 2D pixel locations, the 6D pose is usually calculated by a perspective-n-point (PnP) algorithm based on RANSAC [9, 10]. Methods that extract correspondences from a single frame have been proven to give good estimation results [11–14].
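As a concrete illustration of this standard PnP+RANSAC step, the sketch below builds synthetic 2D-3D correspondences and solves for the pose with OpenCV; the intrinsics, pose, and noise level are placeholder values rather than settings used in this work.

```python
import numpy as np
import cv2

# Hypothetical 2D-3D correspondences: the 2D points are generated by projecting
# random model points with a known pose and adding noise, so the example is self-consistent.
K = np.array([[800., 0., 256.],
              [0., 800., 256.],
              [0., 0., 1.]])                                   # assumed intrinsics
R_true, _ = cv2.Rodrigues(np.array([0.1, -0.2, 0.05]))
t_true = np.array([[0.0], [0.0], [5.0]])

pts_3d = np.random.uniform(-0.5, 0.5, (50, 3))
cam = R_true @ pts_3d.T + t_true                               # points in the camera frame
pts_2d = ((K @ cam)[:2] / (K @ cam)[2]).T
pts_2d += np.random.normal(0, 1.0, (50, 2))                    # noisy pixel observations

# RANSAC-based PnP rejects outlier correspondences before solving for the pose.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d.astype(np.float32), pts_2d.astype(np.float32), K, None,
    reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
R_est, _ = cv2.Rodrigues(rvec)                                 # recovered rotation matrix
```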

For space target image sequences under harsh lighting conditions and serious background interference, it is difficult to extract features correctly from some images, which causes mistakes in pose estimation based on a single frame. In this case, the temporal information of the image sequence should be considered effectively and appropriate features established [15]. Wen et al. [16] proposed a novel neural network architecture that keeps long-term track of an object's pose robustly and efficiently from an RGB-D video sequence. By predicting the relative pose between the current observation and a synthetic rendering of the model at the previous prediction, the method cleverly establishes the connection between frames and makes tracking more stable.

As the motion of a space target often has a certain regularity, long image sequences allow a better analysis of the target's motion. The research in [17] introduced a keyframe memory pool to store the most informative historical observations, which not only greatly reduces the tracking drift problem but also effectively addresses noisy segmentation and external occlusions from the interaction.

However, most space images do not have depth maps, and the depth range of the target is very large. Inspired by the above methods, on the basis of utilizing interframe information, we can also input a set of image sequences to the network. Through a long-term memory cell, the motion pattern of the target can be analyzed, and pose tracking becomes more stable and accurate.

Therefore, in this article, a 6D pose tracking network for known uncooperative spacecraft image sequences is constructed, which takes the target's 2D interframe pixel transformation as input and the 3D pose transformation as output. The main contributions of this work are as follows: (1) the target optical flow is used as the description of the 2D pixel transformation between frames, and a segmentation network is introduced to constrain the target optical flow area and reduce the cumulative errors caused by target pixel offset during the pose tracking process; (2) a long-term memory cell of the LSTM network is used to learn the movement characteristics of the space target, and by inputting a sequence of target optical flows, a robust and precise tracking network is established.

2. Related Work

This paper focuses on 6D pose tracking based on interframe relevance. Over the years, many methods have been proposed to describe the transformation between adjacent frames. Optical flow has been widely used in computer vision owing to its good tracking performance under small angle changes. Unlike traditional feature point matching, optical flow can obtain more accurate interframe pixel changes to determine correspondences, which effectively utilizes interframe information and yields higher matching accuracy.

Among flow-based tracking methods, sparse optical flow is widely used in many applications due to its efficiency and accurate calculation [18]. For space images, considering the complex lighting conditions and highly textured background, partial overexposure or underexposure occurs from time to time. In order not to rely too heavily on a small number of feature points, dense optical flow, an image registration method that matches images point by point, is introduced to compute the offset of every pixel and form a dense optical flow field.
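For illustration, the snippet below computes a dense flow field with OpenCV's Farneback algorithm as a generic stand-in for the dense matching adopted in this work; the frames and parameters are placeholders.

```python
import cv2
import numpy as np

# Placeholder frames; in practice these are consecutive grayscale images of the target.
prev_gray = (np.random.rand(512, 512) * 255).astype(np.uint8)
next_gray = (np.random.rand(512, 512) * 255).astype(np.uint8)

# Farneback dense flow as a generic stand-in for the dense matching of Section 3.1.
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=4, winsize=21,
    iterations=3, poly_n=5, poly_sigma=1.1, flags=0)
# flow[y, x] = (dx, dy): per-pixel displacement from the first frame to the second.
```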

Although dense optical flow has proven robust to rapid and irregular motion, the computational burden increases significantly because the optical flow of every pixel participates in the calculation. Brox et al. [19] therefore proposed combining contour matching with optical flow, using staggered contour matching to reduce the accumulation of tracking errors and optimize the tracking effect. This work shows that integrating the two constraints can not only improve computational efficiency but also avoid cumulative drift in the pixel tracking process.

In order to establish constraints on the target area and focus on the target optical flow, we first segment the image independently. Encouraged by the successful application of deep learning to object classification and recognition [20–23], researchers have explored solving the semantic segmentation problem of pixel-level image labeling. Since the target model is known, fully convolutional networks (FCN) [24] can be used to segment the image at the pixel level and obtain a high-precision mask. The FCN replaces the fully connected layers of a convolutional neural network with convolutional layers, so that each pixel receives a prediction value and the output is a segmented image instead of a classification score, and the feature map is restored to the original image size through deconvolution layers. Because space backgrounds are relatively uniform, the method is efficient and accurate, and the independently segmented two-dimensional pixels also prevent errors caused by accumulated offsets.

Although the combination of dense optical flow and a segmentation network solves the problem of the transformation of two-dimensional pixels, we still need to make full use of the regularity of the target movement in the image sequence. Without introducing sensors other than the camera, optimization algorithms can be introduced to analyze the movement pattern of the target. Shantaiya et al. [25] combined optical flow and the Kalman filter to track targets in video datasets. This method can reduce the cumulative error in the tracking process when the target movement has a certain regularity. In robotics, the temporal information in video data is also very important for optimizing pose estimation and plays an important role in tasks such as route planning and active sensing [26–30].

With the development of deep learning, researchers have begun to use networks to solve continuous, multidimensional data prediction tasks. Wang et al. [31] proposed a recurrent neural network (RNN)-based method to estimate the depth and camera pose of a multiview monocular video sequence. By alleviating the vanishing gradient problem of RNNs, the long short-term memory (LSTM) network has become one of the most advanced architectures for processing time-series data and has been applied to speech recognition [32], image and caption generation [33, 34], and multidimensional image processing [35]. By combining an input gate, a forget gate, and an output gate, LSTM can not only analyze the characteristics of adjacent frames over a short period but also learn the motion laws of targets through the long-term memory cell, optimize tracking performance using time-domain information, and perform well in long-term tracking. Many works also apply LSTM to pose-related tasks, such as motion tracking and motion recognition [36–40]. Through the long-term memory cell, target transformations over the entire image sequence can be used for estimation.

However, for image tracking, LSTM alone is not enough. We also need to convert the image into features for storage and analysis to achieve better results. Therefore, in this article, the target optical flow area is first constrained based on the FCN target segmentation results, and the convolutional long short-term memory (ConvLSTM) architecture is introduced and combined with our pose estimation network, thus effectively utilizing the time-domain information of the image sequence. On this basis, the cumulative errors generated in the tracking process can also be reduced.

3. Approach

Given a sequence of input images, our goal is to estimate the target's 3D rotation and translation. We assume that the target is rigid and that its 3D model is available. In this section, we use the image information of a sequence to calculate the dense optical flow between images and obtain the optical flow of the target area using the mask produced by the FCN network. The optical flow describes the relative movement of the target between image frames. In order to estimate the 6D pose of the target more accurately, we add the target's two-dimensional and three-dimensional coordinates to the network input to enrich the structural information.

Therefore, we use a two-dimensional matrix composed of the initial position of the target optical flow (that is, the 2D target object pixel points of the previous frame), the corresponding optical flow value, and the 3D points of the corresponding pose as the input of the network. Figure 1 depicts the overall architecture for target optical flow pose tracking. In the remainder of this section, we describe each stream in detail.

3.1. Dense Flow

Optical flow contains rich motion information, so it can be used to predict the direction and speed of object movement. When using dense sampling, it is necessary to calculate the dense optical flow for each pixel in the image. Dense optical flow can reduce the uncertainty caused by feature extraction compared with sparse optical flow.

As shown in Figure 2, given two images, small nonoverlapping patches of pixels are extracted from the first image, and each patch is convolved with the second image to obtain a response map. After this first-level response map is computed, sparse convolution is applied to the response maps to compute the response maps of larger patches. This procedure produces a pyramid of response maps.
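A minimal sketch of this first-level step is given below; it assumes small square patches and grayscale tensors and only illustrates the idea of correlating patches of the first image with the second image to obtain response maps.

```python
import torch
import torch.nn.functional as F

# Placeholder grayscale frames (patch size is an assumption).
img1 = torch.rand(1, 1, 128, 128)   # frame t
img2 = torch.rand(1, 1, 128, 128)   # frame t+1

patch_size = 8
# Extract non-overlapping patches from the first image.
patches = F.unfold(img1, kernel_size=patch_size, stride=patch_size)      # (1, 64, n_patches)
patches = patches.transpose(1, 2).reshape(-1, 1, patch_size, patch_size)

# Each patch acts as a correlation kernel over the second image;
# response_maps[:, i] peaks where patch i matches best.
response_maps = F.conv2d(img2, patches, padding=patch_size // 2)         # (1, n_patches, H, W)
```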

By using a max-pooling operator, the response map is made invariant to a one-pixel shift of the patch. Since the response map changes slowly in space, a downsampling step is introduced to reduce the complexity of the following steps. To prevent the response maps from converging too fast during sparse convolution, nonlinear filtering is introduced, and the position of each subpatch is optimized around its pixel.

The response map pyramid is constructed bottom-up, while correspondences are extracted top-down. Local maxima in the different layers of the pyramid generate correspondences between local image patches via matching. To obtain dense correspondences between matched patches (i.e., at local maxima), it suffices to recover the path of response values that generated the maximum. For each local maximum, the corresponding responses are retrieved from each layer of the pyramid to generate quasidense correspondences.

According to the optical flow, the correspondence of 2D points between the two frames can be obtained: for each 2D point in the previous frame, there is a corresponding 2D point in the new frame. We have

$$p_i^{k+1} = p_i^{k} + f_i^{k},$$

where $p_i^{k}$ is the $i$-th target pixel in frame $k$ and $f_i^{k}$ is its optical flow from frame $k$ to frame $k+1$.

We projected the 3D points of the target with the known pose of the previous frame onto the image plane and calculated the dense optical flow between the two frames. On this basis, 2D-3D point correspondences can be established from the 3D points' new positions in the current frame, and the 6D pose of the new image can be predicted.

Assume there is a camera intrinsic parameter matrix $K$, the 2D pixel points $p_i$ of the target in the image, and their corresponding 3D points $P_i$; then, we have

$$s\,\tilde{p}_i = K\,[R \mid t]\,\tilde{P}_i,$$

where $s$ is a scale factor, $\tilde{p}_i$ and $\tilde{P}_i$ denote the homogeneous coordinates of $p_i$ and $P_i$, and $R$ and $t$ are the rotation matrix and translation vector that define the camera pose, respectively. Since $R$ is a rotation, it has three degrees of freedom, and likewise $t$, so $R$ and $t$ have a total of 6 degrees of freedom.
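The following sketch makes this projection-and-update step concrete under the notation above; flow_at is a hypothetical lookup of the dense flow at given pixels.

```python
import numpy as np

def project_points(P, K, R, t):
    """Project Nx3 model points P into the image with intrinsics K and pose (R, t).

    Implements s * p~ = K [R | t] P~ from the text; returns Nx2 pixel coordinates.
    """
    cam = R @ P.T + t.reshape(3, 1)   # 3xN points in the camera frame
    uv = K @ cam                      # homogeneous pixel coordinates
    return (uv[:2] / uv[2]).T         # divide by the scale factor s

# Sketch of one tracking step (all inputs are placeholders):
# p_prev = project_points(P_model, K, R_prev, t_prev)   # 2D points in frame k
# p_curr = p_prev + flow_at(p_prev)                     # shifted by the target optical flow
# The resulting 2D-3D pairs (p_curr, P_model) define the pose of frame k+1.
```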

3.2. Target Segmentation

The dense optical flow calculated for the interframe images contains all the pixels, but we are mainly interested in the optical flow of the target in the image, from which the 6D pose of the target is estimated. A fully convolutional network (FCN) can segment the image at the pixel level to acquire the target mask and then determine the optical flow of the target area. FCN uses deconvolution layers to upsample the feature map of the last convolutional layer and restore it to the size of the input image, so that a prediction is generated for each pixel while the spatial information of the original input image is retained. The convolutional network architecture adopted in this article is shown in Figure 3.

FCN uses skip connections to fuse the deconvolution result with the corresponding forward feature map, thereby obtaining more accurate pixel-level segmentation. By upsampling the low-resolution feature map with rich semantic information to the same size as the high-resolution feature map with rich edge information and then adding them as the final semantic feature map, both robustness and accuracy are ensured. We used FCN-8s to perform upsampling 3 times, fused the 3 deconvolution results, and obtained the final prediction.

Using the VGG model as the pretrained network for initialization, transfer learning and fine-tuning were conducted on the basis of the pretrained weights. By overlaying the target mask on the image's dense flow field, we obtain the dense optical flow of the target in the image.
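A minimal sketch of this segmentation-and-masking step is given below, using torchvision's FCN with a ResNet backbone as a stand-in for the VGG-initialized FCN-8s described above; the class count and image size are assumptions.

```python
import torch
import numpy as np
from torchvision.models.segmentation import fcn_resnet50

# Stand-in FCN with two classes: background and spacecraft (untrained here).
model = fcn_resnet50(weights=None, num_classes=2).eval()

image = torch.rand(1, 3, 512, 512)                          # placeholder input frame
with torch.no_grad():
    logits = model(image)["out"]                            # (1, 2, 512, 512)
mask = logits.argmax(dim=1)[0].numpy().astype(bool)         # per-pixel target mask

# Keep only the optical flow of the target area by overlaying the mask.
flow = np.random.rand(512, 512, 2).astype(np.float32)       # placeholder dense flow field
target_flow = flow * mask[..., None]
```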

3.3. Flow-Based Pose Tracking
3.3.1. Pose Estimation Network

In addition to the optical flow, the initial two-dimensional positions of the optical flow also contain the initial pose information of the target. In order to obtain the three-dimensional pose change of the target between frames, we need to provide this initial pose information to the pose estimation network.

To this end, a simple network architecture was constructed to predict the pose from the target optical flow, as shown in Figure 4. It comprises three main modules [41], including a local feature extraction module with shared network parameters, a feature aggregation module, and a global inference module made of simple fully connected layers.

In the above network structure, an MLP with three layers was used to extract local features from the target optical flow, with weights shared across all points of the target optical flow. A single max-pooling operation was carried out for aggregation, so as to obtain a context representation of fixed dimension without introducing extra parameters. Finally, we regressed this fixed-dimensional vector into the 6D pose output through another MLP. To this end, we adopted three fully connected layers and encoded the final pose as a quaternion and a translation.
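The following PyTorch sketch outlines a network of this kind; the layer widths and the seven-dimensional per-point input (2D position, flow, and 3D point) are assumptions consistent with the description above, not the exact configuration of Figure 4.

```python
import torch
import torch.nn as nn

class FlowPoseNet(nn.Module):
    """Sketch of the pose regressor: shared per-point MLP, max pooling, FC head."""

    def __init__(self, in_dim=7, feat_dim=256):
        super().__init__()
        self.local = nn.Sequential(                 # shared-weight MLP (3 layers)
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU())
        self.head = nn.Sequential(                  # global inference module
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 7))                      # 4 quaternion + 3 translation values

    def forward(self, x):                           # x: (B, N, 7) target optical flow rows
        feat = self.local(x)                        # (B, N, feat_dim) local features
        ctx = feat.max(dim=1).values                # single max-pooling aggregation
        out = self.head(ctx)
        quat = nn.functional.normalize(out[:, :4], dim=1)
        trans = out[:, 4:]
        return quat, trans
```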

The deep network described above gives us a differentiable approach to predict the 6D pose from target optical flow clusters for a given object. Given the target optical flow from the FCN and the dense flow, we need to establish the relationship between the 2D and 3D transformations. To do so, another deep regressor $g_\theta$ with parameters $\theta$ was adopted:

$$(\Delta q, \Delta t) = g_\theta(p, f, P),$$

where $f$ is the target optical flow, $p$ is the 2D initial position of the target optical flow, $P$ is the 3D point of the corresponding pose, $\Delta q$ represents the rotation change, and $\Delta t$ represents the translation change.

3.3.2. Flow-Based Pose Tracking with ConvLSTM

As an extension of recurrent neural networks (RNNs), LSTM introduces a cell that remembers or forgets information adaptively. The long-term memory cell of LSTM provides correlations across consecutive frames, and the short-term memory is used to infer the current state. This refines the outputs to improve the current estimate and reduce accumulated errors over a long sequence. As shown in Figure 5, LSTM mainly includes an input gate, a forget gate, and an output gate. After initializing the long-term memory cell and the hidden cell, the input and the hidden cell enter the LSTM network together. Effective information is processed through the input gate and stored in the long-term memory cell, while invalid information is forgotten. Finally, part of the output from the previous stage becomes the input for the next stage.

ConvLSTM was developed based on LSTM, with a convolutional layer added to better handle spatial image features. Its structure first convolves the input multidimensional matrix and then passes it through the pose estimation unit constructed in the previous section. While outputting the result, a hidden unit is also produced; this hidden cell forms an input with the next frame of the image and enters the next pose estimation cell together. At the same time, each frame retains some information when entering the estimation cell, which is stored in the long-term memory cell. This cell is used to learn the motion pattern of the target. Through continuous input of sequence data, the learning effect is gradually optimized, ultimately improving subsequent pose estimation.
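To make the gating explicit, a minimal ConvLSTM cell is sketched below; the channel sizes are assumptions, and the cell is a generic implementation rather than the exact unit used in this work.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: one convolution produces the input, forget, output,
    and candidate gates; c is the long-term memory, h the short-term hidden state."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g            # update the long-term memory cell
        h = o * torch.tanh(c)        # hidden state passed to the next time step
        return h, (h, c)
```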

As shown in Figure 6, assuming that $\Delta T$ represents the interframe pose transformation and proceeding recurrently stage by stage, all the interframe data of 11 consecutive frames are fed into the network to estimate the $\Delta T$ from the 10th frame to the 11th frame. After obtaining the pose of the 11th frame, we used the images of frames 2-12 to estimate the $\Delta T$ between the 11th and 12th frames, and so on.

To be more specific, a sequence of optical flow between frames is used to predict the 6D pose of the following new frame. Given the input images, we took 11 images as a sequence to calculate the dense optical flow between frames, where $I_k$ denotes frame $k$.
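The resulting sliding-window tracking loop can be sketched as follows; target_flow, estimate_delta_pose, and compose are placeholders for the modules described in the previous sections.

```python
# Sliding-window pose tracking sketch (helper callables are placeholders).
WINDOW = 10  # 11 consecutive frames give 10 interframe target optical flow matrices

def track_sequence(frames, init_poses, target_flow, estimate_delta_pose, compose):
    poses = list(init_poses)                          # known poses of the first frames
    flows = [target_flow(frames[k], frames[k + 1], poses[k])
             for k in range(len(init_poses) - 1)]     # flows between the initial frames
    for k in range(len(init_poses), len(frames)):
        # Flow between the previous and current frame, using the previous pose
        # to project the 3D model and constrain the target area.
        flows.append(target_flow(frames[k - 1], frames[k], poses[k - 1]))
        delta = estimate_delta_pose(flows[-WINDOW:])  # ConvLSTM over the last window
        poses.append(compose(poses[-1], delta))       # pose of frame k
    return poses
```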

The dense optical flow of the target area in the image was used as the input of the network:

$$F_k = \{(p_i^{k}, f_i^{k}, P_i)\}_{i=1}^{N},$$

where $F_k$ denotes the target optical flow matrix between frames $k$ and $k+1$ and $N$ is the number of sampled target pixels.

The network model can be written as Equation (6):

$$\Delta T_{k} = \mathcal{G}_{\Theta}\left(F_{k-10}, F_{k-9}, \dots, F_{k-1}\right), \qquad (6)$$

where $\mathcal{G}_{\Theta}$ denotes the ConvLSTM tracking network with parameters $\Theta$ and $\Delta T_k$ is the pose transformation from frame $k-1$ to frame $k$.

To train it, we minimized the loss function of Equation (7):

$$\mathcal{L} = \left\lVert \hat{R} - R^{*} \right\rVert_{2} + \left\lVert \hat{t} - t^{*} \right\rVert_{2}, \qquad (7)$$

where $\hat{R}$ and $\hat{t}$ are the estimated rotation matrix and translation vector, respectively, and $R^{*}$ and $t^{*}$ are the ground-truth ones. The rotation matrices are computed from the estimated and ground-truth quaternions, which can be done in a differentiable manner.
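A possible PyTorch implementation of such a loss is sketched below; the weighting between the rotation and translation terms is an assumption.

```python
import torch

def quat_to_matrix(q):
    """Differentiable unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q.unbind(dim=1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=1).reshape(-1, 3, 3)

def pose_loss(quat_pred, t_pred, quat_gt, t_gt, w_t=1.0):
    """Rotation + translation loss in the spirit of Equation (7); w_t is an assumed weight."""
    R_pred, R_gt = quat_to_matrix(quat_pred), quat_to_matrix(quat_gt)
    loss_rot = torch.norm(R_pred - R_gt, dim=(1, 2)).mean()    # rotation error
    loss_trans = torch.norm(t_pred - t_gt, dim=1).mean()       # translation error
    return loss_rot + w_t * loss_trans
```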

4. Experiments

Based on both synthetic data and real data from the dataset of SwissCube [42], the proposed flow-based pose tracking approach was compared with more traditional but state-of-the-art pose tracking frameworks and PnP-net [41].

4.1. Metric

In terms of 3D error, the average distance (ADD) error [43] is computed by transforming the 3D model with the predicted pose and with the ground-truth pose, respectively, and averaging the distances between corresponding model points. The pose accuracy over a sequence of images is then calculated. For all test sets, we report ADD-0.1d, for which a predicted pose is considered correct if the ADD error is smaller than 10% of the model diameter. In addition, pose accuracy is evaluated with the rotation and translation errors proposed in [44].
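For reference, the following sketch computes the ADD error and the ADD-0.1d accuracy as defined above; array shapes and units are assumed.

```python
import numpy as np

def add_error(model_pts, R_pred, t_pred, R_gt, t_gt):
    """ADD error: mean distance between model points transformed by the
    predicted pose and by the ground-truth pose."""
    pred = model_pts @ R_pred.T + t_pred
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_01d_accuracy(errors, diameter):
    """ADD-0.1d: fraction of frames whose ADD error is below 10% of the model diameter."""
    return (np.asarray(errors) < 0.1 * diameter).mean()
```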

4.2. Synthetic Data

To simulate the sequence of optical flow, we first gave the target an initial pose and then made it transform regularly. Knowing the 3D mesh of the target and using a virtual calibrated camera (image size , focal length 90 mm), we projected the 3D mesh onto the image plane. After obtaining the 2D pixel coordinates, we simulated the optical flow through the transformation of the target pixel coordinates. The sequence of optical flow was then given to the LSTM as input.

Recall from Section 3.3 that the network regresses the pose from the target optical flow information and expects 2D inputs of the form $(p, f, P)$, where $p$ represents the starting positions of the target optical flow, $f$ represents the flow, and $P$ represents the 3D point cloud under the current frame image.

Obviously, the target's two-dimensional pixels and their optical flow information are too numerous, so we took the resulting target optical flow from 1000 randomly sampled grid cells within the target mask. In order to simulate the solving error of optical flow, Gaussian noise was added to $p$ and $f$. At the same time, outliers were added to $p$ (the target 2D coordinates) in order to simulate the extraction error of the FCN.
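A sketch of this corruption procedure is given below; the image size, flow magnitude, and outlier ratio are placeholder values.

```python
import numpy as np

# Simulated target optical flow for 1000 sampled pixels (shapes follow the text above).
rng = np.random.default_rng(0)
p2d = rng.uniform(0, 512, size=(1000, 2))         # sampled target pixels in frame k
flow = rng.uniform(-5, 5, size=(1000, 2))         # simulated interframe flow

sigma = rng.uniform(0, 15)                        # 2D noise level
p2d_noisy = p2d + rng.normal(0, sigma, p2d.shape)
flow_noisy = flow + rng.normal(0, sigma, flow.shape)

outlier_ratio = 0.2                               # up to 20% outliers on the 2D coordinates
n_out = int(outlier_ratio * len(p2d))
idx = rng.choice(len(p2d), n_out, replace=False)
p2d_noisy[idx] = rng.uniform(0, 512, size=(n_out, 2))   # replace with random pixels
```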

We trained our network for 350 epochs on 35 K synthetic training target optical flows with a batch size of 32 and a learning rate of 0.0001 using the Adam optimizer. During training, we randomly added 2D noise with variance in the range of (0, 15) pixels and created 0% to 20% outliers. For testing, we used 5 K sets of synthetic optical flow data and report the mean pose accuracy in terms of the 3D space reconstruction error.

Combining a PnP algorithm with RANSAC is the most widespread approach to handling noisy correspondences [12]. PnP-net, a correspondence-based pose estimation network, is a recent learning-based alternative to the EPnP method. As shown in Figure 7, we compare the performance of these methods on a sequence of images. Moreover, we provide the pose estimation accuracy on all test datasets in Table 1.

From the results in Figure 8, when the noise level is below 5%, the pixel coordinates of the target key points remain accurate. Precise pixel coordinates indicate that the transformation between the two sets of vectors is highly consistent, and the pose (rotation and translation) estimated by EPnP is more accurate. In practice, however, there is always a significant error in the pixel coordinates (i.e., 2D coordinates) of the extracted key points, and the level of positional noise is often higher than 10% or even 20%. In this regime, PnP-net and the proposed method are much more accurate and robust to increasing noise. Moreover, for different proportions of outliers, the network structure proposed in this article has better anti-interference ability, and over a long sequence, the proposed method yields a more stable tracking effect and better pose tracking accuracy.

4.3. Real Data

The proposed method was validated on real data from a standard space target dataset. The SwissCube dataset [42] is the newest space target dataset, comprising about 40 K images from 100 video sequences; some of the original images are shown in Figure 9. The depth range is to , where $D$ indicates the diameter of the target without taking the antennas into account.

4.3.1. Training Procedure

By inputting all the images and training the FCN network with the known ground-truth target mask images, a target mask segmentation network can be obtained. The training set contains 350 videos with a total of frames; 300 epochs were trained, with 30,000 samples per epoch.

In the first part of the network, we obtained the target area mask for the subsequent determination of the target optical flow. The extraction result for the SwissCube target area is shown in Figure 10 (no background) and Figure 11 (with background).

Assuming that the true poses of the initial 10 images are known, the mask of the target in each image calculated by the network in Part 1 was first used to extract 1000 pixels on the target, and then the dense flow map was introduced to determine the target optical flow. The optical flow values of the 1000 pixels were used, together with the known true pose, to back-project and obtain the 3D mesh points corresponding to these 1000 2D points, forming a matrix, with 10 groups of optical flow forming one target optical flow sequence. As the input of the ConvLSTM network, the $\Delta T$ between the 10th and 11th frames was obtained, and supervised learning was performed according to the ground-truth value. The ConvLSTM training set contains 350 videos with a total of sequences; 300 epochs were trained, with 30,000 sequence samples per epoch. The calculated target optical flow field is shown in Figure 12.

4.3.2. Pose Tracking Results

Using the abovementioned target optical flow sequences, the tracking results obtained are shown in Figure 13. We used PnP-net and PnP+KF (Kalman filter) for comparative experiments. To this end, we calculated the pose tracking error over a sequence of images, as shown in Figure 14, and analyzed the percentage of correct estimates across all test sets, as shown in Table 2.

As shown in Table 2, the traditional optical flow method has a relatively large error compared with the other methods, which is mainly caused by cumulative errors. When the target optical flow is used in the network for end-to-end pose estimation, the error is significantly reduced, reaching an accuracy similar to that of the PnP-net algorithm; however, compared with the methods that introduce a time-domain optimization algorithm, the proportion of correctly estimated poses is about 5% lower. It can be concluded from the tracking error results in Figure 14 and Table 2 that using the interframe target optical flow as input yields a better pose tracking effect, and the pose estimation result is more stable than those of PnP-net [39], wide-depth-range [40], and PnP+KF.

5. Conclusion

In this paper, a relative pose tracking method based on a target optical flow ConvLSTM framework is proposed, which not only uses the target optical flow to improve the accuracy of interframe pose estimation but also analyzes the target's motion pattern over a long sequence. Experiments on synthetic data and the standard SwissCube dataset verify the feasibility and effectiveness of the proposed method under highly textured backgrounds and harsh lighting conditions. The results show that, compared with other methods, the proposed method is more stable and accurate in pose tracking for known uncooperative spacecraft image sequences. Future work will focus on designing an end-to-end network structure so that the target optical flow can be obtained in a direct and efficient way.

Data Availability

The main network structure and other codes used to support the findings of this study have been deposited in the link (https://github.com/guodong0909/Flow-based-LSTM).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.