Abstract

In the era of digital manufacturing, the huge amount of image data generated by manufacturing systems cannot be handled instantly to obtain valuable information due to the limitations (e.g., time) of traditional image processing techniques. In this paper, we propose a novel self-supervised self-attention learning framework, TriLFrame, for image representation learning. TriLFrame is based on a hybrid architecture of convolutional networks and Transformers. Experiments show that TriLFrame outperforms state-of-the-art self-supervised methods on the ImageNet dataset and achieves competitive performance when the features learned on ImageNet are transferred to other classification tasks. Moreover, TriLFrame validates the proposed hybrid architecture, which combines the powerful local convolutional operation with the long-range nonlocal self-attention operation and works effectively in image representation learning tasks.

1. Introduction

Researchers in the field of computer vision have achieved great progress in image recognition techniques; most of these achievements are based on supervised learning methods. For example, ImageNet [1] serves as a large-scale labelled image dataset applicable to all kinds of image learning tasks, on which supervised methods, e.g., ResNet [2] and AlexNet [3], dominate and provide state-of-the-art performance. With the thriving of semisupervised, unsupervised, and self-supervised learning, competitive methods are emerging, e.g., fast-SWA [4], VAT [5], CPC [6], DIM [7], AMDIM [8], and IIC [9]. These methods show that the performance gap between reduced-supervision and supervised methods is shrinking and that the amount of labels required to train a competitive unsupervised or self-supervised method is dramatically decreasing. It is also noted that a certain amount of labels as a guiding reference is too valuable for learning methods to ignore, which usually results in gradually decreasing adoption of fully unsupervised methods [10]. All these studies imply that self-supervised methods are becoming more and more promising in the area of image representation learning, yet we have not seen a self-supervised learning method that surpasses the performance of supervised methods from a general perspective. On the other hand, considering the huge amount of image data generated every day by manufacturing systems, it is reasonable to rethink the methodology of image learning. Specifically, in the environment of manufacturing systems, applications such as robotics, autopilot systems, medical diagnosis, smart home, and smart city systems generate a significant amount of data every day. It is notable that a large portion of the data produced in manufacturing systems is image data, as shown in Figure 1.

Images are generated by various kinds of manufacturing systems at a rapid pace and are ready to be analyzed through technologies such as image classification, object detection, image segmentation, image filtering, and denoising. However, image data cannot be processed instantly due to the performance limitations of the manufacturing systems. Although the research focus is shifting to subdivision fields of image learning, e.g., medical image processing [11], face recognition [12], image analysis in autonomous vehicles [13], and fault diagnosis in manufacturing systems [14–17], processing capability is only gradually catching up with the explosive growth of image data generated in manufacturing environments. It should also be noted that, in order to equip models with specific processing capabilities in subdivision fields, many task-specific image datasets with accurate labelling are created for supervised training. The tremendous effort required to label every image according to the training target directly strangles the development of supervised learning methods. Thus, many researchers attempt to explore alternatives to supervised learning, for example, self-supervised methods that use the structural information of the image data to supervise the learning process. In this paper, we propose a general solution for image representation learning based on the self-supervised method, with the capability of being transferred to subdivision fields with trivial effort.

Current self-supervised learning methods for image processing can be roughly divided into two categories: generative methods and contrastive methods. Generative methods based on Auto-Encoders [18–20] and generative adversarial networks (GAN) [21–23] rely on pixel-level reconstruction error to learn image representations. Relying on pixel-based objectives significantly reduces the capability to model correlations or complex structures and makes the model focus heavily on low-level features instead of abstract representative features. Contrastive methods [6, 7, 24–27] learn image representations by contrasting positive and negative samples in the latent space, which forces the model to discard pixel-wise information and focus on the structure and correlation of the image as a whole. While executing an image learning task, the aim is to obtain a semantic structural embedding of the image that is generalized and can be transferred to subdivision tasks that do not depend on pixel-level details; thus, contrastive methods better fit our purpose. We are inspired by Contrastive Predictive Coding (CPC) introduced in [6], which utilizes a probabilistic contrastive loss (called "the InfoNCE") to force the model to learn the underlying semantic information that is shared across the input sequence. However, when applying CPC to image representation learning, a major issue needs to be addressed properly: as shown in Figure 2, it is difficult to predict image patches containing objects that never appear in the previous content, because CPC lacks knowledge of long-range dependencies across the entire image. In this paper, we equip the CPC framework with the power of self-attention [28], which is skilled at capturing nonlocal long-range dependencies, in order to learn a better semantic structural representation of the image. The intuition is that if the latent embeddings of image patches are sent through the self-attention framework (i.e., a transformer encoder architecture), then each patch embedding will have an impression of the content of the other patches, and features with more correlation to others will be emphasized as a result. This process facilitates the learning of a nonlocal high-level representation of the image.

In this paper, we propose a novel data-analytics-oriented approach for image representation learning with self-supervised learning and self-attention for manufacturing systems, TriLFrame. The framework is aimed at learning nonlocal semantic features of an image with the ability to predict the missing patches of the image in latent space. Although following the idea of contrastive self-supervision, TriLFrame differs from [6, 25] in major aspects: first, TriLFrame applies the self-attention mechanism on top of the backbone convolutional operations to capture long-range dependencies; second, TriLFrame makes use of nonlocal image patches with no overlap to construct positive and negative samples for contrastive learning; and third, we introduce a progressive prediction strategy instead of the simple linear transformation used in [6]. It should also be noted that this paper is an extension of our previous work CPCTR [29], which is a self-supervised self-attention framework for video representation learning. CPCTR utilizes self-attention operations to encode long-range spatio-temporal correlations of video data in order to capture "slow features" in video. Different from CPCTR, this paper makes the following novel contributions: first, specifically for image data processing, we design a new self-supervision pretext task, which is among the first to introduce self-attention to contrastive image representation learning; second, a novel positive and negative sample construction is designed for contrastive learning that only requires the spatial information of the data; and third, under the context of manufacturing systems, we conduct experiments on different image datasets to show the effectiveness of the proposed method.

The contributions of this paper are summarized as follows: (1) we propose the self-supervised self-attention coding framework for image learning in manufacturing environments; (2) we apply the transformer encoder in TriLFrame to learn nonlocal spatial dependencies and thus better learn the semantic representation of images, and we experiment on the self-attention module in TriLFrame to reveal its effectiveness; and (3) we evaluate TriLFrame on the ILSVRC ImageNet competition dataset [30], as many authors do [31–33]. With unlabeled image data, we show that a pretrained TriLFrame can be easily transferred to image classification tasks with competitive performance.

2. Related Work

2.1. Contrastive Learning

Based on the theory of Noise Contrastive Estimation (NCE) [34], contrastive learning uses classification tasks to discriminate positive samples from negative samples. The learning process greatly depends on operations in the latent space of the input data (i.e., the input data is preprocessed to reduce its dimension), which forces contrastive learning to pay more attention to semantic structural representations and less attention to low-level pixel-wise features. In order to improve the efficiency of contrastive learning, users are required to carefully select positive and negative samples. Generally, negative samples that are hard to discriminate can greatly improve the learning quality. Contrastive learning has been proven competitive in the contexts of natural language processing [35, 36], audio processing [6], image processing [24, 27], video understanding [37, 38], etc., and a number of studies have investigated the prospect of contrastive learning with no negative pairs [39] and no momentum encoder [40]; their performance shows a promising prospect for contrastive learning. Recently, a new contrastive learning approach, Contrastive Predictive Coding (CPC) [6], proposes an effective framework that can be applied to sequenced data modalities, e.g., natural language, audio, video, or images (an image can be cut into a spatial sequence of image patches). CPC encodes underlying shared features that vary slowly across the data sequence while discarding local information. These shared features are called "slow features", which refers to features that vary slowly across time, e.g., the identity of a speaker in an audio signal, an activity carried out in a video, or an object in an image.

2.2. Self-Supervised Learning for Image

With the development of self-supervised learning, especially the wide adoption of contrastive learning, self-supervised methods have shown a promising prospect for image learning [6, 24, 25, 27, 31, 41]. CPC [6], trained with self-supervision on the unlabeled ImageNet dataset and evaluated with a linear classifier, already outperforms the supervised AlexNet [3]; Data-Efficient CPC [25], as an extended work, scales up CPC and achieves a Top 1 accuracy of 71.5% on the ImageNet classification task; it also exhibits high data efficiency when fine-tuning with labelled data compared with fully supervised methods. Deep InfoMax (DIM) [7] learns image representations through internal structure information. A follow-up work of DIM, Augmented Multiscale DIM (AMDIM) [8], utilizes features invariant across data augmentations, e.g., color jittering and random cropping; it achieves a Top 1 accuracy of 68.4% on ImageNet with unsupervised pretraining, evaluated by a linear classification task. Contrastive Multiview Coding (CMC) [26] learns representations using different versions of the same image, e.g., images from different angles, as data transformations which the representation should cope with. In the conventional formulation of contrastive learning, the minibatch size restricts the total number of negative samples; Momentum Contrast (MoCo) [24] effectively lifts this restriction by maintaining a long queue of negative samples, and during training the negative encoder is not updated together with the positive encoder. The experimental results show that MoCo outperforms supervised models in several downstream tasks on different image datasets. Typically, these image downstream tasks need supervised training with labelled images to achieve good results; however, MoCo shows that the performance gap between supervised and unsupervised methods has largely been closed.

2.3. Self-Attention

TriLFrame also adopts the idea of the self-attention mechanism [28], from which the emerging transformer architecture is built. A self-attention operation calculates the response at a position in an input sequence by attending to every position in the sequence and taking a weighted average in the representation space, so that each response is embedded with correlations with every other position regardless of their distance. Self-attention also carries a major merit: the self-attention module can be computed in parallel across positions, which dramatically accelerates the training process. Self-attention, or the transformer, has already become the de facto standard for natural language processing (NLP) tasks [42, 43], and recently many studies have explored its application in computer vision, e.g., object detection [44], image classification [45], video classification [46], and video segmentation [47]. Vision Transformer (ViT) [45] is constructed with pure transformer encoders applied directly to a sequence of image patches; ViT achieves competitive results on image classification tasks. After being trained on large-scale image datasets and transferred to image recognition benchmarks, ViT achieves remarkable results compared to state-of-the-art CNNs. However, ViT has a major drawback: it requires substantially more image data and computational resources to train than CNNs. Thus, in TriLFrame, the self-attention architecture (only the transformer encoder) is applied to a sequence of patch embeddings of an image (preprocessed by a CNN, e.g., ResNet [2]), aiming to learn nonlocal correlations and thereby a semantic structural representation of the image.

2.4. Image Classification

Traditional image classification architectures [2, 3, 48] take advantage of convolutional networks for processing images and achieve remarkable performance. The convolutional network is still the de facto standard for image processing tasks and has been implemented in many practical applications. In recent research, new architectures, e.g., networks using transformers [45, 49, 50] or multilayer perceptrons (MLP) [51–53], are challenging the leading position of CNNs. We also note that works on hybrid CNN-Transformer architectures [54–57] argue that the combination of the local convolutional operation and the nonlocal self-attention operation is the optimal solution for computer vision tasks. All these recent works try to break the limitations of CNNs. In our work, RGB image data is used to train a hybrid CNN-Transformer architecture in a self-supervised manner, and the model is then fine-tuned for image classification tasks; this is also an attempt to explore a new framework for image processing.

3. Self-Supervised Self-Attention Learning Method

In this section, we present the core components and implementation of TriLFrame, which include the learning framework, the sample construction, and the self-attention module.

3.1. Framework

The aim of TriLFrame is to learn a nonlocal semantic representation of an image. The image is first processed by a convolutional operation, and then patches of the latent representation are unfolded for the self-attention operation. As illustrated in Figure 3, TriLFrame takes an RGB image as input and unfolds the latent embedding into a number of patches (16 patches in our experiments); given the former part of the patch sequence, TriLFrame predicts the latter part. We use the latter patches and the predicted patches to construct positive and negative samples for contrastive learning. In the first step, the RGB image $x$ is preprocessed, and a convolutional operation $f_{att}$... no, $f_{enc}$ computes the image embedding:

$z = f_{enc}(x)$,  (1)

where $z$ has both spatial and channel dimensions. As in ViT, we break the latent embedding apart along the spatial dimension to get patch 1 to patch 16, where each patch is denoted $z_i$ and we have $i \in \{1, \dots, 16\}$. These patches are sent to the self-attention function $f_{att}$ to compute nonlocal correlations:

$\tilde{z}_i = f_{att}(z_1, \dots, z_{16})_i$,  (2)

where $i \in \{1, \dots, 16\}$.

Afterwards, the patches are accumulated along the patch sequence by an aggregation function $g_{ar}$ to get a context $c_j$:

$c_j = g_{ar}(\tilde{z}_1, \dots, \tilde{z}_j)$.  (3)

The accumulated context $c_j$ is of the same dimension as $z_i$ and $\tilde{z}_i$. In our initial setting, the context is accumulated over the former part of the patch sequence.

If the feature vectors $\tilde{z}_i$ have embedded semantic structural features, e.g., the key features of image patch $i$, and are aware of the correlations with the other patches, then the accumulated context $c_j$ can predict the embeddings of the remaining patches by using a simple inference function $\varphi$:

$\hat{z}_{j+1} = \varphi(c_j)$,  (4)

where $\hat{z}_{j+1}$ is the predicted embedding of patch $j+1$. As in Seq2Seq [58], we infer future embeddings in a sequential mode: for the prediction of patch $j+2$, the context $c_{j+1}$ accumulates every past embedding including the latest prediction $\hat{z}_{j+1}$, as illustrated in Equation (5):

$c_{j+1} = g_{ar}(\tilde{z}_1, \dots, \tilde{z}_j, \hat{z}_{j+1})$.  (5)

In total, the latter part of the patch sequence is inferred for one image.
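To make the data flow of Equations (1)–(5) concrete, the following is a minimal PyTorch-style sketch of one forward pass. It is an illustration under our own assumptions, not the released implementation: the module names (f_enc, f_att, g_ar, phi) are placeholders, a toy CNN stands in for the ResNet backbone, and a plain GRU stands in for the Convolutional GRU aggregator.

```python
import torch
import torch.nn as nn

class TriLFrameSketch(nn.Module):
    """Illustrative skeleton of the TriLFrame forward pass (Equations (1)-(5))."""

    def __init__(self, dim=256, n_patches=16, n_context=8):
        super().__init__()
        self.n_patches, self.n_context = n_patches, n_context
        # f_enc: a toy CNN standing in for the ResNet backbone; it outputs a
        # 4x4 spatial map with `dim` channels, i.e., 16 patch embeddings.
        self.f_enc = nn.Sequential(nn.Conv2d(3, dim, kernel_size=7, stride=2, padding=3),
                                   nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(4))
        # f_att: transformer encoder without positional encoding (Section 3.3).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.f_att = nn.TransformerEncoder(layer, num_layers=1)
        # g_ar: a plain GRU standing in for the Convolutional GRU aggregator.
        self.g_ar = nn.GRU(dim, dim, batch_first=True)
        # phi: a simple MLP used as the inference (prediction) function.
        self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        z = self.f_enc(x)                                   # Eq. (1): (B, dim, 4, 4)
        z = z.flatten(2).transpose(1, 2)                    # 16 patches: (B, 16, dim)
        z_tilde = self.f_att(z)                             # Eq. (2): nonlocal correlations
        seq = list(z_tilde[:, :self.n_context].unbind(1))   # former part of the sequence
        preds = []
        for _ in range(self.n_patches - self.n_context):
            _, c = self.g_ar(torch.stack(seq, dim=1))       # Eq. (3)/(5): aggregated context
            z_hat = self.phi(c[-1])                         # Eq. (4): predict the next patch
            preds.append(z_hat)
            seq.append(z_hat)                               # progressive prediction
        return z, torch.stack(preds, dim=1)                 # ground-truth and predicted patches


# Usage: one forward pass over a dummy minibatch of RGB images.
gt, pred = TriLFrameSketch()(torch.randn(2, 3, 128, 128))   # gt: (2, 16, 256), pred: (2, 8, 256)
```

The loop realizes the progressive prediction: each predicted patch embedding is appended to the context sequence before the next patch is inferred.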

3.2. Contrastive Learning

In TriLFrame, contrastive learning is implemented by discriminating the positive "Predicted Patch - Ground-truth Patch" sample pair (named the "Pred-GT" sample pair) from negative Pred-GT sample pairs. Following the idea of NCE [34] and CPC [6], an NCE variant is adopted as our contrastive loss for contrastive image learning. The proposed contrastive loss draws the predicted patches closer to the ground-truth patches, while the members of a Pred-GT sample pair do not need to be identical; i.e., the model only needs to learn the nonlocal semantic structural representation without paying attention to pixel-level details or noise.

As illustrated in Figure 4, the red arrow line connects the only positive sample pair, and the dashed black arrow lines show two negative sample pairs constructed by (1) the predicted patch embedding and a random ground-truth patch embedding of the same image and (2) the predicted patch embedding and a patch embedding of another image from the same minibatch. In TriLFrame, we break down an image representation into 16 patches; for the $i$-th image patch, the ground-truth latent embedding $z_i$ is coupled with its predicted latent embedding $\hat{z}_i$, and both embeddings are of the same dimension. As illustrated in Figure 4, we construct positive sample pairs with a prediction embedding and its corresponding ground-truth embedding, and negative sample pairs with a prediction embedding and ground-truth embeddings at other spatial positions of the same image. We further utilize patch embeddings of other images from the same minibatch to produce more negative Pred-GT pairs for contrastive learning.

The similarity score of a Pred-GT sample pair is calculated by the dot product

$s_{i,k} = \hat{z}_i^{\top} z_k$,  (6)

where $i$ and $k$ denote the $i$-th and $k$-th patch. Hence, TriLFrame optimizes the contrastive loss

$\mathcal{L} = -\sum_i \log \dfrac{\exp(\hat{z}_i^{\top} z_i)}{\sum_k \exp(\hat{z}_i^{\top} z_k)}$.  (7)

The loss function in Equation (7) is essentially the cross-entropy loss for distinguishing the positive Pred-GT sample pair from the negative sample pairs. When training with minibatches, we define the following types of negative Pred-GT sample pairs (a minimal loss sketch follows the list):

(i) Easy Negatives. In the same minibatch, easy negatives are Pred-GT sample pairs taken from two different images. Easy negatives are relatively easy to discriminate in general, but similar patches may exist in different images; for instance, two image patches may both contain a football.

(ii) Spatial Negatives. In the same image, spatial negatives are constructed with a predicted patch embedding and a ground-truth patch embedding at a different spatial location in the image, i.e., $(\hat{z}_i, z_k)$ pairs with $i \neq k$.
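The sketch below illustrates this loss under our assumptions: all Pred-GT dot products within a minibatch are computed in one matrix, so each prediction is scored against its own ground-truth patch (the positive pair), the other patches of the same image (spatial negatives), and the patches of other images in the minibatch (easy negatives). Tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred, gt):
    """pred: (B, P, D) predicted patch embeddings; gt: (B, P, D) ground-truth embeddings.

    Every (image, patch) prediction is contrasted against every (image, patch)
    ground truth in the minibatch: diagonal entries are the positive pairs,
    off-diagonal entries within an image are spatial negatives, and entries
    across images are easy negatives.
    """
    B, P, D = pred.shape
    pred = pred.reshape(B * P, D)
    gt = gt.reshape(B * P, D)
    scores = pred @ gt.t()                               # (B*P, B*P) dot-product similarities
    target = torch.arange(B * P, device=scores.device)   # index of each positive pair
    return F.cross_entropy(scores, target)


# Usage with dummy embeddings: 4 images, 8 predicted patches each, 256-dimensional.
pred = torch.randn(4, 8, 256, requires_grad=True)
gt = torch.randn(4, 8, 256)
print(contrastive_loss(pred, gt))
```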

3.3. Transformer Encoder

We implement the conventional transformer architecture [28] as the transformer in TriLFrame, except that the transformer decoder and the positional encoding are discarded. Although it has been shown that positional embeddings make the self-attention operation aware of sequential information to some degree [44, 45], when using self-attention on image patches our goal is to embed nonlocal correlations between the patches of an image, so awareness of the sequential order of the patches is not important.

As illustrated in Figure 5, the conventional transformer encoder operates on a one-dimensional sequence of patch embeddings. To make the transformer applicable to the three-dimensional latent embedding of an image, we break the image embedding apart along the spatial dimension into a series of one-dimensional patches $z_1, \dots, z_{16}$. The transformer encoder takes all 16 patches as input. In our implementation, the transformer encoder layer is repeated $L$ times, with a shared feature vector dimension at all layers; each layer has one multihead self-attention operation and one MLP operation (with 2 attention heads and $L = 1$ in our experiments, see Section 4.1.1). After the transformer encoder, we get the outputs $\tilde{z}_1, \dots, \tilde{z}_{16}$. Through the transformer encoder operation, each patch of the image is linked to every other patch; thus, the nonlocal spatial dependencies are computed.
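The snippet below (illustrative, using PyTorch's stock transformer encoder rather than our implementation) applies the encoder to the 16 patch embeddings with the hyperparameters reported in Section 4.1.1; it also checks that, without positional encoding, the encoder is permutation-equivariant, which is why the sequential order of the patches does not matter.

```python
import torch
import torch.nn as nn

# 16 patch embeddings per image, 256-dimensional, as produced by the CNN backbone.
patches = torch.randn(8, 16, 256)

# 1 encoder layer with 2 attention heads; no positional encoding is added.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=2, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1).eval()

with torch.no_grad():
    out = encoder(patches)                       # every patch attends to every other patch

    # Without positional encoding, shuffling the patch order simply shuffles
    # the outputs in the same way (permutation equivariance).
    perm = torch.randperm(16)
    print(torch.allclose(encoder(patches[:, perm]), out[:, perm], atol=1e-5))  # True
```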

3.4. Image Processing Workflow

Due to the diversity of images generated by manufacturing systems, e.g., images of different resolutions, images from different angles, panoramas or close shots, grey-scale or RGB images, infrared images, and medical images (magnetic resonance imaging images, CT images), image data must be preprocessed before self-supervised learning. To help the model learn nonlocal semantics, we apply per-image augmentation to every image in a minibatch during self-supervised training, such as color jittering, which includes random contrast, random brightness, random hue, random saturation, and random greyscale. It is noted that, by introducing self-attention, TriLFrame, in contrast to CPC, does not require image patches to overlap; this effectively prevents the network from performing feature extrapolation as its prediction.
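A possible per-image augmentation pipeline along these lines, written with torchvision, is sketched below; the specific parameter values are our own assumptions and not the exact settings used in the paper.

```python
from torchvision import transforms

# Per-image augmentation applied during self-supervised training (values are illustrative).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                      # random crop
    transforms.RandomHorizontalFlip(),                      # random flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4,    # color jittering: brightness,
                           saturation=0.4, hue=0.1),        # contrast, saturation, hue
    transforms.RandomGrayscale(p=0.2),                      # random greyscale
    transforms.ToTensor(),
])
```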

In contrast to the one-off prediction of the latter patches presented in [6], we implement a successive predictive mechanism (i.e., latter patches are predicted in a progressive manner). As described in Equations (4) and (5), all previous context of the image (an aggregated context) is utilized to make the next inference. This successive prediction process ensures that the model makes use of every previous image patch when predicting the next patch embedding.

Batch normalization (BN) [59] is a conventional practice in deep neural networks; however, it is not adopted in CPC [6]. We argue that BN is necessary in TriLFrame: it gives a 2%-4% accuracy improvement in the classification tasks we performed, and it is difficult to train a hybrid CNN-Transformer network without normalization in either the self-supervised training stage or the supervised fine-tuning stage. In this paper, BN is adopted for the convolutional function $f_{enc}$ and the transformer encoder.

4. Experiments and Analysis

We show the experiment setups and the self-supervised training procedure in this section, and then we present the ablation study of TriLFrame and the evaluation of the model.

4.1. Experiment Setting
4.1.1. Network Architecture

A conventional ResNet [2] is implemented as the convolutional operation $f_{enc}$; the ResNet consists of four residual blocks and ends with a final channel dimension of 256. We use the output of the fourth residual block as the input to the transformer encoder. In our experiments, ResNet18 is implemented. After the ResNet encoder, the latent representation of an image is cut into a sequence of patches and then processed by the transformer as in [28], without the positional encoding module. Taking into account the number of image patches, we set the number of attention heads and encoder layers to 2 and 1, respectively, in our experiments. This setting of the transformer also forces the encoder to learn a better-quality semantic structural representation of the image; i.e., in order to train a strong feature encoder, a weak self-attention operation and a weak aggregation operation are preferable. Thus, we apply a simple Convolutional Gated Recurrent Unit with the smallest kernel size (1, 1) as our aggregation operation $g_{ar}$. For inference, one simple MLP is applied in a progressive manner.
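A rough wiring of these components is sketched below. The ResNet18 truncation, the 1x1 projection to 256 channels, and the use of a plain GRU in place of the (1, 1)-kernel Convolutional GRU are our own simplifying assumptions; the attention-head and layer counts follow the settings above.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# f_enc: ResNet18 without its classification head; a 1x1 convolution projects the
# 512-channel output of the last residual block to 256 channels, and adaptive
# pooling yields a 4x4 map, i.e., 16 patch embeddings (both are assumptions).
backbone = resnet18(weights=None)
f_enc = nn.Sequential(*list(backbone.children())[:-2],
                      nn.Conv2d(512, 256, kernel_size=1),
                      nn.AdaptiveAvgPool2d(4))

# f_att: 1 transformer encoder layer with 2 attention heads.
f_att = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=2, batch_first=True), num_layers=1)

# g_ar: a plain GRU standing in for the Convolutional GRU aggregator.
g_ar = nn.GRU(256, 256, batch_first=True)

# phi: one simple MLP applied in a progressive manner for prediction.
phi = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

x = torch.randn(2, 3, 224, 224)
patches = f_enc(x).flatten(2).transpose(1, 2)   # (2, 16, 256) patch sequence for f_att
```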

4.1.2. Self-Supervised Training

In our experiments, we use the ILSVRC ImageNet competition dataset [30] for self-supervised training. The ImageNet dataset has been used to evaluate unsupervised vision models by many works. Before the encoder function $f_{enc}$, images are preprocessed for data augmentation; we apply random greyscale, random flip, random crop, and color jittering to each image before feeding it to ResNet18. These augmentations help the network to avoid shortcuts, i.e., feature extrapolation, as discussed in Section 3.4. For self-supervision, we train TriLFrame end-to-end with the Adam optimizer; we start with an initial learning rate of $10^{-3}$ and a weight decay rate of $10^{-5}$. The learning rate is decayed by a linear function every 100 epochs and settles at the rate of $10^{-5}$.
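A minimal training-loop sketch with these optimizer settings is shown below, reusing the TriLFrameSketch module and contrastive_loss function from the earlier sketches. The dummy dataloader and the step-wise decay (two 10x decays, so the rate settles at 1e-5) are assumptions; the paper describes a linear decay.

```python
import torch

model = TriLFrameSketch()                                    # sketch module from Section 3.1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

# Decay the learning rate every 100 epochs; after two decays it settles at 1e-5.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

# Stand-in for the augmented ImageNet dataloader.
loader = [torch.randn(4, 3, 128, 128) for _ in range(10)]

for epoch in range(300):
    for images in loader:
        gt, pred = model(images)                             # ground-truth and predicted patches
        loss = contrastive_loss(pred, gt[:, 8:])             # contrast against the latter patches
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```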

4.2. Evaluation Methods
4.2.1. Self-Supervised Learning Evaluation

TriLFrame is first trained with self-supervision on the ILSVRC ImageNet dataset. Self-supervised training is initially evaluated by the validation Top 1 accuracy, i.e., the Top 1 accuracy of classifying the positive Pred-GT sample pairs from the others on the validation set. A high validation accuracy in self-supervision indicates that the model learns a good distribution of image embeddings. After self-supervised training, TriLFrame is further evaluated on downstream tasks, especially the image classification task on the ILSVRC ImageNet. TriLFrame is fine-tuned with a simple classification layer on the ILSVRC ImageNet in a supervised manner; after the model converges, TriLFrame is evaluated on the classification task on the ILSVRC ImageNet. We report all accuracy results as Top 1 accuracies.

4.2.2. Image Classification

Image classification is an important task for evaluating self-supervised image learning approaches; thus, we use the image classification task to evaluate TriLFrame.

After contrastive learning, the model should be able to encode the semantic structural representation of an input image from any source in manufacturing systems, and the image representation can then be used in a classification task. The last aggregated context representation is utilized to construct the image classification network as follows. In the first stage, we encode an image with the convolutional operation $f_{enc}$ to get the latent embedding $z$, which is subsequently broken into a sequence of non-overlapping image patches. The patch sequence is then sent to the self-attention operation $f_{att}$ to capture nonlocal dependencies, and finally, we use the aggregation function $g_{ar}$ to aggregate the whole sequence of patch embeddings into a context representation $c$, which is a feature vector of the image. In the second stage, the representation $c$ is passed to an FC layer and a Softmax function to get the probabilities for image classification. The classification network is trained with the Adam optimizer; we start with a learning rate of $10^{-3}$ and a weight decay rate of $10^{-3}$. Because TriLFrame is fine-tuned for image classification, the initial learning rate of the convolutional operation $f_{enc}$, the self-attention operation $f_{att}$, and the aggregation operation $g_{ar}$ is set to $10^{-4}$. At the prediction stage, an image is preprocessed in the same way except that random crop is not applied. The final classification result is given by the Softmax probability.
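The sketch below shows how this fine-tuning could be set up, reusing the f_enc, f_att, and g_ar modules from the sketch in Section 4.1.1; the classifier and the two-learning-rate parameter groups mirror the description above, while the module names remain illustrative.

```python
import torch
import torch.nn as nn

classifier = nn.Linear(256, 1000)              # FC layer over the aggregated context (ImageNet classes)

# Pretrained modules are fine-tuned with a smaller learning rate (1e-4) than the
# freshly initialized classification layer (1e-3); weight decay is 1e-3.
optimizer = torch.optim.Adam([
    {"params": f_enc.parameters(), "lr": 1e-4},
    {"params": f_att.parameters(), "lr": 1e-4},
    {"params": g_ar.parameters(), "lr": 1e-4},
    {"params": classifier.parameters(), "lr": 1e-3},
], weight_decay=1e-3)

def classify(x):
    z = f_enc(x).flatten(2).transpose(1, 2)    # stage 1: CNN patch embeddings
    z_tilde = f_att(z)                         # nonlocal self-attention over the patches
    _, c = g_ar(z_tilde)                       # aggregate the whole patch sequence into a context
    return classifier(c[-1]).softmax(dim=-1)   # stage 2: FC layer + Softmax probabilities

probs = classify(torch.randn(2, 3, 224, 224))  # (2, 1000) class probabilities
```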

4.3. Performance Analysis
4.3.1. Ablation Study

We conduct several ablation studies on the TriLFrame architecture, especially on the backbone encoder and the transformer encoder module, to show the contribution of each module of TriLFrame and the effectiveness of deeper convolutional encoders. The ablation study is first conducted with ResNet18 as the encoder, and the results are given in the upper part of Table 1. The baseline model is set up with random initialization and trained only with supervision for image classification on the ILSVRC ImageNet dataset. TriLFrame is pretrained with contrastive learning and fine-tuned with supervised learning on ILSVRC. We observe that the Top 1 accuracy increases from 54.1% to 75.6% after TriLFrame is pretrained with self-supervision. Moreover, after removing the transformer module $f_{att}$, the Top 1 accuracy incurs a significant drop, from 75.6% to 66.8%. This ablation study demonstrates that the TriLFrame framework is effective in capturing the semantic representation of an image and that the self-attention module plays an important role in the framework.

We also try deeper convolutional networks, e.g., ResNet50 and ResNet101. The results are given in the lower part of Table 1. They empirically show that deeper convolutional encoders contribute to better self-supervised accuracy as well as better image classification accuracy. When adopting ResNet101 as the backbone encoder, we observe that the Top 1 classification accuracy reaches 81.2%, which is a considerable increase compared with 75.6% by ResNet18 and 78.3% by ResNet50. This ablation study shows that deeper convolutional encoders play a more effective role in the TriLFrame architecture.

4.3.2. Self-Supervision Accuracy Compared with Classification Accuracy

We also conduct experiments to show the correlation between the self-supervision and image classification tasks. During self-supervised training of TriLFrame on the ILSVRC dataset, we perform several early stops at different self-supervision accuracies and fine-tune with supervision for image classification. We then examine the relationship between self-supervision accuracy and image classification accuracy. For simplicity, the correlation experiments are conducted with ResNet18 as the backbone encoder. Figure 6 shows our findings.

As shown in Figure 6, TriLFrame is trained with self-supervision on the ILSVRC dataset and is early-stopped at validation Top 1 accuracies of {54.8%, 64.0%, 71.4%, 80.9%}; for each early stop, the model is then fine-tuned with supervision for image classification. It is clear that the performance of TriLFrame on the downstream classification task depends on the self-supervision accuracy, i.e., a TriLFrame model with higher self-supervision accuracy yields higher image classification accuracy. This correlation between self-supervision accuracy and classification accuracy shows that the image representation learnt in self-supervised training effectively captures semantic structural features that are generalizable and can be used in downstream classification tasks.

4.4. Comparison with State-of-the-Art Methods

We show the comparison with state-of-the-art self-supervised methods using linear probing. The results of the state-of-the-art methods are reported by the implementations in [39, 60, 61]. For a fair comparison, all methods are pretrained with image crops from the ILSVRC ImageNet dataset, and data augmentation is applied accordingly. Note that we try ResNet18, ResNet50, and ResNet101 as the backbone encoder in TriLFrame to show comparable results, which differs from the other model settings. The results are given in Table 2. They show that, first, the proposed TriLFrame framework outperforms the state-of-the-art methods with 81.2% Top 1 accuracy, but it has a relatively large parameter count of 485 M compared with the other methods listed in the table. Second, when constructed with the shallow convolutional network ResNet18, TriLFrame is quite a light-weight architecture of only 75 M parameters, which is significantly less than other methods with close performance (CMC has 94 M parameters but only 70.6% accuracy). This shows that the hybrid CNN-Transformer architecture, which combines the local convolutional operation and the nonlocal self-attention operation, is effective and efficient in capturing nonlocal semantic image features.

4.5. Transfer to Other Image Classification Tasks

Another important experiment to test whether self-supervised learning captures key semantic features is transfer learning. The model is first trained with self-supervision on ImageNet; once the model converges, we follow the classification model setting described in Section 4.2.2, except that the model is fine-tuned end-to-end for transfer learning. Specifically, the parameters learned from self-supervision on the ImageNet dataset are used to initialize the classification model, and then the entire model is trained with supervised learning on other image datasets. We follow the transfer learning settings and evaluation protocol in [39], and we use the image datasets CIFAR [62], VOC2007 [63], Pets [64], and Flowers [65]. If the features learned from self-supervision are generic and contain key semantic features, then they should be helpful on the other image datasets mentioned above. Please note that TriLFrame uses ResNet50 as the encoder here, which matches the conventional architecture setting. The transfer learning results are given in Table 3.

Although TriLFrame surpasses other methods in only one of the five classification tasks, it achieves competitive performance compared with state-of-the-art methods. We note that TriLFrame performs better with larger datasets; for example, TriLFrame achieves the highest accuracy on CIFAR100 and the second highest accuracy on CIFAR10, and both datasets have more training samples than VOC2007, Pets, and Flowers. This characteristic of TriLFrame accords with ViT, which also requires a large amount of training data to reach competitive performance.

4.6. Discussion

Through the ablation study on the TriLFrame framework and the comparison with SOTA models, we believe that the following factors contribute to the achievements of TriLFrame. First, the contrastive self-supervised training process forces the model to learn a strong semantic structural embedding of the image by predicting nondeterministic image patch embeddings given previous context knowledge of the image. Second, the hybrid CNN-Transformer architecture, which combines the local convolutional operation and the nonlocal self-attention operation, is effective and efficient in capturing nonlocal semantic image features. Third, the framework design of TriLFrame, which makes use of a convolutional operation, a self-attention operation, and an aggregation operation, performs well and suits image learning tasks.

5. Conclusion

In this paper, we propose a novel self-supervised self-attention framework, TriLFrame, for image representation learning in manufacturing systems. TriLFrame combines the powerful local convolutional operation and the long-range nonlocal self-attention operation, and it learns image representations by contrasting predicted image patches with ground-truth image patches. We show that the proposed TriLFrame achieves state-of-the-art performance on the image classification task on ImageNet, with a Top 1 accuracy of 81.2%; TriLFrame also achieves competitive performance with a light-weight architecture of only 75 M parameters. When tested in transfer learning, TriLFrame proves reliable in capturing semantic features for image classification tasks.

This work shows that TriLFrame has a promising future in image-related tasks in manufacturing systems; it can be quickly transferred and deployed to applications such as anomaly detection, medical diagnosis, and road analysis. Nevertheless, TriLFrame requires extra supervision information (e.g., image labels and segmentation information) if it is to be deployed in manufacturing systems for a specific task, and this supervision information may require careful design and a significant amount of effort. Therefore, we hope the proposed TriLFrame can serve as a baseline or backbone framework for solving image-related tasks in manufacturing systems in the future.

Data Availability

The data that support the findings of this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This work was supported by the Education Department of Jiangxi Province of China (No. GJJ204912) and the Science and Technology Bureau of Ganzhou City of China (No. [2020]60).