Abstract

Texture classification plays an important role in various computer vision tasks. Owing to their powerful feature extraction capability, convolutional neural network (CNN)-based texture classification methods have attracted extensive attention. However, there still exist many challenges, such as the extraction of multilevel texture features and the exploration of multidirectional relationships. To address these problems, this paper proposes the compressed wavelet tensor attention capsule network (CWTACapsNet), which integrates multiscale wavelet decomposition, tensor attention blocks, and quantization techniques into the framework of the capsule neural network. Specifically, the multilevel wavelet decomposition is in charge of extracting multiscale spectral features in the frequency domain; the tensor attention blocks explore the multidimensional dependencies of convolutional feature channels; and the quantization techniques reduce the computational and storage complexities to meet edge computing requirements. The proposed CWTACapsNet provides an efficient way to explore spatial domain features, frequency domain features, and their dependencies, which are useful for most texture classification tasks. Furthermore, CWTACapsNet benefits from quantization techniques and is suitable for edge computing applications. Experimental results on several texture datasets show that the proposed CWTACapsNet outperforms state-of-the-art texture classification methods not only in accuracy but also in robustness.

1. Introduction

Texture classification is crucial in pattern recognition and computer vision [1–5]. Since many very sophisticated classifiers exist, the key challenge lies in developing effective features to extract from a given textured image [6]. Many methods have been proposed to represent texture features [7, 8]; about 51 different sets of texture features are summarized in [9]. These texture features are generally hand-crafted under some hypothesis about texture characteristics. Because different texture datasets contain different types of textures, the performance of hand-crafted features usually varies across datasets [4].

Recently, texture representation methods based on CNNs have achieved powerful representation capability [6, 10–13]. These CNN-based methods implement texture feature extraction in an end-to-end way that does not require predefined representation formulas. Moreover, Fujieda et al. [11] find that integrating the wavelet transform into a CNN can effectively capture spectral information of texture images. Nevertheless, there still exist many challenges, such as extracting multilevel texture features and capturing sufficient relationships [11]. Pooling operations of CNN-based methods proactively discard substantial information, which prevents efficient exploration of texture feature relationships [14, 15]. In contrast, the capsule neural network (CapsNets), which implements a dynamic routing algorithm instead of the traditional pooling mechanism, can probably get rid of this weakness. In addition, CapsNets replaces the scalar outputs of CNNs with more informative vector outputs obtained by the squashing activation function. CapsNets has several advantages, such as relationship awareness and stable generalization capability [16, 17].

The attention mechanism [18] is proposed to help models focus on more relevant regions, capture complex correlations, and discover new patterns within images. Therefore, the integration of the attention mechanism and CapsNets has great potential to represent texture features and explore their relationships sufficiently. The key problem preventing the attention mechanism and CapsNets from being applied in the edge computing domain is that both suffer from heavy computation and memory burdens. It is essential to consider quantization techniques for deploying models on edge devices [19–33]. This paper proposes the compressed wavelet tensor attention capsule network (CWTACapsNet) that integrates multilevel wavelet decomposition, the tensor attention mechanism, and quantization techniques into the capsule network.

The proposed CWTACapsNet involves several compressed multiscale tensor self-attention blocks that can capture multidirectional dependencies across different channels. Furthermore, CWTACapsNet utilizes the Nyström technique and a quantized dynamic routing process to reduce resource requirements. The main contributions of CWTACapsNet are threefold. First, it uses the multilevel wavelet transform to extract multiscale spectral features in the frequency domain, which further extends texture representation capability. Second, it employs a tensor attention mechanism via matricization to explore the multidirectional dependencies of texture features at different scales. Third, it employs quantization techniques to reduce the computation and memory costs without sacrificing accuracy.

The rest of the paper is organized as follows. Section 2 presents the whole architecture and key parts of the proposed CWTACapsNet. Section 3 presents validation experiments and discusses the experimental results. The conclusion is drawn in Section 4.

2. Compressed Wavelet Tensor Attention Capsule Network

The proposed CWTACapsNet integrates multiscale wavelet decomposition and tensor self-attention blocks into a capsule network. The architecture of CWTACapsNet is shown in Figure 1. CWTACapsNet involves the wavelet feature extraction block, the compressed multiscale tensor self-attention block, and the quantized capsule network. The wavelet feature extraction block extracts multiscale spectral features with multilevel wavelet decomposition. The compressed tensor self-attention block captures the multidirectional relationships within each scale, and the primary capsules are generated based on the wavelet and tensor attentive information.

2.1. Multiscale Feature Extraction via Wavelet Decomposition

Given an image $\mathbf{x}$, we utilize the 2D discrete wavelet transform (DWT) [34] with four convolutional filters, i.e., the low-pass filter $f_{LL}$ and the high-pass filters $f_{LH}$, $f_{HL}$, and $f_{HH}$, to decompose $\mathbf{x}$ into four subband images, i.e., $\mathbf{x}_{LL}$, $\mathbf{x}_{LH}$, $\mathbf{x}_{HL}$, and $\mathbf{x}_{HH}$. The convolutional stride is 2. The four filters are defined by

$$f_{LL}=\begin{bmatrix}1 & 1\\ 1 & 1\end{bmatrix},\quad f_{LH}=\begin{bmatrix}-1 & -1\\ 1 & 1\end{bmatrix},\quad f_{HL}=\begin{bmatrix}-1 & 1\\ -1 & 1\end{bmatrix},\quad f_{HH}=\begin{bmatrix}1 & -1\\ -1 & 1\end{bmatrix}. \qquad (1)$$

The four filters (see equation (1)) are orthogonal to each other and form a $4\times 4$ invertible matrix. The DWT operation is given by

$$\mathbf{x}_{LL}=\left(f_{LL}\otimes\mathbf{x}\right)\downarrow_{2},\quad \mathbf{x}_{LH}=\left(f_{LH}\otimes\mathbf{x}\right)\downarrow_{2},\quad \mathbf{x}_{HL}=\left(f_{HL}\otimes\mathbf{x}\right)\downarrow_{2},\quad \mathbf{x}_{HH}=\left(f_{HH}\otimes\mathbf{x}\right)\downarrow_{2}, \qquad (2)$$

where $\otimes$ denotes the convolution operator and $\downarrow_{2}$ denotes downsampling with stride 2. The $(i,j)$-th values of $\mathbf{x}_{LL}$ and $\mathbf{x}_{LH}$ after the 2D Haar transform [19] are given by

$$\begin{aligned}\mathbf{x}_{LL}(i,j)&=\mathbf{x}(2i-1,2j-1)+\mathbf{x}(2i-1,2j)+\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j),\\ \mathbf{x}_{LH}(i,j)&=-\mathbf{x}(2i-1,2j-1)-\mathbf{x}(2i-1,2j)+\mathbf{x}(2i,2j-1)+\mathbf{x}(2i,2j).\end{aligned} \qquad (3)$$

Based on the multilevel wavelet packet transform [35], the subband images are recursively decomposed by the DWT. Because the downsampling stride is 2, the sizes of the extracted subband images are halved at each wavelet decomposition level. In addition, upsampling operations (with stride 2) are employed to guarantee the size consistency of the convolution feature maps for tensor concatenation.
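To make the decomposition concrete, the following NumPy sketch implements one level of the stride-2 Haar decomposition in equations (1)–(3) and applies it recursively; the function names are illustrative, and only the low-pass subband is recursively decomposed here for brevity (the full wavelet packet transform would decompose every subband).

```python
import numpy as np

def haar_dwt_level(x):
    """One level of 2D Haar DWT via 2x2 stride-2 filtering (illustrative sketch).

    x: 2D array with even height and width.
    Returns the four subbands (x_LL, x_LH, x_HL, x_HH), each half the size of x.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    x_ll = a + b + c + d    # low-pass: local sum (unnormalized)
    x_lh = -a - b + c + d   # horizontal detail
    x_hl = -a + b - c + d   # vertical detail
    x_hh = a - b - c + d    # diagonal detail
    return x_ll, x_lh, x_hl, x_hh

def multilevel_haar(x, levels=3):
    """Recursively decompose the low-pass subband for the given number of levels."""
    subbands = []
    current = x
    for _ in range(levels):
        x_ll, x_lh, x_hl, x_hh = haar_dwt_level(current)
        subbands.append((x_ll, x_lh, x_hl, x_hh))
        current = x_ll
    return subbands

if __name__ == "__main__":
    img = np.random.rand(256, 256).astype(np.float32)
    levels = multilevel_haar(img, levels=3)
    print([s[0].shape for s in levels])  # (128, 128), (64, 64), (32, 32)
```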

2.2. Compressed Tensor Self-Attention Block

Inspired by [36], we design the compressed tensor self-attention block based on matricization and the Nyström technique. The matricization can capture interdependencies along all dimensions of the tensorized convolution feature maps. To reduce the computational and storage requirements of attention computation, we use the Nyström technique to obtain an approximate solution, which relieves the resource burden of inference and speeds it up significantly.

These tensorized convolution feature maps are generated based on the wavelet-extracted features. The input 3rd-order tensor can be viewed as a combination of its three mode matricizations. Combining their outputs allows the compressed tensor self-attention block to make use of interchannel and intrachannel interdependencies. Moreover, the Nyström-based self-attention module involved in the compressed tensor self-attention block implements the self-attention computation to explore dependencies along the corresponding mode in a more efficient way. The architectures of the compressed tensor self-attention block and the Nyström-based self-attention module are shown in Figure 2.

A mode-$n$ fiber of the 3rd-order input tensor $\mathcal{X}\in\mathbb{R}^{I_{1}\times I_{2}\times I_{3}}$, $n\in\{1,2,3\}$, is the vector obtained by fixing all indices of $\mathcal{X}$ except the $n$th one and can be seen as a generalization of a matrix's rows and columns. The mode-$n$ matricization of the 3rd-order tensor $\mathcal{X}$, denoted as $\mathbf{X}_{(n)}$, arranges the mode-$n$ fibers of $\mathcal{X}$ to be the columns of the resulting matrix.
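A minimal NumPy illustration of mode-n matricization and its inverse for a 3rd-order tensor; the `unfold`/`fold` helpers and the column ordering are our own convention, not the paper's.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization of a 3rd-order tensor (mode in {0, 1, 2}).

    The mode-n fibers become the columns of the resulting matrix.
    """
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold: rearrange the matrix back into a tensor of `shape`."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

X = np.arange(2 * 3 * 4).reshape(2, 3, 4).astype(float)
X1 = unfold(X, 0)   # shape (2, 12): mode-1 matricization
X2 = unfold(X, 1)   # shape (3, 8)
X3 = unfold(X, 2)   # shape (4, 6)
assert np.allclose(fold(X1, 0, X.shape), X)
```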

To simplify notation, we omit the mode subscript. Let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the input matrix of the self-attention module; it is projected using three weight matrices $\mathbf{W}_{Q}$, $\mathbf{W}_{K}$, and $\mathbf{W}_{V}$ to extract the feature representations $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ as follows:

$$\mathbf{Q}=\mathbf{X}\mathbf{W}_{Q},\quad \mathbf{K}=\mathbf{X}\mathbf{W}_{K},\quad \mathbf{V}=\mathbf{X}\mathbf{W}_{V}. \qquad (4)$$

The output of the self-attention module is computed by

$$\mathbf{O}=\gamma\,\sigma\!\left(\mathbf{Q}\mathbf{K}^{\top}\right)\mathbf{V}, \qquad (5)$$

where $\gamma$ denotes the learnable coefficient and $\sigma(\cdot)$ denotes a row-wise softmax normalization function. Then, the tensor $\mathbf{Y}$ is generated by reshaping $\mathbf{O}$ into tensor form.
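For reference, the dense (unapproximated) self-attention of equations (4)–(5) can be sketched as follows; the random initialization and the value of the learnable coefficient are illustrative assumptions.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv, gamma=0.1):
    """Dense self-attention: O = gamma * softmax(Q K^T) V (cf. equations (4)-(5))."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = softmax_rows(Q @ K.T)   # (n, n) attention map -> O(n^2) memory and time
    return gamma * S @ V

rng = np.random.default_rng(0)
n, d = 64, 16
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
O = self_attention(X, Wq, Wk, Wv)
print(O.shape)  # (64, 16)
```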

As shown in equation (5), the self-attention mechanism requires calculating similarity scores between each pair of vectors, resulting in a complexity of $O(n^{2})$ for both memory and time. Due to this quadratic dependence on the input length, the application of self-attention is limited to small matrices on edge devices, and it is necessary to reduce the resource burden. Inspired by [37], we utilize the Nyström technique to build a resource-efficient self-attention module. We rewrite the softmax operation in equation (5) as follows:

$$\mathbf{S}=\sigma\!\left(\mathbf{Q}\mathbf{K}^{\top}\right)=\begin{bmatrix}\mathbf{A}_{S} & \mathbf{B}_{S}\\ \mathbf{F}_{S} & \mathbf{C}_{S}\end{bmatrix}, \qquad (6)$$

where $\mathbf{A}_{S}\in\mathbb{R}^{m\times m}$, $\mathbf{B}_{S}\in\mathbb{R}^{m\times(n-m)}$, $\mathbf{F}_{S}\in\mathbb{R}^{(n-m)\times m}$, and $\mathbf{C}_{S}\in\mathbb{R}^{(n-m)\times(n-m)}$. $\mathbf{A}_{S}$ denotes the selected matrix generated by sampling $m$ columns and $m$ rows from matrix $\mathbf{S}$ via some adaptive sampling strategy [38].

According to the Nyström method [37, 38], $\mathbf{S}$ can be approximated by

$$\hat{\mathbf{S}}=\begin{bmatrix}\mathbf{A}_{S}\\ \mathbf{F}_{S}\end{bmatrix}\mathbf{A}_{S}^{+}\begin{bmatrix}\mathbf{A}_{S} & \mathbf{B}_{S}\end{bmatrix}, \qquad (7)$$

where $\mathbf{A}_{S}^{+}$ denotes the pseudoinverse (Moore–Penrose inverse) of $\mathbf{A}_{S}$. $\mathbf{C}_{S}$ is approximated by $\mathbf{F}_{S}\mathbf{A}_{S}^{+}\mathbf{B}_{S}$.

The SVD of $\mathbf{A}_{S}$ can be written as $\mathbf{A}_{S}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^{\top}$, where $\mathbf{U}$ and $\mathbf{V}$ denote orthogonal unitary matrices and $\boldsymbol{\Lambda}$ denotes the diagonal matrix whose diagonal elements are the corresponding singular values of $\mathbf{A}_{S}$. Then, the pseudoinverse $\mathbf{A}_{S}^{+}$ can be computed by

$$\mathbf{A}_{S}^{+}=\mathbf{V}\boldsymbol{\Lambda}^{+}\mathbf{U}^{\top}. \qquad (8)$$
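The pseudoinverse in equation (8) follows directly from the SVD; a brief NumPy check (numpy.linalg.pinv performs the same computation internally):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))

U, s, Vt = np.linalg.svd(A)              # A = U diag(s) V^T
s_inv = np.where(s > 1e-10, 1.0 / s, 0)  # invert nonzero singular values only
A_pinv = Vt.T @ np.diag(s_inv) @ U.T     # equation (8): A^+ = V Lambda^+ U^T

assert np.allclose(A_pinv, np.linalg.pinv(A))
```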

Substituting equation (8) into (7), we obtain

$$\hat{\mathbf{S}}=\begin{bmatrix}\mathbf{A}_{S}\\ \mathbf{F}_{S}\end{bmatrix}\mathbf{V}\boldsymbol{\Lambda}^{+}\mathbf{U}^{\top}\begin{bmatrix}\mathbf{A}_{S} & \mathbf{B}_{S}\end{bmatrix}. \qquad (9)$$

From equations (6)–(9), we can find that computing $\hat{\mathbf{S}}$ still requires all entries of $\mathbf{Q}\mathbf{K}^{\top}$ due to the softmax function, even though the approximation only needs to access a subset of the columns and rows of $\mathbf{S}$, e.g., $\left[\mathbf{A}_{S};\mathbf{F}_{S}\right]$ corresponds to the first $m$ columns of $\mathbf{S}$ (see equation (6)) and $\left[\mathbf{A}_{S},\mathbf{B}_{S}\right]$ corresponds to the first $m$ rows of $\mathbf{S}$. An efficient way is to approximate these submatrices using subsampled matrices instead of the whole one (i.e., the matrix $\mathbf{Q}\mathbf{K}^{\top}$). Let $\tilde{\mathbf{K}}^{\top}$ denote the matrix that consists of $m$ columns of $\mathbf{K}^{\top}$, and $\tilde{\mathbf{Q}}$ denote the matrix that consists of $m$ rows of $\mathbf{Q}$. Then, we compute the approximations as follows:

$$\hat{\mathbf{F}}=\sigma\!\left(\mathbf{Q}\tilde{\mathbf{K}}^{\top}\right),\quad \hat{\mathbf{B}}=\sigma\!\left(\tilde{\mathbf{Q}}\mathbf{K}^{\top}\right),\quad \hat{\mathbf{A}}=\sigma\!\left(\tilde{\mathbf{Q}}\tilde{\mathbf{K}}^{\top}\right). \qquad (10)$$

Based on equations (7)–(9), we can obtain the efficiently approximated $\hat{\mathbf{S}}$ as follows:

$$\hat{\mathbf{S}}=\hat{\mathbf{F}}\,\hat{\mathbf{A}}^{+}\,\hat{\mathbf{B}}, \qquad (11)$$

where $\tilde{\mathbf{Q}}$ and $\tilde{\mathbf{K}}$ are selected before the softmax operation, which means $\hat{\mathbf{S}}$ can be computed using only small submatrices instead of the whole matrix $\mathbf{Q}\mathbf{K}^{\top}$.

The output of each single compressed tensor self-attention module is computed by

$$\mathbf{O}=\gamma\,\hat{\mathbf{S}}\,\mathbf{V}=\gamma\,\hat{\mathbf{F}}\,\hat{\mathbf{A}}^{+}\,\hat{\mathbf{B}}\,\mathbf{V}. \qquad (12)$$
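A minimal sketch of the Nyström-approximated attention in equations (10)–(12), assuming for illustration that the first m rows of Q and K are taken as the landmarks (the paper allows an adaptive sampling strategy [38]):

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nystrom_attention(Q, K, V, m=16, gamma=0.1):
    """Approximate gamma * softmax(Q K^T) V with Nystrom landmarks (equations (10)-(12)).

    Landmark selection here is simply the first m rows; an adaptive sampling
    strategy can be substituted.
    """
    Q_t, K_t = Q[:m], K[:m]                        # landmark queries and keys
    F_hat = softmax_rows(Q @ K_t.T)                # (n, m)
    B_hat = softmax_rows(Q_t @ K.T)                # (m, n)
    A_hat = softmax_rows(Q_t @ K_t.T)              # (m, m)
    S_hat = F_hat @ np.linalg.pinv(A_hat) @ B_hat  # approximate attention map
    return gamma * S_hat @ V

rng = np.random.default_rng(2)
n, d = 256, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
O_approx = nystrom_attention(Q, K, V, m=16)
print(O_approx.shape)  # (256, 16)
```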

Then, the output of the compressed tensor self-attention block (Figure 2(a)) can be generated by

$$\mathcal{Y}=\mathrm{fold}\!\left(\mathbf{O}_{(1)}\,\|\,\mathbf{O}_{(2)}\,\|\,\mathbf{O}_{(3)}\right), \qquad (13)$$

where $\mathrm{fold}(\cdot)$ denotes a reshape function which rearranges the matrix as a tensor of dimension $I_{1}\times I_{2}\times I_{3}$, $\mathbf{O}_{(n)}$ denotes the output of the mode-$n$ self-attention module, and $\|$ denotes the matrix concatenation operator.
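The block can then be assembled as below: each mode matricization passes through a self-attention module and is folded back to tensor form. In this sketch the three folded outputs are simply summed, which is our simplification of the concatenate-and-reshape combination in equation (13), and the attention module is passed in as a function (e.g., the Nyström module above).

```python
import numpy as np

def unfold(t, mode):
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def fold(mat, mode, shape):
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(mat.reshape(full), 0, mode)

def tensor_attention_block(X, attention_fn):
    """Compressed tensor self-attention block (sketch of equation (13)).

    Each mode-n matricization is processed by a self-attention module
    (Nystrom-based in the paper), folded back to tensor form, and the three
    folded outputs are summed -- a simplification of the concatenate-and-
    reshape combination described in the text.
    """
    out = np.zeros_like(X)
    for mode in range(3):
        On = attention_fn(unfold(X, mode))  # (I_n, product of the other dims)
        out += fold(On, mode, X.shape)
    return out

# A trivial stand-in attention function; plug in the Nystrom module above instead.
rng = np.random.default_rng(3)
X = rng.standard_normal((32, 32, 8))
Y = tensor_attention_block(X, attention_fn=lambda M: 0.1 * M)
print(Y.shape)  # (32, 32, 8)
```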

2.3. Quantized Capsule Network

Aiming to overcome the deficiencies and shortcomings of convolutional neural networks, a novel neural network architecture called the capsule network was first introduced by Geoffrey Hinton and colleagues [14]. A capsule is a set of neurons represented as a vector. The individual values capture features of an object, while the length of the vector represents the capsule's activation probability. The first layer of capsules comes from the output of a convolution. This output is rearranged into vectors of a previously specified dimension (and shrunk using the squashing function), which are used to compute the outputs of the next layer of capsules. The algorithm by which the next-layer capsules are computed from the current-layer capsule outputs is called dynamic routing. It takes predictions from the current-layer capsules about the outputs of the next-layer capsules and computes the actual outputs according to an agreement metric between predictions.
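As background, the squashing nonlinearity and the (unquantized) routing-by-agreement of [14] can be summarized by the following sketch; the capsule counts and dimensions are illustrative.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Nonlinear squashing: preserves direction, maps length into [0, 1)."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between N_p primary and N_c class capsules.

    u_hat: prediction vectors with shape (N_p, N_c, c).
    Returns class capsule outputs v with shape (N_c, c).
    """
    N_p, N_c, _ = u_hat.shape
    b = np.zeros((N_p, N_c))                                   # routing logits
    for _ in range(num_iters):
        c_coup = np.exp(b - b.max(axis=1, keepdims=True))      # coupling coefficients
        c_coup /= c_coup.sum(axis=1, keepdims=True)
        s = (c_coup[..., None] * u_hat).sum(axis=0)            # weighted sum per class
        v = squash(s)                                          # (N_c, c)
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)           # agreement update
    return v

rng = np.random.default_rng(4)
u_hat = rng.standard_normal((1152, 10, 16)) * 0.05
v = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=-1))  # class activation lengths in [0, 1)
```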

It should be noted that the superior capability of the capsule network comes at the cost of heavy computation and storage burdens. To address this problem and make the network easy to deploy on edge computing devices, we integrate a shared weight structure and quantization techniques into the capsule network and propose the quantized capsule network.

As shown in Figure 1, the input of the quantized capsule network is generated by concatenating the output tensors of the multiple compressed tensor self-attention blocks through upsampling operations. From the concatenated tensor $\mathcal{Z}$, the quantized convolutional layer extracts basic features. The primary capsule layer explores more detailed patterns from the extracted basic features:

$$\mathbf{u}_{i}=g\!\left(\mathbf{W}\circledast_{q}\mathcal{Z}\right)_{i},\quad i=1,\ldots,N_{p}, \qquad (14)$$

where $\mathbf{u}_{i}\in\mathbb{R}^{p}$ denotes the output of capsule $i$ in the primary capsule layer, $p$ denotes the dimension of the primary layer capsule vector (or capsule vector length), $N_{p}$ denotes the number of capsules in the primary capsule layer, $g(\cdot)$ denotes the function that reshapes the output tensors into capsule vectors (the detailed description is provided in [14]), $\mathbf{W}$ denotes the convolution kernels of the primary capsule layer, and $\circledast_{q}$ denotes the quantized convolution operator (the detailed derivation of quantized convolution is provided in [39]).

Generally, the prediction vector $\hat{\mathbf{u}}_{j|i}$ generated by the primary layer capsule $i$ indicates how much the primary layer capsule $i$ contributes to the class layer capsule $j$. $\hat{\mathbf{u}}_{j|i}$ is given by

$$\hat{\mathbf{u}}_{j|i}=\mathbf{W}_{ij}\mathbf{u}_{i},\quad \mathbf{W}_{ij}\in\mathbb{R}^{c\times p},\; i=1,\ldots,N_{p},\; j=1,\ldots,N_{c}, \qquad (15)$$

where $\mathbf{W}_{ij}$ denotes the weight matrix between the primary layer capsule $i$ and the class layer capsule $j$, $c$ denotes the dimension of the class layer capsule vector, $p$ denotes the dimension of the primary layer capsule vector (or capsule vector length), and $N_{c}$ and $N_{p}$ denote the numbers of capsules in the class capsule layer and the primary capsule layer, respectively.

From equation (15), we can find that there are $N_{p}\times N_{c}$ weight matrices $\mathbf{W}_{ij}$, which leads to heavy computation and memory burdens. To reduce the burden, we adopt two strategies. First, we utilize the shared structure of weight matrices (shown in Figure 3) as

$$\hat{\mathbf{u}}_{j|i}=\mathbf{W}_{j}\mathbf{u}_{i},\quad \mathbf{W}_{j}\in\mathbb{R}^{c\times p}, \qquad (16)$$

where $\mathbf{W}_{j}$ denotes the transformation weight matrix corresponding to class layer capsule $j$ (i.e., each class layer capsule shares its weight matrix with all primary layer capsules). Equation (16) indicates that the number of weight matrices is reduced from $N_{p}\times N_{c}$ to $N_{c}$.
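The following sketch contrasts the parameter counts of equations (15) and (16) and computes the shared-weight predictions with a single einsum; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N_p, N_c, p, c = 1152, 10, 8, 16

# Unshared (equation (15)): one c x p matrix per (i, j) pair -> N_p * N_c matrices.
W_full = rng.standard_normal((N_p, N_c, c, p)) * 0.05

# Shared (equation (16)): one c x p matrix per class capsule -> N_c matrices.
W_shared = rng.standard_normal((N_c, c, p)) * 0.05

u = rng.standard_normal((N_p, p))              # primary capsule outputs
u_hat = np.einsum('jcp,ip->ijc', W_shared, u)  # predictions u_hat[i, j] = W_j @ u_i
print(u_hat.shape)                             # (1152, 10, 16)
print(W_full.size // W_shared.size)            # parameter reduction factor = N_p
```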

Second, we propose the quantized dynamic routing process that implements dynamic routing in a more efficient way (shown in Figure 4). For simplicity, we assume that $p$ can be divided by $d$ with no remainder and $s=p/d$. Let $\mathbf{W}_{j}=\left[\mathbf{W}_{j}^{1},\mathbf{W}_{j}^{2},\ldots,\mathbf{W}_{j}^{s}\right]$, where $\mathbf{W}_{j}^{l}\in\mathbb{R}^{c\times d}$ denotes the $l$-th submatrix of $\mathbf{W}_{j}$, $l=1,\ldots,s$. We train a subcodebook for each subspace of the weight matrices as follows:

$$\min_{\mathbf{D}^{l},\mathbf{B}_{j}^{l}}\left\|\mathbf{W}_{j}^{l}-\mathbf{B}_{j}^{l}\mathbf{D}^{l}\right\|_{F}^{2}, \qquad (17)$$

where $\mathbf{D}^{l}\in\mathbb{R}^{K\times d}$ denotes the subcodebook consisting of $K$ subcodewords for the $l$-th subspace, and $\mathbf{B}_{j}^{l}\in\{0,1\}^{c\times K}$ denotes the indexing matrix; each row of $\mathbf{B}_{j}^{l}$ has only one nonzero entry, which specifies the quantization relationship between a subvector and a subcodeword. An alternating optimization algorithm, such as k-means clustering, is employed for learning $\mathbf{D}^{l}$ and $\mathbf{B}_{j}^{l}$.

Let $\mathbf{u}_{i}=\left[\mathbf{u}_{i}^{1};\mathbf{u}_{i}^{2};\ldots;\mathbf{u}_{i}^{s}\right]$, where $\mathbf{u}_{i}^{l}\in\mathbb{R}^{d}$ denotes the $l$-th subvector of $\mathbf{u}_{i}$, $l=1,\ldots,s$. We train a subcodebook for each subspace of the primary layer capsule vectors as follows:

$$\min_{\mathbf{E}^{l},\mathbf{b}_{i}^{l}}\left\|\mathbf{u}_{i}^{l}-\mathbf{E}^{l}\mathbf{b}_{i}^{l}\right\|_{2}^{2}, \qquad (18)$$

where $\mathbf{E}^{l}\in\mathbb{R}^{d\times K}$ denotes the subcodebook consisting of $K$ subcodewords for the $l$-th subspace, and $\mathbf{b}_{i}^{l}\in\{0,1\}^{K}$ denotes the index vector that has only one nonzero entry, which specifies the quantization relationship between the subvector and a subcodeword. An alternating optimization algorithm, such as k-means clustering, can be employed for learning $\mathbf{E}^{l}$ and $\mathbf{b}_{i}^{l}$.
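A sketch of the product-quantization step in equation (18), using scikit-learn's k-means to learn the subcodebooks; the subspace dimension d and the codebook size K are illustrative choices, and the same procedure applies to the weight submatrices of equation (17).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
p, d = 8, 2              # capsule length p split into s = p // d subspaces
s = p // d
K = 16                   # subcodewords per subspace
N_p = 1152               # number of primary capsules

u = rng.standard_normal((N_p, p)).astype(np.float32)  # primary capsule vectors

codebooks, codes = [], []
for l in range(s):
    sub = u[:, l * d:(l + 1) * d]                      # l-th subvectors of all capsules
    km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_)              # subcodebook, shape (K, d)
    codes.append(km.labels_)                           # index of the chosen subcodeword

# Reconstruct the quantized capsule vectors from the codebooks and indices.
u_q = np.concatenate([codebooks[l][codes[l]] for l in range(s)], axis=1)
print(np.mean((u - u_q) ** 2))                         # quantization error (cf. equation (18))
```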

Combining equations (17) and (18), we can rewrite equation (16) as

$$\hat{\mathbf{u}}_{j|i}=\mathbf{W}_{j}\mathbf{u}_{i}=\sum_{l=1}^{s}\mathbf{W}_{j}^{l}\mathbf{u}_{i}^{l}\approx\sum_{l=1}^{s}\mathbf{B}_{j}^{l}\mathbf{D}^{l}\mathbf{E}^{l}\mathbf{b}_{i}^{l}, \qquad (19)$$

where $\mathbf{B}_{j}^{l}\mathbf{D}^{l}$ and $\mathbf{E}^{l}\mathbf{b}_{i}^{l}$ denote the quantized approximations of $\mathbf{W}_{j}^{l}$ and $\mathbf{u}_{i}^{l}$, respectively.

It is obvious that there are many repeated elements in the products $\mathbf{D}^{l}\mathbf{E}^{l}$ after the parameter quantization. Therefore, it is unwise to compute the products in a one-by-one style. Instead, we first compute the results of the product $\mathbf{D}^{l}\mathbf{E}^{l}$, i.e., construct a lookup table, as follows:

$$\mathbf{T}^{l}=\mathbf{D}^{l}\mathbf{E}^{l}, \qquad (20)$$

where $\mathbf{T}^{l}\in\mathbb{R}^{K\times K}$, $l=1,\ldots,s$.

Then, in the application, we can look up the precomputed table instead of repeatedly computing $\mathbf{D}^{l}\mathbf{E}^{l}$, which raises computational speed significantly. Hence, we can rewrite equation (19) as follows:

$$\hat{\mathbf{u}}_{j|i}\approx\sum_{l=1}^{s}\mathbf{B}_{j}^{l}\mathbf{T}^{l}\mathbf{b}_{i}^{l}, \qquad (21)$$

where the product $\mathbf{B}_{j}^{l}\mathbf{T}^{l}\mathbf{b}_{i}^{l}$ can be considered as the process of looking up the precomputed table instead of a matrix multiplication operation.
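The lookup-table trick in equations (20)–(21) precomputes all K × K inner products between weight subcodewords and capsule subcodewords, so each prediction reduces to gathering and summing table entries. The sketch below verifies this against the explicit reconstruction; the codebooks and indices are random stand-ins for the learned ones.

```python
import numpy as np

rng = np.random.default_rng(7)
K, d, s, c = 16, 2, 4, 8         # codebook size, subspace dim, #subspaces, class-capsule dim

# Hypothetical learned subcodebooks (stand-ins for equations (17)-(18)).
D = rng.standard_normal((s, K, d))       # weight-row subcodewords (rows)
E = rng.standard_normal((s, d, K))       # capsule subcodewords (columns)

# Precompute the lookup tables T^l = D^l E^l once, offline (equation (20)).
T = np.einsum('lkd,ldq->lkq', D, E)      # shape (s, K, K)

# Quantization indices for one (class j, primary i) pair:
w_idx = rng.integers(0, K, size=(s, c))  # per subspace, codeword index of each weight row
u_idx = rng.integers(0, K, size=s)       # per subspace, codeword index of the capsule subvector

# Prediction via table lookups (equation (21)) -- no matrix product at inference time.
u_hat = np.zeros(c)
for l in range(s):
    u_hat += T[l][w_idx[l], u_idx[l]]    # gather c entries from the l-th table

# Check against the explicit reconstruction W_j u_i from the codebooks.
W_rec = np.concatenate([D[l][w_idx[l]] for l in range(s)], axis=1)   # (c, s*d)
u_rec = np.concatenate([E[l][:, u_idx[l]] for l in range(s)])        # (s*d,)
assert np.allclose(u_hat, W_rec @ u_rec)
```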

According to the mechanism of the capsule network, the input vector of class layer capsule $j$ can be computed by

$$\mathbf{s}_{j}=\sum_{i}c_{ij}\,\hat{\mathbf{u}}_{j|i}, \qquad (22)$$

where $c_{ij}$ denotes the coupling coefficient determined by the iterative dynamic routing process (see Table 1). The routing part is actually a weighted sum of $\hat{\mathbf{u}}_{j|i}$ with the coupling coefficients. The output vector of class layer capsule $j$ is calculated by applying a nonlinear squashing function, which ensures that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below one, as

$$\mathbf{v}_{j}=\frac{\left\|\mathbf{s}_{j}\right\|^{2}}{1+\left\|\mathbf{s}_{j}\right\|^{2}}\,\frac{\mathbf{s}_{j}}{\left\|\mathbf{s}_{j}\right\|}, \qquad (23)$$

where $\mathbf{v}_{j}$ denotes the output vector of class layer capsule $j$.

Obviously, the capsule’s activation function actually suppresses and redistributes vector lengths. Its output can be used as the probability of the entity represented by the current class capsule. The quantized dynamic routing algorithm is shown in Table 1.

We construct the whole loss function of the proposed CWTACapsNet by integrating the margin loss [14], the reconstruction loss [14], and the quantization loss as follows:

$$L=L_{m}+\alpha L_{r}+\beta L_{q}, \qquad (24)$$

where $\alpha$ and $\beta$ denote positive coefficients and $L_{m}$, $L_{r}$, and $L_{q}$ denote the margin loss function, the reconstruction loss function, and the quantization loss function, respectively. They are defined by equations (25)–(27) as follows:

$$L_{m}=\sum_{k}\left[T_{k}\max\!\left(0,m^{+}-\left\|\mathbf{v}_{k}\right\|\right)^{2}+\lambda\left(1-T_{k}\right)\max\!\left(0,\left\|\mathbf{v}_{k}\right\|-m^{-}\right)^{2}\right], \qquad (25)$$

$$L_{r}=\left\|\mathbf{x}-\hat{\mathbf{x}}\right\|_{2}^{2}, \qquad (26)$$

$$L_{q}=\sum_{j}\sum_{l}\left\|\mathbf{W}_{j}^{l}-\mathbf{B}_{j}^{l}\mathbf{D}^{l}\right\|_{F}^{2}+\sum_{i}\sum_{l}\left\|\mathbf{u}_{i}^{l}-\mathbf{E}^{l}\mathbf{b}_{i}^{l}\right\|_{2}^{2}, \qquad (27)$$

where $T_{k}=1$ if and only if class $k$ is the correct classification, $m^{+}$, $m^{-}$, and $\lambda$ denote positive coefficients ($\lambda$ is usually selected as 0.5), and $\hat{\mathbf{x}}$ denotes the reconstructed image.
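For reference, the margin loss of equation (25) can be sketched as follows; the values m+ = 0.9 and m− = 0.1 follow the original CapsNet formulation [14] and are an assumption here, since the text only fixes λ = 0.5.

```python
import numpy as np

def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss of Sabour et al. [14] (cf. equation (25)).

    v_lengths: (batch, N_c) lengths of the class-capsule output vectors.
    targets:   (batch, N_c) one-hot labels (T_k = 1 iff class k is correct).
    """
    pos = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    neg = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return np.sum(pos + neg, axis=1).mean()

rng = np.random.default_rng(8)
lengths = rng.uniform(0.0, 1.0, size=(32, 47))        # e.g., 47 DTD classes
labels = np.eye(47)[rng.integers(0, 47, size=32)]     # one-hot ground truth
print(margin_loss(lengths, labels))
```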

3. Experiments

The aim of this section is to validate our proposed CWTACapsNet on three datasets: CUReT [40], DTD [41], and KTH-TIPS2-b [42]. For the CUReT dataset, we use the same subset as in [43], which contains 61 texture classes (92 images per class). The DTD dataset contains 47 classes (120 images per class). KTH-TIPS2-b contains 11 classes. Each class in KTH-TIPS2-b contains 432 images which are resized to 256 × 256 pixels. Besides CWTACapsNet, five state-of-the-art methods, T-CNN [13], FV-CNN [8], SI-LCvMSP [1], Wavelet CNNs [11], and CapsNet [44], are employed for performance comparison.

Models in the experiments are trained under Ubuntu 16.04 with an i7-8700 CPU, 64 GB RAM, and a GeForce GTX Titan-XP GPU, and our proposed CWTACapsNet is deployed on a Jetson TX2. To provide a direct comparison with published results, the parameters of the five state-of-the-art methods are set according to previous studies [1, 8, 11, 13, 44]. We use an exponential decay learning policy, with an initial learning rate of 0.001, 2000 decay steps, and a 0.96 decay rate. We employ the Adam optimizer to adjust the weights of CWTACapsNet in the training process. The batch size is set to 32. We implement data augmentation by rotating images with a random angle between 0° and 90°. We use 3 routing iterations to update capsule parameters in CWTACapsNet. The number of wavelet levels in CWTACapsNet is selected according to the tradeoff between validation accuracy and the number of network parameters; we thus choose 3-level wavelet decomposition. The learnable coefficient $\gamma$ is selected as 0.1, and $\alpha$ and $\beta$ are selected as 0.001 and 0.0013, respectively.
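The optimizer and learning-rate schedule above can be reproduced with a PyTorch-style setup such as the following sketch; the model is a placeholder, and the per-step exponential decay (rate 0.96 every 2000 steps) is implemented with a lambda scheduler.

```python
import torch

model = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3), torch.nn.ReLU())  # placeholder network

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Exponential decay: lr = 0.001 * 0.96 ** (step / 2000), updated every training step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 0.96 ** (step / 2000.0))

batch_size = 32
for step in range(10):                  # illustrative loop; real training iterates over data
    optimizer.zero_grad()
    x = torch.randn(batch_size, 3, 256, 256)
    loss = model(x).mean()              # stand-in for the CWTACapsNet loss in equation (24)
    loss.backward()
    optimizer.step()
    scheduler.step()
```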

Table 2 lists the classification accuracies and standard deviations of the six methods. Table 2 indicates that CWTACapsNet achieves the best performance and is more stable than the other methods. The tensor attention block enables CWTACapsNet to capture multidirectional dependencies, which the other methods cannot. FV-CNN performs better than CapsNets; both deal with the pooling operation, and FV-CNN has a specific design to capture texture information. CNN-based texture classification methods tend to be limited by the lack of diversity of convolution filters. The multilevel wavelet decomposition extends both spatial and frequency features, which raises the diversity of convolution filters and improves performance.

We add 10% white noise to the texture datasets to evaluate robustness. Table 3 shows the performance on the noisy datasets. Figure 5 shows the accuracy for pure and noisy data, and Figure 6 shows the accuracy standard deviations (std) for pure and noisy data.

From Table 3 and Figures 5 and 6, we can find that CWTACapsNet achieves the best accuracy and robustness. Although CapsNets and CWTACapsNet are both based on capsule layers, CWTACapsNet significantly outperforms CapsNets. The memory requirement of CapsNets in the experiments is about 272 MB, while our proposed CWTACapsNet only requires 23.2 MB with about a 10× speed-up. CWTACapsNet can be deployed and run on the Jetson TX2, while CapsNets requires more resources than the Jetson TX2 can support. The superiority of CWTACapsNet relies on three factors: the multilevel wavelet decomposition extends features from the spatial domain to the frequency domain, the tensor attention block explores relationships from all possible directions and captures dependencies across channels, and the quantized dynamic routing significantly reduces the memory requirement. The experimental results validate the effectiveness of CWTACapsNet.

4. Conclusion

In order to make the capsule network efficiently explore spatial and spectral features and capture multidirectional channel dependencies, this paper proposes a novel capsule network named the compressed wavelet tensor attention capsule network (CWTACapsNet). In CWTACapsNet, the compressed multiscale wavelet transform is designed to extract multiscale spectral features in the frequency domain; the tensor attention blocks utilize matricization to capture multidirectional dependencies across convolutional channels at each scale; furthermore, we propose the quantized dynamic routing process to speed up computation and reduce storage. Experimental studies have shown that the proposed CWTACapsNet provides the best performance in both classification accuracy and antinoise robustness; moreover, CWTACapsNet significantly reduces the computational and storage complexities. In the future, we will incorporate parallel computation methods into CWTACapsNet to further improve efficiency.

Data Availability

The data used to support the findings of this study are publicly available. The datasets can be obtained from the links provided in [40–42]. The CUReT dataset is available at https://www.cs.columbia.edu/CAVE/software/curet/html/download.h

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61972104 and 61571141).