Abstract
Denoising Autoencoder (DAE) is one of the most popular approaches in recent neural network research and has reported significant success. Specifically, DAE randomly sets some features of the data to zero so as to exploit co-occurrence information while avoiding overfitting. However, existing DAE approaches do not fare well on sparse and high dimensional data. In this paper, we present a denoising autoencoder, labeled here as InstanceWise Denoising Autoencoder (IDA), which is designed to work with high dimensional and sparse data by utilizing the instance-wise co-occurrence relation instead of the feature-wise one. IDA operates on the following corruption rule: if a nonzero feature of an instance vector is selected, the instance is forced to become a zero vector. To avoid serious information loss in the event that too many instances are discarded, an ensemble of multiple independent autoencoders built on different corrupted versions of the data is considered. Extensive experimental results on high dimensional and sparse text data show the superiority of IDA in efficiency and effectiveness. IDA is also evaluated in the heterogeneous transfer learning setting and on cross-modal retrieval to study its generality on heterogeneous feature representation.
1. Introduction
Denoising Autoencoder (DAE) [1–5] is an extension of the classical autoencoder [6, 7], where feature denoising is key for the autoencoder to generate better features. In contrast to the classic autoencoder, the input vector in DAE is first corrupted by randomly setting some of the features to zero. The model then attempts to reconstruct the uncorrupted input from the corrupted version. By predicting the uncorrupted values from the corrupted input, DAE has been shown to generalize well even with noisy input. However, DAE and its variants do not fare well on high dimensional and sparse data: many features are already zero, so resetting them again has no effect on the original data. Moreover, high dimensional data also lead to an uneven distribution of uncorrupted features. To address these challenges, we propose a denoising scheme designed for high dimensional and sparse data, labeled here as the InstanceWise Denoising Autoencoder (IDA). More specifically, if one nonzero feature of an instance is chosen, the instance is removed entirely. As a result, many instances are removed from the data, which leads to serious information loss. Therefore, a recovery strategy is adopted in which multiple independent autoencoders are constructed on different versions of the corrupted input and then combined to obtain the final solution. Because instances are dropped directly, IDA can reduce the training data size significantly, which is useful for large scale data analytics. Additionally, the autoencoders in the model are independent from the data retrieval level to the command execution level, which makes IDA naturally parallelizable on a single multicore CPU computer or a distributed computing platform. In this paper, we verify the performance on classic high dimensional and sparse text data. The experimental results show that the proposed autoencoder is very fast and effective.
Furthermore, we study the application of IDA to heterogeneous feature representation and propose HeterogenousSIDA based on the heterogeneous feature fusion framework [8]. Experiments on transfer learning and cross-modal retrieval show that IDA obtains better performance than mSDA, the autoencoder embedded in the fusion framework [8].
The core contributions of the current paper are as follows: (i) An InstanceWise Denoising Autoencoder (IDA) method is proposed that improves generalization performance and efficiency. (ii) A procedure for rapidly building a deep learning structure by stacking IDA is proposed for large scale high dimensional problems. (iii) The deep learning approach is further applied to two heterogeneous feature learning tasks: cross-language classification and cross-modal retrieval.
2. Review on Denoising Autoencoder
In a classic autoencoder [6], the aim is to learn a distributed representation that captures the coordinates along the core factors of variation in the data. As shown in Figure 1(a), an autoencoder takes an input $x$ and maps it to a hidden space $h$ through a deterministic mapping $h = f(Wx + b)$ with weights $W$, in the step called encoding. Then, in the decoding step, the latent representation or code $h$ is mapped back into a reconstruction $z$ that has the same shape as $x$ through a similar transformation $z = g(W'h + b')$. The parameters of this model, $W$, $b$, and $b'$ ($W'$ denotes the transposition of $W$), are optimized such that the average reconstruction error is minimized. The reconstruction error can be the cross-entropy loss [9] or the squared error loss as follows:
$$L(x, z) = \|x - z\|^2$$
Figure 1: (a) Classical autoencoder. (b) Denoising autoencoder.
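As a rough illustration of the encoding and decoding steps of a classic autoencoder, the sketch below runs one forward pass with tied weights and evaluates the squared error. The sigmoid nonlinearity, the sizes `d` and `m`, and all function names are assumptions made for this example; training (for instance by gradient descent) is omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b):
    """Encoding step: hidden code h = s(W x + b)."""
    return sigmoid(W @ x + b)

def decode(h, W, b_prime):
    """Decoding step with tied weights: z = s(W^T h + b')."""
    return sigmoid(W.T @ h + b_prime)

def squared_error(x, z):
    """Squared reconstruction error between the input and its reconstruction."""
    return float(np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
d, m = 8, 3                                  # input and hidden sizes (illustrative)
W = rng.normal(scale=0.1, size=(m, d))       # untrained weights, for shape checking only
b, b_prime = np.zeros(m), np.zeros(d)
x = rng.random(d)
h = encode(x, W, b)
z = decode(h, W, b_prime)
loss = squared_error(x, z)
```

The reconstruction `z` has the same shape as `x`, and the code `h` has the (smaller) hidden dimension.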
However, the basic autoencoder alone is not sufficient as the basis of a deep architecture because it has a tendency to overfit. In other words, the reconstruction criterion alone cannot guarantee the extraction of useful features, as it can lead to the obvious solution of simply copying the input, or to similarly uninteresting solutions that maximize the mutual information in a trivial manner. Denoising Autoencoder (DAE) [1] is an extension of the classical autoencoder introduced specifically to address this phenomenon. As shown in Figure 1(b), DAE is trained to reconstruct a “clean” or “repaired” version of the corrupted input. This is achieved by first corrupting the original input $x$ to arrive at $\tilde{x}$ by means of a stochastic corruption process that randomly sets some of the values in the input vector to zero [1]. The corrupted input $\tilde{x}$ is then mapped, as with the basic autoencoder, to a hidden representation $h = f(W\tilde{x} + b)$ from which we reconstruct $z = g(W'h + b')$. The parameters $W$, $b$, and $b'$ are trained to minimize the average reconstruction error over a training set, that is, to have $z$ as close as possible to the uncorrupted input $x$. A crucial limitation of DAE is its high computational cost, which is due to the expensive nonlinear optimization process. To this end, Chen et al. [4] proposed Marginalized Denoising Autoencoders (mDAE), which replace the encoder and decoder with one linear transformation matrix. mDAE provides a closed-form solution for the parameters and thus eliminates the use of other optimization algorithms, for example, stochastic gradient descent and backpropagation. Liang and Liu [10] combined stacked denoising autoencoders with dropout and reduced the time complexity of the fine-tuning phase. Moreover, when the input is heavily corrupted during training, the network tends to learn coarse-grained features, whereas when the input is only slightly corrupted, the network tends to learn fine-grained features.
To address this problem, Geras and Sutton [3] proposed scheduled denoising autoencoders, which learn features at multiple levels of scale by starting with a high level of noise that is lowered as training progresses. To reduce the effect of outliers, Jiang et al. [5] proposed a robust norm to measure the reconstruction error and thus learn a more robust model. To improve denoising performance, Cho [11] proposed a simple sparsification method for the latent representation found by the encoder. Wang et al. [12] proposed a probabilistic formulation for the stacked denoising autoencoder (SDAE) and then extended it to a relational SDAE (RSDAE) model, which jointly performs deep representation learning and relational learning in a principled way under a probabilistic framework. These DAE algorithms address many shortcomings of traditional autoencoders, such as their inability in principle to learn useful overcomplete representations, and have been shown to generalize well even with noisy input. However, DAE and its variants do not fare well on high dimensional and sparse data since many features are already zero; resetting them again has no effect on the original data. Moreover, high dimensional data also lead to an uneven distribution of uncorrupted features. To address these challenges, we propose a denoising scheme designed for high dimensional sparse data, labeled here as the InstanceWise Denoising Autoencoder (IDA).
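The claim that feature-wise corruption has little effect on sparse data can be checked with a small experiment. The sketch below applies masking noise to a synthetic sparse vector and counts how many entries actually change; the dimension, density, and corruption probability are illustrative assumptions, not values from the experiments.

```python
import random

def masking_noise(x, p, rng):
    """Feature-wise masking noise: set each feature to zero with probability p."""
    return [0.0 if rng.random() < p else v for v in x]

rng = random.Random(0)
d = 10000                      # number of features (illustrative)
density = 0.001                # expected fraction of nonzero features (illustrative)
x = [1.0 if rng.random() < density else 0.0 for _ in range(d)]
x_tilde = masking_noise(x, 0.5, rng)

nonzero = sum(v != 0 for v in x)                       # features that carry information
changed = sum(a != b for a, b in zip(x, x_tilde))      # features altered by the noise
print(f"{changed} of {d} features changed; only {nonzero} were nonzero to begin with")
```

Because masking can only alter features that were nonzero, the number of changed entries is bounded by the number of nonzero entries, which is a tiny fraction of the dimension for sparse data.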
3. Methodology
3.1. InstanceWise Denoising Autoencoder (IDA)
In this section we introduce a novel denoising method of autoencoder, which preserves its strong feature learning capabilities and alleviates the concerns mentioned.
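Given the original instances $X = [x_1, \ldots, x_n]$ with $x_i \in \mathbb{R}^d$, we corrupt them by a modified strategy: if one nonzero feature of an instance is selected, where each feature is selected with a given probability $p$, then the instance is reset to zero entirely. More specifically, we generate a $d$-bit Boolean vector $b$ whose elements are nonzero with probability $p$, where each element corresponds to a feature. If the indices of the nonzero elements of $b$ and of an instance $x_i$ overlap (i.e., $\mathrm{supp}(b) \cap \mathrm{supp}(x_i) \neq \emptyset$, where $\mathrm{supp}(\cdot)$ denotes the set of indices of nonzero elements), all features of the instance are reset to $0$; otherwise, the instance is retained. It can be written as follows:
$$\tilde{x}_i = \begin{cases} 0, & \mathrm{supp}(b) \cap \mathrm{supp}(x_i) \neq \emptyset, \\ x_i, & \text{otherwise.} \end{cases}$$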
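The corruption rule described above can be sketched in a few lines. In this illustration each instance is stored sparsely as a dictionary of its nonzero features, and the function and variable names are hypothetical choices for the example.

```python
import random

def sample_feature_mask(d, p, rng):
    """Indices set in a d-bit Boolean vector whose bits are 1 with probability p."""
    return {j for j in range(d) if rng.random() < p}

def corrupt_instancewise(instances, selected):
    """Zero out every instance that has a nonzero feature among the selected indices.

    Each instance is a dict {feature_index: value} holding only nonzero features;
    an empty dict stands for the zero vector.
    """
    return [{} if selected.intersection(x) else dict(x) for x in instances]

instances = [{0: 1.0, 3: 2.0}, {1: 1.0}, {2: 0.5, 4: 1.0}]
corrupted = corrupt_instancewise(instances, {3})   # only feature 3 is selected
# The first instance contains feature 3 and is reset; the other two are retained.

rng = random.Random(0)
random_mask = sample_feature_mask(5, 0.2, rng)     # a randomly drawn selection
randomly_corrupted = corrupt_instancewise(instances, random_mask)
```

Working on the sparse representation means the overlap test touches only the nonzero features of each instance, which suits high dimensional sparse inputs.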
After denoising, the resultant input is denoted as . We reconstruct the inputs through minimizing the following reconstruction loss:where and denote different norm with corresponding , . According to different parameters , , , and , different coding can be obtained. The optimization methods and computation cost are also different.
(i) and Are Linear, , and . The loss function can be rewritten as . The reconstruction error norm is Frobenius norm [13] and then the solution will be obtained in a closedform directly.
(ii) Is Nonlinear and Is Linear, , and . The loss function can be rewritten as ; the solution can be obtained by Extreme Learning Machine Based Autoencoder (ELMAE) [14] where is first randomly assigned and then replaced by optimized result.
As shown in Figure 2, to recover the information lost through instance corruption, a combination of multiple independent encoders is adopted. In particular, versions of denoising inputs and corresponding independent autoencoders are constructed as follows: where and denote the different versions of the corrupted input and their corresponding encoders. Because all autoencoders are independent throughout the process, from data retrieval to operation execution, they are well suited to parallel execution on a multicore CPU computer or a distributed computing platform.
After obtaining the code of every autoencoder, we can reach the final solution by combining them together; that is,
In comparison to the feature-wise denoising scheme, where some features are corrupted, the InstanceWise Denoising scheme has the following benefits: (i) tackling the challenge of high dimensional sparse data; (ii) explicitly reducing the number of data instances used: for example, for a problem with 1 million data instances, if only 1% of the instances are retained in the corrupted inputs, the computational cost will be reduced to only 0.01% of the original (here we refer to the widely used [13]), a figure that corresponds to a cost growing quadratically with the number of instances, since $0.01^2 = 10^{-4}$; and (iii) being easy to implement in a parallel paradigm.
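The ensemble idea can be sketched as follows: draw several corrupted versions of the data, fit one independent encoder on the retained instances of each version, and combine the resulting codes. This is only a structural sketch under stated assumptions: the encoder is a stand-in (a truncated SVD projection, not the solver used in the paper), the codes are combined by concatenation, and the names `retained_rows`, `fit_linear_encoder`, and `ida_ensemble_codes` are hypothetical.

```python
import numpy as np

def retained_rows(X, p, rng):
    """Boolean mask of the instances (rows) kept by the instance-wise rule."""
    selected = rng.random(X.shape[1]) < p            # Boolean feature vector
    hit = (X[:, selected] != 0).any(axis=1)          # rows with a selected nonzero feature
    return ~hit

def fit_linear_encoder(X_kept, m):
    """Assumed stand-in encoder: top-m right singular vectors of the kept rows."""
    _, _, Vt = np.linalg.svd(X_kept, full_matrices=False)
    return Vt[:m].T                                  # shape (d, at most m)

def ida_ensemble_codes(X, p, m, k, seed=0):
    """Fit k independent encoders on k corrupted versions and concatenate their codes."""
    rng = np.random.default_rng(seed)
    codes = []
    for _ in range(k):
        kept = retained_rows(X, p, rng)
        if not kept.any():                           # every instance was dropped
            continue
        W = fit_linear_encoder(X[kept], m)           # trained on the reduced data only
        codes.append(X @ W)                          # encode all original instances
    if not codes:
        return np.zeros((X.shape[0], 0))
    return np.hstack(codes)                          # combination by concatenation (assumption)

rng = np.random.default_rng(1)
X = rng.random((200, 50)) * (rng.random((200, 50)) < 0.05)   # synthetic sparse data
Z = ida_ensemble_codes(X, p=0.02, m=5, k=4)
```

Each encoder sees only the retained rows, which is where the reduction in training cost comes from, while the final codes are computed for every original instance.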
3.2. Stacked InstanceWise Denoising Autoencoder (SIDA)
IDA can be stacked to build deep network which has more than one hidden layer. Generative deep model built by stacking multilayer autoencoders can obtain a more useful representation to express the multilevel structure of image features or other data. Figure 3 shows a typical instance of SIDA structure, which includes two encoding layers and two decoding layers. Supposing there are hidden layers in the encoding part, we have the activation function of the th encoding layer:where the input is the original data . The output of the last encoding layer is the high level features extracted by the SIDA network. In the decoding steps, the output of the first decoding layer is regarded as the input of the second decoding layer. The decoding function of the th decoding layer iswhere the input of the first decoding layer is the output of the last encoding layer. The output of the last decoding layer is the reconstruction of the original data. The training process of SIDA is provided as follows.
Step 1. Train the first IDA, which includes the first encoding layer and the last decoding layer. Obtain the network weight and the output of the first encoding layer.
Step 2. Use as the input data of the th encoding layer. Train the th IDA and obtain and , where and is the number of hidden layers in the network.
It can be seen that each IDA is trained independently, and therefore the training of SIDA is called layerwise training (Algorithm 1).
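The layer-wise procedure in Steps 1 and 2 can be sketched as a simple loop in which the output of each trained layer becomes the input of the next. The single-layer trainer below is a placeholder (an SVD projection followed by `tanh`), chosen only so that the example runs; it is an assumption and not the IDA solver itself, and the names `train_layer` and `train_stacked` are hypothetical.

```python
import numpy as np

def train_layer(H, m):
    """Placeholder single-layer trainer (assumption): top-m SVD projection weights."""
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    return Vt[:m].T                       # shape (input_dim, m)

def train_stacked(X, layer_sizes):
    """Greedy layer-wise training: each layer is fitted on the previous layer's output."""
    weights, H = [], X
    for m in layer_sizes:
        W = train_layer(H, m)             # train this layer independently
        H = np.tanh(H @ W)                # its output feeds the next layer
        weights.append(W)
    return weights, H

rng = np.random.default_rng(0)
X = rng.random((30, 20))
weights, features = train_stacked(X, [10, 5])
```

Here `features` plays the role of the high level representation produced by the last encoding layer.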

3.3. SIDA for Transfer Learning
In the previous sections, we have shown the superiority of our algorithm in computational complexity for single source data. Nevertheless, with multimedia data becoming the mainstream of information dissemination on the network, heterogeneous data mining is more important than ever. In this section, we discuss how to integrate SIDA into the heterogeneous feature learning framework [15] and then apply it to two classical heterogeneous data mining tasks: heterogeneous transfer learning and cross-modal retrieval. We term this SIDA for heterogeneous data mining HeterogenousSIDA.
Transfer learning has demonstrated its success in different applications. Our study focuses on heterogeneous transfer learning, which aims to learn a feature mapping across heterogeneous feature spaces based on some cross-domain correspondences. In this field, Shi et al. [16] proposed a spectral transformation based heterogeneous transfer learning method, which employs spectral transformation to map cross-domain data into a common feature space through linear projection. Duan et al. [17] used two different projection matrices to transform the data from two domains into a common subspace and then used two new feature mapping functions to augment the transformed data with their original features and zeros. Kulis et al. [18] proposed learning an asymmetric nonlinear kernel transformation that maps points from one domain to another. Zhou et al. [8] proposed a multiclass heterogeneous transfer learning algorithm that reconstructs a sparse feature transformation matrix to map the weight vector of classifiers learned from the source domain to the target. Glorot et al. [19] trained a stacked denoising autoencoder (SDAE) to reconstruct the input (ignoring the labels) on the union of the source and target data and then trained a classifier on the resulting feature representation. Chen et al. [20] proposed a marginalized Stacked Denoising Autoencoder (mSDA) for domain adaptation, in which a closed-form solution is obtained for SDAE. Zhou et al. [15] further applied mSDA to learn the deep learning structure as well as the feature mappings between cross-domain heterogeneous features to reduce the bias issue caused by the cross-domain correspondences.
The HeterogenousSIDA model can be trained based on the multilayer heterogenous data fusion framework [15] as in Figure 4. In particular, given a set of data pairs from two different domains , the objective is to learn the weight matrices that project the source and target data to the th hidden layer, and , respectively, and also two feature mappings that map the data to a common space such that the disparity between source and target domain data is minimized:where and are regularization terms to avoid overfitting.
and can be computed by an alternating optimization algorithm. However, here for simplicity, we let one weight matrix be the identity matrix and learn just one feature mapping :where is a regularization term.
The closed-form solution can then be obtained by
Sometimes, the correlation between two domains may be nonlinear, so we extend the linear mapping to a nonlinear one by the kernel method. The dual form can be written as follows:
where is a kernel function such as the RBF kernel.
The feature mapping can be obtained by
The representation of can be written as
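To make the idea of a regularized mapping with a closed-form solution concrete, the sketch below fits a ridge-regression map from target-domain features to source-domain features, together with an RBF-kernel variant in dual form. The objective (Frobenius-norm least squares with a ridge penalty), the mapping direction, the row-per-instance layout, and all names and parameter values are assumptions made for illustration; the exact formulas of the framework are not reproduced here.

```python
import numpy as np

def ridge_mapping(H_t, H_s, lam):
    """Closed-form G minimizing ||H_t G - H_s||_F^2 + lam ||G||_F^2 (assumed objective)."""
    d = H_t.shape[1]
    return np.linalg.solve(H_t.T @ H_t + lam * np.eye(d), H_t.T @ H_s)

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_ridge_fit(H_t, H_s, lam, gamma):
    """Dual coefficients alpha = (K + lam I)^{-1} H_s."""
    K = rbf_kernel(H_t, H_t, gamma)
    return np.linalg.solve(K + lam * np.eye(len(H_t)), H_s)

def kernel_ridge_map(H_new, H_t, alpha, gamma):
    """Map new target-domain rows into the source feature space."""
    return rbf_kernel(H_new, H_t, gamma) @ alpha

rng = np.random.default_rng(0)
H_t = rng.normal(size=(50, 6))            # paired target-domain features (synthetic)
G_true = rng.normal(size=(6, 4))
H_s = H_t @ G_true                        # paired source-domain features (synthetic)

G = ridge_mapping(H_t, H_s, lam=1e-8)     # recovers G_true on this noiseless toy data
alpha = kernel_ridge_fit(H_t, H_s, lam=1e-3, gamma=0.5)
mapped = kernel_ridge_map(H_t[:5], H_t, alpha, gamma=0.5)
```

The linear variant solves a system whose size is the feature dimension, while the kernel variant solves one whose size is the number of paired instances, so the cheaper choice depends on which of the two is smaller.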
After learning the multilevel features and mappings, for each source domain instance , by denoting as the representation of the th layer, one can define a new representation by augmenting the original features with high level features of all the layers to arrive at , where .
In heterogenous transfer learning, besides data pairs which come from the source and target domain, we always can collect a set of target domain unlabeled data , and a set of source domain labeled data . Based on HeterogenousSIDA trained by , we can obtain the augmented features of source data ; we then apply a standard classification (or regression/logic regression) algorithm on to train a target predictor .
For each target domain instance , the high level feature representation is first generated, where , then perform a feature mapping , and finally make prediction by .
3.4. SIDA for CrossModal Retrieval
Cross-modal retrieval is another important application of heterogeneous feature representation. Many cross-modal retrieval works address this issue by learning a common space for the feature spaces of two modalities. Rasiwasia et al. [21, 22] applied CCA to learn a common space between image and text co-occurrence data (an image and a text occurring in one document). Semantic matching (SM) [21, 22] uses logistic regression in the image and text feature spaces to extract semantically similar features to facilitate better matching. The bilinear model (BLM) [23] is a simple and efficient learning algorithm for bilinear models based on the familiar techniques of SVD and EM. LCFS [24] learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. GMLDA [25] adopts LDA under the multiview feature extraction framework. GMMFA [25] uses MFA for cross-modal retrieval under the multiview feature extraction framework.
In order to apply the proposed approach into crossmodal retrieval, instead of training a classifier, it needs to compute the similarity between crossmodal data in the common space. In particular, given a database of documents comprising image and text components, we consider the case where each document consists of a single image and its corresponding text; that is, , where image and text is represented as vectors in feature spaces and , respectively. The images and texts are processed by HeterogeneousSIDA and obtain multilayer weight matrix and their corresponding . Given a query image (text ), its representation can be obtained by (); crossmodal retrieval returns the text (image), represented by (), that minimizes the distance between and in the common space by some distance measure such as the distance [26], normalized correlation (NC) [27], and KullbackLeibler divergence (KL) [28, 29].
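Once queries and candidates live in a common space, retrieval reduces to ranking candidates by distance to the query. The sketch below uses a cosine-based distance as a simple stand-in for a normalized-correlation measure; the exact definitions used in [26–29] may differ, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; an all-zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def retrieve(query, candidates, distance=cosine_distance):
    """Indices of the candidates ordered from closest to farthest from the query."""
    return sorted(range(len(candidates)), key=lambda i: distance(query, candidates[i]))

# Toy common-space representations: one mapped image query, three mapped texts.
image_query = [1.0, 0.0]
texts = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranking = retrieve(image_query, texts)
```

Swapping in another distance only requires passing a different `distance` function, so the same ranking routine serves both image-to-text and text-to-image queries.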
4. Experimental Study
In this section, we present the experimental study of IDA on three popular machine learning tasks including text classification, crosslanguage sentiment classification, and imagetext crossmodal retrieval to verify the performance of IDA from multiple aspects.
4.1. Results on High Dimensional Sparse Data
In order to compare the performance of IDA and SIDA (including serial and parallel implementations) on high dimensional sparse data, we select two popular datasets, News20.bin and Rcv1.mul, as benchmarks. As detailed in Table 1, News20.bin contains 1,355,191 features, of which only about 0.0335% are nonzero, and Rcv1.mul has 47,236 features, of which only 0.14% are nonzero. All the parameters are determined through cross-validation. We use a simple linear SVM as the classifier. The experimental results, including classification accuracy (%) and training time (in seconds), are shown in Tables 2 and 3.
We can find that, with nearly the same performance, SIDA is significantly faster than SDAE, by up to one hundred times. For example, on News20.bin, SIDA (2-layer, 4500–1200) needs 503.45 seconds while SDAE needs about 18000 seconds, which is roughly 36 times as long.
When the autoencoders in SIDA are run in parallel, the speed improves by about 2 times. For example, on News20.bin, SIDA obtains a 2x speedup, while on Rcv1.mul the speedup is around 2.3 times. Our computer has a 4-core CPU, and no optimization strategy for parallel running is adopted; we simply change “for” to “parfor” in our Matlab implementation. SIDA can be expected to obtain a more significant advantage over SDAE on a more efficient distributed computing platform.
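The change from a serial loop to a parallel one can be illustrated in Python as well. The sketch below is only an analogue of the Matlab “for” to “parfor” switch: `train_one` is a hypothetical placeholder for fitting one independent autoencoder, and the use of a thread pool (rather than processes or a cluster) is an assumption made to keep the example self-contained.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def train_one(seed):
    """Hypothetical stand-in for training one independent autoencoder.

    It only returns a seed-dependent number; a real worker would draw its own
    corrupted copy of the data and fit an encoder on it.
    """
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

seeds = range(4)

# Serial version: an ordinary loop over the ensemble members.
serial = [train_one(s) for s in seeds]

# Parallel version: the same calls dispatched to a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(train_one, seeds))
```

Because the ensemble members share no state, the parallel run returns the same results as the serial one; only the scheduling changes.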
4.2. CrossModal Retrieval on Wikipedia Data
The Wikipedia dataset (http://www.svcl.ucsd.edu/projects/crossmodal/), which has 2866 image-text pairs, is a challenging image-text dataset with large intraclass variations and small interclass discrepancies. The text of each article describes people, places, or events that are closely related to the content of the corresponding image. There are 10 semantic categories in the Wikipedia dataset: art & architecture, geography & places, history, literature & theatre, biology, media, music, sports & recreation, royalty & nobility, and warfare, as shown in Table 4. Here we follow the data partitioning procedure adopted in [21, 22], where the original dataset is split into a training set of 2173 pairs and a testing set of 693 pairs. Then, we evaluate our proposed method against the following state-of-the-art cross-modal retrieval approaches.
(i) Correlation Matching (CM) [21, 22]. This method applies CCA to learn a common space in which the likelihood that two data items from different modalities represent the same semantic concept can be measured.
(ii) Semantic Matching (SM) [21, 22]. This method applies logistic regression in the image and text feature spaces to extract semantically similar features to facilitate better matching.
(iii) Semantic Correlation Matching (SCM) [21, 22]. This method applies logistic regression in the space of CCA projected coefficients (a two-stage learning process).
(iv) Bilinear Model (BLM) [23]. This method is a suite of simple and efficient learning algorithms for bilinear models, based on the familiar techniques of SVD and EM.
(v) Learning Coupled Feature Spaces (LCFS) [24]. This method learns two projection matrices to map multimodal data into a common feature space in which cross-modal data matching can be performed.
(vi) Generalized Multiview Linear Discriminant Analysis (GMLDA) [25]. This method applies LDA within the multiview feature extraction framework.
(vii) Generalized Multiview Marginal Fisher Analysis (GMMFA) [25]. This method applies MFA within the multiview feature extraction framework.
We use mean average precision (MAP) to measure the retrieval performance [30]. Two tasks are considered: text retrieval based on an image query and image retrieval based on a text query. In the first case, each image is used as a query and produces a ranking of all texts. In the second, the roles of images and texts are reversed. The scores for text retrieval from an image query, image retrieval from a text query, and their average are presented in Table 5. From the results obtained, the following conclusions can be drawn: (i) The proposed method is superior to simple random retrieval, which forms the baseline for comparison. (ii) The proposed method outperforms PCA, BLM, GMMFA, GMLDA, LCFS, CM, SM, and SCM [21, 22] on image retrieval given a text query and vice versa.
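For reference, mean average precision can be computed as in the sketch below. Each query yields a ranked list of relevance flags (1 if the retrieved item is relevant, for example if it shares the query's category, and 0 otherwise); treating category agreement as the relevance criterion is an assumption for this example, and the toy rankings are made up.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank taken at the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the average precision over all queries."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy queries: relevance flags of the retrieved items, in ranked order.
rankings = [[1, 0, 1], [0, 1]]
map_score = mean_average_precision(rankings)
```

For the first toy query the relevant items appear at ranks 1 and 3, giving (1/1 + 2/3)/2; for the second the only relevant item is at rank 2, giving 1/2.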
Figure 5 shows several example image queries and the images corresponding to the top texts retrieved by HeterogenousSIDA. Due to page limitations, we only present the ground-truth images. The query images are framed in Figure 5(a), and the images associated with the four best text matches are shown in Figure 5(b). Comparing the category and text content shows that each of the top-4 retrieved texts contains one or more words relevant to the image query or belongs to the category of the query image.
Figure 5: (a) Query images. (b) Images associated with the four best text matches.
Figure 6 depicts two examples of text queries and the corresponding retrieval results using HeterogenousSIDA. Each text query is presented along with its corresponding ground-truth image. The top five retrieved images are shown below the text. By comparing the category and text content, we can find that HeterogenousSIDA retrieves these images correctly, since they belong to the category of the query text (“history” at the top, “sports” at the bottom) or the corresponding text contains one or more words relevant to the text query.
Figure 7 shows the MAP scores achieved per category by the proposed method and its state-of-the-art counterparts, SM, CM, and SCM [21, 22]. Note that, on most categories, the MAP of our method is competitive with those of CM, SM, and SCM.
Figure 7: MAP scores per category. (a) Image query. (b) Text query. (c) Average performance.
4.3. Transfer Learning Results
In this section, we present further studies on the performance of IDA for a transfer learning task: cross-language classification. In particular, the cross-language sentiment dataset [31] is considered here. This dataset comprises Amazon product reviews in three product categories: Books (B), DVDs (D), and Music (M). These reviews are written in four languages: English (EN), German (GE), French (FR), and Japanese (JP).
For each language, the reviews are split into training and testing sets, including 2,000 reviews per category. We use the English reviews in the training set as the source domain labeled data and the non-English reviews (in each of the other 3 languages) in the training set as target domain unlabeled data. Further, we use the Google translator on the non-English reviews in the testing set to construct the cross-domain (English versus non-English) unlabeled parallel data. The performance of all methods is then evaluated on the target domain unlabeled data.
Here we focus on cross-language cross-category learning between English and the other 3 languages (German, French, and Japanese). This is a more challenging task than cross-language learning alone. For a comprehensive comparison, we constructed 18 cross-language cross-category sentiment classification tasks as follows:
(i) ENBFRD and ENBFRM.
(ii) ENBGED and ENBGEM.
(iii) ENBJPD and ENBJPM.
(iv) ENDFRB and ENDFRM.
(v) ENDGEB and ENDGEM.
(vi) ENDJPB and ENDJPM.
(vii) ENMFRB and ENMFRD.
(viii) ENMGEB and ENMGED.
(ix) ENMJPB and ENMJPD.
For example, the task ENBFRD uses all the Books reviews in French in the testing dataset and their English translations as the parallel dataset, the DVDs reviews in French as the target language testing dataset, and the original English Books reviews as the source domain labeled data. We compare the proposed method with the following baselines.
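The 18 task names follow a regular pattern, which the short sketch below enumerates. It assumes that every task pairs an English source category with a different target category, as “cross-category” implies, and that a name is formed by concatenating the source language, source category, target language, and target category codes; the function name is hypothetical.

```python
CATEGORIES = ["B", "D", "M"]            # Books, DVDs, Music
TARGET_LANGUAGES = ["FR", "GE", "JP"]   # French, German, Japanese

def build_tasks():
    """Enumerate source-category / target-language / target-category combinations."""
    tasks = []
    for source_category in CATEGORIES:
        for language in TARGET_LANGUAGES:
            for target_category in CATEGORIES:
                if target_category != source_category:   # cross-category only
                    tasks.append(f"EN{source_category}{language}{target_category}")
    return tasks

tasks = build_tasks()
```

Three source categories, three target languages, and two remaining target categories give 3 × 3 × 2 = 18 distinct tasks.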
(i) SVMSC [15]. This method first trains a classifier on the source domain labeled data and then predicts the source domain parallel data. By using the correspondence, the predicted labels for source parallel data can be transferred into target parallel data. Next, it trains a model on the target parallel data with predicted labels to make predictions on the target domain test data.
(ii) CLKCCA [32]. This method applies cross-lingual kernel canonical correlation analysis on the unlabeled parallel data to learn two projections for the source and target languages and then trains a monolingual classifier with the projected source domain labeled data.
(iii) HeMap [16]. This method applied heterogeneous spectral mapping to learn mappings to project two domain data items onto a common feature subspace. However, HeMap does not take the instance correspondence information into consideration.
(iv) mSDACCA [33]. This method adopts mSDA to learn a shared feature representation and conducts CCA on the correspondences between domains in the same layers.
(v) HHTL [15]. This method is our previous work where mSDA is applied to learn the deep learning structure as well as the feature mappings between crossdomain heterogeneous features to reduce the bias issue caused by the crossdomain correspondences.
The testing accuracy (%) is summarized in Table 6. HHTL and HeterogenousSIDA share the same framework; the only difference is the embedded autoencoder (in HHTL the embedded autoencoder is mSDA). By comparing these two algorithms, we can directly compare the performance of SIDA and mSDA on high dimensional and sparse data. The experimental results of our method and HHTL with the same number of layers show that, whether for 1 layer or 3 layers, our method produces much better performance than HHTL. This shows that the proposed autoencoder can learn useful higher-level features to alleviate the distribution bias with the same number of layers. Additionally, the training time of these algorithms is reported in Table 7. We can find that the proposed algorithm is faster than HHTL. For example, for 3 layers, our method is faster than HHTL() in most cases. Compared with the other 4 transfer learning methods, including SVMSC, CLKCCA, HeMap, and mSDACCA, the proposed method is also very competitive in terms of testing accuracy and training time. Because of its deep structure, our method with multiple layers is not the fastest algorithm, but it significantly improves prediction accuracy over the counterpart algorithms. This benefits from the more appropriate high level features of SIDA and better cross-domain knowledge transfer in each layer.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This work is supported by National Science Foundation of China 61572399, 61373116, and 61272120; Shaanxi New Star of Science & Technology 2013KJXX29; New Star Team of Xian University of Posts & Telecommunications; Provincial Key Disciplines Construction Fund of General Institutions of Higher Education in Shaanxi.