Abstract

Because of missing values, incomplete datasets are ubiquitous in multimodal settings, yet most existing multimodal data fusion methods require complete data. We propose a feature selection and classification method for incomplete multimodal high-dimensional data. The method extracts the most relevant features from the high-dimensional feature space in order to improve classification accuracy. Experiments on the ADNI and Office datasets show that the method performs considerably better when it uses incomplete multimodal data than when it is restricted to complete data.

1. Introduction

In the Internet era, data come in many different modalities, such as images, video, and text. Different modalities can provide complementary information; multimodal classification therefore generally outperforms single-modality classification in accuracy and reliability. The diagnosis of Alzheimer’s Disease (AD) by multimodal classification is a good example and has achieved remarkable success compared with single-modality methods in multiple experiments. Pang et al. [1] explored the possibility of improving emotion prediction by exploiting highly nonlinear relationships between low-level features in different modalities. Zhang et al. [2] combined three modalities of biomarkers (structural MR imaging (MRI), Positron-Emission Tomography (PET), and cerebrospinal fluid (CSF)) to discriminate AD (or mild cognitive impairment (MCI)) from healthy controls. Pang et al. [3] used multilabel multiple-kernel learning with visual and textual features for multilabel image classification. Hu et al. [4] utilized multimodality data including both tag features and visual features for popularity prediction on social media. Ballard [5] proposed a multimodal learning interface that could learn words from natural interactions with users. Liu et al. [6] presented a multihypergraph learning (MHL) method for multimodality data, which achieved promising results in AD/MCI classification. Zhang et al. [7] proposed multimodal multitask learning to jointly predict multiple variables from multimodal data. Liu et al. [8] proposed a linearized and kernelized sparse multitask learning method for predicting cognitive outcomes in Alzheimer’s Disease. Li et al. [9, 10] proposed a multitask deep learning method for diagnosing Alzheimer’s Disease by combining MRI, PET, and Assessment Scale-Cognitive subscale (ADAS-Cog) with the restricted Boltzmann machine. Wang et al. [11] presented a novel multimodality multicenter classification method for autism spectrum disorder diagnosis; they regarded the classification of each imaging center as one task and solved the classification for all imaging centers by introducing task-task and modality-modality regularizations. Liu et al. [12] proposed a view-aligned hypergraph learning (VAHL) method and utilized incomplete multimodality data for AD/MCI diagnosis.

Most existing multimodal data fusion methods require complete data. Because complete data requires every sample to have the same type and number of modalities, it is rare in practice. In the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, for example, only about 1/3 of the samples contain complete MRI, PET, and CSF data at baseline. Incomplete multimodality data is usually handled either by imputing the missing values [13, 14] or by discarding incomplete samples; the former can introduce unpredictable noise, and the latter wastes data. To address incomplete multimodality data, Zhao et al. [15] proposed an unsupervised method that transforms the original, incomplete data into a new and complete representation in a latent space. Thung et al. [16] applied matrix shrinkage and completion to an incomplete multimodal dataset to identify AD patients. Li et al. [17] presented pioneering work on the two-modal incomplete data case, projecting the partial data into a common latent subspace via nonnegative matrix factorization (NMF) and a sparse regularizer. Following this line, Shao et al. [18] proposed a similar idea based on weighted NMF and a regularizer.

Most existing methods for incomplete multimodality data are inefficient on high-dimensional data. Motivated by this, we propose a feature selection and classification method for incomplete multimodal high-dimensional data. Our method has the following features: (1) It focuses on incomplete data and makes full use of the data from different modalities without wasting samples. (2) It selects the most relevant features in the high-dimensional space and facilitates the discovery of the inherent relationships between features. (3) It achieves better classification accuracy than the methods it is compared with.

The rest of the paper presents the details of the proposed approach, experiments on several datasets, and a comparison between our method and current state-of-the-art methods.

2. Multimodal High-Dimensional Feature Selection and Classification

2.1. The Framework

Figure 1 illustrates our feature selection and classification framework for incomplete multimodal high-dimensional data. It contains three parts: polynomial kernel explicit expansion, feature selection, and classification. We first use the explicit polynomial kernel mapping to expand the original low-order features into high-order features. We then identify high-order feature subsets by feature selection on the incomplete multimodality data. Next, we classify AD patients and healthy controls. Finally, we integrate the classifiers and report the final classification accuracy.
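The final integration step can be illustrated with a minimal sketch. The text does not specify the integration rule, so the averaging of per-modality decision scores used here is an assumption, as are the function name `fuse_modalities`, the use of NaN to mark a missing modality, and the label convention +1/-1.

```python
import numpy as np


def fuse_modalities(scores):
    """Average per-modality decision scores, ignoring missing entries (NaN).

    scores: array of shape (n_samples, n_modalities); NaN marks a modality
    that is not available for that sample (assumed convention).
    Returns labels in {+1, -1} (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    available = ~np.isnan(scores)
    counts = available.sum(axis=1)
    if (counts == 0).any():
        raise ValueError("every sample needs at least one observed modality")
    # Sum only the observed scores, then divide by how many were observed.
    fused = np.where(available, scores, 0.0).sum(axis=1) / counts
    return np.where(fused >= 0, 1, -1)


# Hypothetical scores for three samples and three modalities.
example_scores = [
    [0.8, np.nan, 0.2],
    [-0.4, -0.1, np.nan],
    [np.nan, 0.3, -0.9],
]
fused_labels = fuse_modalities(example_scores)
```

Every sample contributes through whichever modalities it has, so no sample is dropped because one modality is missing.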

2.2. The Explicit Mapping

In a set of data samples , is the total number of samples. Each sample has modalities; i.e., . For incomplete multimodality data, we define as an incomplete modal sample set. In , , represent the number of missing samples and the feature dimension in the modalities, respectively. is the corresponding label of , and label , and is the number of data category. Since complete data discard a lot of precious data, the size of complete data is relatively small. As previously indicated, incomplete data is a collection of the various modal data subsets.

Feature selection in the original (first-order) feature space has difficulty revealing high-order dependencies between features. Because incomplete data is limited in both the number of samples and the feature dimension, we need more features to discover the correlations between them. Different kernel functions could be used to extend the data, for example, a linear kernel or a Gaussian kernel. The linear kernel treats the data linearly, which makes it difficult to reflect the correlations between features and may lose information. The Gaussian kernel is too expensive to compute. Therefore, we transform the low-dimensional features into high-dimensional features by an explicit nonlinear kernel expansion, which reveals high-order correlations between features.

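For degree-$d$ polynomials, the polynomial kernel is defined as

$$K(x, z) = (x^{\top} z + c)^{d}, \tag{1}$$

where $x$ and $z$ are vectors in the input space, and $c \ge 0$ is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. As a kernel, $K$ corresponds to an inner product in a feature space based on some mapping $\varphi$:

$$K(x, z) = \langle \varphi(x), \varphi(z) \rangle. \tag{2}$$

Let $d = 2$, and we get the special case of the quadratic kernel. After using the multinomial theorem and regrouping,

$$K(x, z) = \Big(\sum_{i=1}^{n} x_i z_i + c\Big)^{2} = \sum_{i=1}^{n} x_i^{2} z_i^{2} + \sum_{i=2}^{n} \sum_{j=1}^{i-1} \big(\sqrt{2}\, x_i x_j\big)\big(\sqrt{2}\, z_i z_j\big) + \sum_{i=1}^{n} \big(\sqrt{2c}\, x_i\big)\big(\sqrt{2c}\, z_i\big) + c^{2}. \tag{3}$$

From this, the explicit feature mapping of the polynomial kernel is

$$\varphi(x) = \big(x_n^{2}, \ldots, x_1^{2},\ \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_2 x_1,\ \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1,\ c\big). \tag{4}$$

Compared with the linear case, the second-order feature map contains the dependency of each feature pair. The key problem of this explicit feature mapping is that its features are high dimensional in the extended feature space. For polynomial kernel expansion, the dimension of the feature map increases exponentially with the degree $d$. When $d = 2$ and $n$ is the original feature dimension, the extended dimension is $(n+1)(n+2)/2$. Generally, for degree $d$ the extended dimension is $\binom{n+d}{d}$, which is approximately $n^{d}/d!$ when $n \gg d$.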
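The quadratic case can be checked numerically. The sketch below builds the explicit degree-2 map (squares, scaled pairwise products, scaled linear terms, and the constant) and compares its inner product with the kernel value $(x^{\top} z + c)^2$. The function name `quadratic_feature_map` and the random test vectors are illustrative assumptions, and the ordering of the terms differs from any particular written form of the map, which does not affect the inner product.

```python
import numpy as np


def quadratic_feature_map(x, c=1.0):
    """Explicit feature map whose inner product equals (x.z + c)**2."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    squares = x ** 2                        # n terms: x_i^2
    i, j = np.triu_indices(n, k=1)          # all index pairs with i < j
    cross = np.sqrt(2.0) * x[i] * x[j]      # n(n-1)/2 terms: sqrt(2) x_i x_j
    linear = np.sqrt(2.0 * c) * x           # n terms: sqrt(2c) x_i
    return np.concatenate([squares, cross, linear, [c]])


rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)
c = 1.0

kernel_value = (x @ z + c) ** 2
explicit_value = quadratic_feature_map(x, c) @ quadratic_feature_map(z, c)
mapped_dim = quadratic_feature_map(x, c).shape[0]   # (5 + 1) * (5 + 2) / 2 = 21
```

Working with the explicit map rather than the kernel matrix is what allows individual high-order features to be selected afterwards, at the cost of a much larger feature vector.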

2.3. Feature Selection and Classification

First, we introduce a feature selection vector , whose entries are 1 for selected features and 0 otherwise. Let be the domain of . We use to control the sparsity of the feature selection, where B controls the number of selected features; the proposed problem can then be written as problem (5), in which the constant is a regularization parameter that trades off model complexity against the fit of the feature selection.

By introducing the dual variable , , the Lagrangian function of (5) can be written as (6), in which denotes the inner product. By setting the derivatives of with respect to and to zero, we obtain the Karush-Kuhn-Tucker (KKT) conditions, and . Substituting these results into the Lagrangian function transforms problem (5) into the dual formulation (7), where and .

Since the feature selection vector is a zero-one vector, this is still a nonconvex problem. Following the convex relaxation in [19], we obtain (8). By introducing an additional variable , the above problem can be converted into the convex quadratically constrained quadratic program (QCQP) problem (9).

Problem (9) is very hard to solve directly because it has an infinite number of quadratic inequality constraints, so we solve it with the cutting plane algorithm [20]. We generate an active constraint and add it to an active constraint set , which is initialized to the empty set . The active constraint set is a subset of ; i.e., . Based on the new active constraint set , we solve a QCQP problem to update . Specifically, we solve problem (10). The cutting plane algorithm is presented in Algorithm 1.

Algorithm 1: The cutting plane algorithm.
(1) Initialize and the constraint subset . Let iteration index .
(2) Find the most active constraint , update set by .
(3) Update by solving the problem (10).
(4) Let . Repeat step - until convergence.

2.3.1. Learning

The cutting plane algorithm mainly deals with how to find the most active constraint of problem (10) at the th iteration. Let , and the optimization problem becomes problem (11). Due to , problem (11) can be solved by sorting and then finding the largest .
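The sorting step can be sketched as follows. The per-feature score is not shown in the text; the sketch assumes the score commonly used in cutting-plane feature selection, $c_j = (\sum_i \alpha_i y_i x_{ij})^2$, and selects the B features with the largest scores. The function name `most_active_constraint` and the toy data are hypothetical.

```python
import numpy as np


def most_active_constraint(X, y, alpha, B):
    """Return a 0/1 selection vector marking the B highest-scoring features.

    Assumed score (not given in the text):
        c_j = (sum_i alpha_i * y_i * X[i, j]) ** 2
    """
    scores = (X.T @ (alpha * y)) ** 2
    top = np.argsort(scores)[::-1][:B]      # indices of the B largest scores
    d = np.zeros(X.shape[1], dtype=int)
    d[top] = 1
    return d, scores


# Toy data: 2 samples, 3 features.
X_toy = np.array([[3.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
y_toy = np.array([1.0, 1.0])
alpha_toy = np.array([1.0, 1.0])

d_toy, scores_toy = most_active_constraint(X_toy, y_toy, alpha_toy, B=2)
```

The cost is one pass over the data to compute the scores plus a sort, which is why this step remains cheap even after the polynomial expansion.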

2.3.2. The Optimization of

After updating the active constraint set , we solve the problem in (10) with the constraints defined by . Because the number of constraints in is no longer large, we can use a subgradient method to solve this problem. However, obtaining the dual variables is very expensive when is very large.

In view of the above problems, problem (10) can be solved in the primal form (12), where and .

For convenience, we define and . We solve the primal problem by using the accelerated proximal gradient (APG) method [21], which minimizes the quadratic approximation (13) of (12), in which denotes the gradient of at point , denotes the Lipschitz constant, and . We then need to solve the Moreau Projection problem (14). Problem (14) has a unique closed-form solution, which can be computed via Moreau Projection [22].

Suppose is the optimal solution to problem (14) and is an intermediate variable. Then is unique and can be calculated as in (15), in which . The intermediate vector can be calculated via the soft-threshold operator of [22, 23] as in (16), and the threshold value can be calculated as in step of Algorithm 2.

Algorithm 2: Moreau Projection.
Given input and .
(1) Calculate for all .
(2) Sort to obtain such that .
(3) Find
(4) Compute the threshold value .
(5) Calculate .
(6) Compute and output .
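Steps (1) to (6) can be sketched in code under one stated assumption: that problem (14) has the form $\min_{w} \tfrac{1}{2}\lVert w - u \rVert^2 + \tfrac{s}{2}\big(\sum_t \lVert w_t \rVert\big)^2$ over groups $w_t$, with $s$ the inverse of the Lipschitz constant. The text does not show this form explicitly, so the symbols, the function name `moreau_projection`, and the example values are illustrative.

```python
import numpy as np


def moreau_projection(u_groups, s):
    """Minimise 0.5*||w - u||^2 + 0.5*s*(sum_t ||w_t||)^2 over grouped w.

    u_groups: list of 1-D float arrays, one per group.
    s: positive scalar (assumed to be 1 / Lipschitz constant).
    Returns the list of shrunk groups and the threshold value.
    """
    norms = np.array([np.linalg.norm(g) for g in u_groups])    # step (1)
    sorted_norms = np.sort(norms)[::-1]                        # step (2)
    cumsum = np.cumsum(sorted_norms)
    t = np.arange(1, len(norms) + 1)
    positive = sorted_norms - s / (1.0 + t * s) * cumsum > 0   # step (3)
    if not positive.any():                  # happens only if every group is zero
        return [np.zeros_like(g) for g in u_groups], 0.0
    rho = t[positive].max()
    threshold = s / (1.0 + rho * s) * cumsum[rho - 1]          # step (4)
    shrunk = np.maximum(norms - threshold, 0.0)                # step (5)
    w_groups = [                                               # step (6)
        (o / n) * g if n > 0 else np.zeros_like(g)
        for o, n, g in zip(shrunk, norms, u_groups)
    ]
    return w_groups, threshold


s_example = 0.5
u_example = [np.array([3.0, 4.0]), np.array([1.0, 0.0]), np.array([0.1, 0.0])]
w_example, thr_example = moreau_projection(u_example, s_example)
```

In this example only the first group survives the threshold, and the threshold equals `s_example` times the sum of the output group norms, which is the optimality condition of the assumed problem.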

The overall APG algorithm for solving problem (14) is summarized in Algorithm 3. We can obtain from APG and then predict the results of each modality by our method. It can be expressed as . Eventually, the integration of the prediction result will enable us to do classification.

Algorithm 3: The accelerated proximal gradient (APG) algorithm.
Initialization: Initialize the Lipschitz constant and set by warm start, , , parameter , and .
(1) Set .
(2) Set .
    For ,
        Set , compute .
        if ,
            set , stop;
        else
            .
        end if
    end for
(3) Set .
(4) Compute . Let .
(5) Quit if the stopping condition is met. Otherwise, go to step (1).
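The structure of Algorithm 3 (a backtracking search on the Lipschitz constant, a proximal step, and a momentum update) can be seen in a generic accelerated proximal gradient loop. The sketch below is not the objective in (12): it swaps in a standard lasso problem, $\tfrac{1}{2}\lVert Aw - b\rVert^2 + \lambda \lVert w \rVert_1$, whose proximal step is plain soft-thresholding. All names and parameter defaults are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise soft-threshold operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def apg_lasso(A, b, lam, L0=1.0, eta=2.0, max_iter=500, tol=1e-10):
    """Accelerated proximal gradient for 0.5*||A w - b||^2 + lam*||w||_1."""
    def smooth(w):
        r = A @ w - b
        return 0.5 * r @ r

    w = np.zeros(A.shape[1])
    v = w.copy()                  # extrapolated point
    L, t = L0, 1.0
    for _ in range(max_iter):
        grad = A.T @ (A @ v - b)
        while True:               # backtracking: grow L until the model is an upper bound
            w_new = soft_threshold(v - grad / L, lam / L)
            diff = w_new - v
            bound = smooth(v) + grad @ diff + 0.5 * L * diff @ diff
            if smooth(w_new) <= bound + 1e-12:
                break
            L *= eta
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = w_new + ((t - 1.0) / t_new) * (w_new - w)   # momentum step
        converged = np.linalg.norm(w_new - w) <= tol * max(1.0, np.linalg.norm(w))
        w, t = w_new, t_new
        if converged:
            break
    return w


# With A = I the minimiser is soft_threshold(b, lam), which gives a simple check.
b_check = np.array([3.0, -0.5, 1.2])
w_check = apg_lasso(np.eye(3), b_check, lam=1.0)
```

For the method described here, the soft-thresholding call would be replaced by the Moreau Projection of Algorithm 2 and the least-squares term by the smooth part of (12).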

3. Performance Evaluation

3.1. Datasets

We evaluate our method on the ADNI and Office datasets. ADNI was launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations. The primary purpose of the ADNI project was to study the effects of combining multiple biomarkers, such as MRI, PET, and CSF data, together with neuropsychological assessments, to predict the progression of MCI and early AD. We employ a 3-modality (MRI, PET, and CSF) dataset of 103 subjects, comprising 51 AD patients and 52 healthy controls. The Mini-Mental State Examination (MMSE) is a standardized cognitive impairment examination used for screening Alzheimer’s Disease. Subjects with MMSE scores between 24 and 30 and a Clinical Dementia Rating (CDR) of 0 are designated as healthy controls; subjects with MMSE scores between 20 and 26 and a CDR of 0.5 or 1.0 are considered AD patients. Table 1 lists the demographics of these subjects.

The multimodality data has 189 features per subject: 93 from the MRI image, 93 from the PET image, and 3 from the CSF biomarkers. This feature dimension is relatively small, so we use the nonlinear explicit expansion to enlarge it. With each expanded term treated as one dimension, the 189 features become 8940 features. The expanded features capture high-order combinations of the original disease-related features.
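The figure of 8940 is consistent with applying a degree-2 explicit polynomial expansion to each modality separately, since such an expansion of n features has (n+1)(n+2)/2 terms. This per-modality reading is inferred from the numbers rather than stated in the text, and the short sketch below only checks the arithmetic.

```python
def quadratic_expansion_dim(n):
    """Number of terms in the degree-2 explicit polynomial map of n features."""
    return (n + 1) * (n + 2) // 2


# Feature counts per modality as reported for the ADNI data.
modality_dims = {"MRI": 93, "PET": 93, "CSF": 3}
expanded_dims = {name: quadratic_expansion_dim(n) for name, n in modality_dims.items()}
total_dim = sum(expanded_dims.values())

print(expanded_dims)  # {'MRI': 4465, 'PET': 4465, 'CSF': 10}
print(total_dim)      # 8940
```

Expanding all 189 features jointly would instead give 190 * 191 / 2 = 18145 terms, which does not match the reported count.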

The Office dataset contains three domains: amazon (images downloaded from the Internet), webcam (low-resolution images captured by web cameras), and dslr (high-resolution images taken with digital SLR cameras). Each domain has 10 object classes. Surf and Decaf features are extracted for all images; Decaf-LeNet and Decaf-AlexNet denote the Decaf features obtained by training the LeNet and AlexNet models, respectively. The Surf features have 800 dimensions, and the Decaf features from both LeNet and AlexNet have 4096 dimensions. Table 2 summarizes the Office dataset. We expand these features into 180902-dimensional features by applying the nonlinear explicit polynomial expansion.

3.2. Results on ADNI

We first use a 10-fold cross-validation strategy to classify AD patients and healthy controls with single modalities. We select 29 samples as training data and 10 samples as testing data from the ADNI dataset. For robustness and repeatability, this process is repeated 10 times, and the average classification accuracy is reported as the final accuracy. The results are shown in Table 3. For complete data, the classification accuracies on the individual modalities MRI, PET, and CSF are 83.50%, 77.50%, and 78.70%, respectively. The combination of MRI and PET gives an accuracy of 83.30%, and the combination of PET and CSF gives only 81.00%. Combining all three biomarkers (MRI, PET, and CSF) achieves a classification accuracy of 81.40%.

Furthermore, because only a fraction of the subjects have all modalities, the incomplete dataset is larger than the complete one. Our multimodal classification method achieves a classification accuracy of 91.10% on incomplete data, whereas the accuracy on complete data is only 81.40%. As Table 3 shows, incomplete data yields much better performance than complete data in classifying AD patients and healthy controls. Incomplete data is also more flexible than complete data, because it uses all valuable samples and wastes no data.

Figure 2 plots the classification accuracy on complete and incomplete data over different iterations. The classification accuracies on incomplete data are better than those on complete data.

As mentioned in Section 2.3, B controls the sparsity of feature selection and has an important effect on the feature selection process. Figure 3 shows that, when MRI is used, different values of B produce different classification accuracies, so the choice of B strongly affects the classification accuracy. In Figure 3, when , the mean classification accuracy is higher than for the other values. Therefore, we choose . So far, our method demonstrates much better performance on incomplete data.

In Table 4, we compare the proposed method with other methods on incomplete multimodality data, including the domain transfer support vector machine (DTSVM) [24] and the multiple-kernel learning method (MKL) proposed in [2] with Lasso as the feature selection step. Table 4 lists the results of the different methods for AD and healthy control (HC) classification.

Because our method uses the nonlinear explicit kernel expansion to map features into a high-dimensional feature space, it is better at revealing high-order correlations between features. As Table 4 shows, our method outperforms the other methods for AD and HC classification, achieving a classification accuracy of 91.10% with 90.00% sensitivity and 91.38% specificity. These results further validate the effectiveness of our multimodal classification method.
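For reference, accuracy, sensitivity, and specificity can be computed from predictions as in the sketch below. It assumes AD is the positive class (label 1) and HC the negative class (label 0); the labels and predictions in the example are hypothetical and are not the study's data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate).

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical example: 4 AD subjects (1) and 6 healthy controls (0).
y_true_example = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics_example = classification_metrics(y_true_example, y_pred_example)
```

Reporting sensitivity and specificity alongside accuracy shows whether errors fall mainly on patients or on controls, which accuracy alone hides.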

3.3. Results on Office Dataset

In this section, we evaluate our method on the Office dataset, which includes three modalities: Surf, Decaf-LeNet, and Decaf-AlexNet. We first conduct image classification with our method on the different modalities. We then compare the classification accuracy on incomplete and complete multimodal data for amazon, dslr, and webcam, respectively. In the experiments, we expand the feature dimension to 180902.

We test the classification performance on the different datasets. Tables 5 to 7 compare the classification accuracy on incomplete and complete multimodality data for amazon, dslr, and webcam, respectively. As Tables 5 to 7 show, the classification performance on incomplete multimodality data is better than that on complete multimodality data. Our method maps the low-order features to a high-dimensional space, which helps to discover nonlinearly related features. Incomplete data not only makes the best use of the precious samples but also exploits the inherent relations and knowledge across all modalities.

4. Conclusion

We proposed a feature selection and classification method for incomplete multimodal high-dimensional data. The method achieves considerably better classification performance on incomplete data than on complete data, because incomplete data is more flexible: the method uses all valuable samples and wastes no data. In addition, the method focuses on extracting the relevant features from the feature space of incomplete multimodal data. We used the ADNI and Office datasets to verify its performance, and the results show that it outperforms the state-of-the-art methods it was compared with. A limitation of this study is that it considers only a simple model; we will extend the method to more complex models in future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by National Natural Science Foundation of China Grant nos. 61572399, 61721002, 61532015, and 61532004; National Key Research and Development Program of China with Grant no. 2016YFB1000903; Shaanxi New Star of Science & Technology Grant no. 2013KJXX-29; New Star Team of Xian University of Posts & Telecommunications; Provincial Key Disciplines Construction Fund of General Institutions of Higher Education in Shaanxi.