Twin Support Vector Machine for Multiple Instance Learning Based on Bag Dissimilarities
In the multiple instance learning (MIL) framework, an object is represented by a set of instances referred to as a bag. A positive class label is assigned to a bag if it contains at least one positive instance; otherwise the bag is labeled with the negative class label. Therefore, the task of MIL is to learn a classifier at the bag level rather than at the instance level. Traditional supervised learning approaches cannot be applied directly in this situation. In this study, we represent each bag by a vector of its dissimilarities to the other existing bags in the training dataset and propose a multiple instance learning based Twin Support Vector Machine (MIL-TWSVM) classifier. We have used different ways to represent the dissimilarity between two bags and performed a comparative analysis of them. The experimental results on ten benchmark MIL datasets demonstrate that the proposed MIL-TWSVM classifier is computationally inexpensive and competitive with state-of-the-art approaches. The significance of the experimental results has been tested by using the Friedman statistic and Nemenyi post hoc tests.
Standard pattern recognition problems consider that the objects are represented as a single feature vector which contains sufficient information for the recognition of these objects. However, some complex objects exist in the real world which are difficult to represent by using a single feature vector; that is, single feature vector representation of an object is not sufficient for its separability, for example, a document with several paragraphs, an image containing many regions, each with different characteristics, and a drug with various conformations of a molecule. Traditional supervised learning techniques handle such kind of problems by representing complex objects using single feature vector. This reduction may lose significant information which further degrades the performance of supervised learning techniques. A set of feature vectors or multiple instances representation can be used for the better understanding of complex object [1, 2]. Multiple instances representation of a complex object can preserve more information about it. MIL is a variation of supervised learning in which a classifier is trained on a set of instances known as bag instead of individual instance. The objective of MIL approaches is to predict the class label for a bag. A bag may have different number of instances and may belong to positive or negative class label. Positive class label is assigned to a bag if it contains at least one positive instance while negative class label is assigned to a bag when all of its instances are negative [1, 3–5]. Figure 1 illustrates the framework of single instance learning and multiple instance learning.
Figure 1: Frameworks of (a) single instance learning and (b) multiple instance learning.
From Figure 1, it is observed that, in single instance learning, each object is represented by a single feature vector or instance and the classifier learns at instance level by assigning the class label to each instance individually. However, in MIL framework, an object is represented by a set of feature vectors or a bag and the classifier trains at bag level instead of instance level and predicts the class label of a bag instead of an instance.
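The labeling rule described above can be written in a few lines. The following is a minimal sketch, assuming instance labels are encoded as +1 and -1; the function name `bag_label` and the toy bags are hypothetical and serve only as illustration. In a real MIL training set the instance labels are hidden and only the bag label is observed, so the sketch shows the assumed relation between the two levels rather than a step of any learning algorithm.

```python
from typing import Dict, List, Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Standard MIL assumption: a bag is positive (+1) if at least one of its
    instances is positive, and negative (-1) only if all instances are negative."""
    return 1 if any(label == 1 for label in instance_labels) else -1


# Hypothetical toy bags with a different number of instances each.
toy_bags: Dict[str, List[int]] = {
    "bag_1": [-1, -1, 1],   # one positive instance
    "bag_2": [-1, -1],      # all instances negative
    "bag_3": [1, 1, 1, 1],  # all instances positive
}

toy_bag_labels = {name: bag_label(labels) for name, labels in toy_bags.items()}
print(toy_bag_labels)
```

The asymmetry of the rule is the point of the example: a single positive instance is enough to make a bag positive, whereas a negative bag label constrains every instance in the bag.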
The term MIL was first used for the drug activity prediction problem (Musk odor prediction) . Later on, it is widely used by the researchers to solve various real world problems like image annotation [6–9], document categorization [6, 10, 11], object detection [12, 13], human action recognition , visual tracking [15–18], spam filtering , and many other problems. Several MIL approaches have been proposed by the researchers which can be broadly categorized into two groups. The first category, also known as bag-based method, works only on the bag label without having any knowledge of each instance label. Thus the bag labels can be predicted by converting a bag into a single instance representation and using supervised algorithms or by defining kernels or distances between bags. MIL approaches belonging to first category include those of Chen et al. , Sørensen et al. , and Cheplygina et al.  that use some dissimilarity measures to represent a bag with a derived feature vector. Wang and Zucker  defined nearest neighbors among bags, Gärtner et al.  and Wang et al.  determined kernels between bags, Zhou et al.  generated a graph with instance from a bag, and Zhang et al.  incorporated structure information between bags and many more. On the other hand, the second category also known as instance-based methods focuses on the instance label and a bag label is determined by combining the classification of instances. Axis-parallel rectangle method  and Diverse Density  and its variation  are some examples of instance-based approaches. The bag-based methods are widely used by the researchers as they have shown better performance on a wide range of MIL datasets. Therefore, this study has focused on the first category and extended the recently proposed Twin Support Vector Machine (TWSVM) classifier to multiple instance learning scenarios by obtaining summarized information of each bag using different dissimilarity measures .
In recent years, many nonparallel hyperplane Support Vector Machine (SVM) classifiers have been proposed for binary classification [27–29]. For example, Mangasarian and Wild proposed the Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM), the first nonparallel hyperplane classifier, which aims to find a pair of nonparallel hyperplanes in such a way that each hyperplane is nearest to one of the two classes and as far as possible from the other class. GEPSVM shows excellent performance on several benchmark datasets, especially on the “Cross-Planes” dataset. Later, by utilizing the concepts of traditional SVM and GEPSVM, Jayadeva et al. proposed a novel nonparallel hyperplane based binary classifier named TWSVM. The aim of the TWSVM classifier is to generate two nonparallel hyperplanes in such a way that each hyperplane lies in close affinity to one of the two classes while remaining distant from the data instances of the other class. For this purpose, it solves two SVM-type Quadratic Programming Problems (QPPs), while GEPSVM solves two generalized eigenvalue problems. Since TWSVM solves two smaller QPPs as opposed to a single complex QPP, the learning of the TWSVM classifier is approximately four times faster than that of standard SVM. TWSVM has shown its superiority over other existing machine learning approaches on several benchmark datasets. Therefore, in this study, we have extended TWSVM to multiple instance learning scenarios. This paper proposes a bag dissimilarity based multiple instance learning TWSVM (MIL-TWSVM) classifier. We have defined the dissimilarity between bags using different approaches. The proposed classifier is trained with the summarized information of the instances in each bag, where a bag is represented by a feature vector. This feature vector contains the dissimilarity scores of a bag with respect to the other bags in the training set. The experiment has been performed on ten MIL benchmark datasets.
The results of the proposed approach have been compared with several existing MIL approaches, such as Diverse Density (DD), Expectation-Maximization Diverse Density (EMDD), Multi-Instance Logistic Regression (MILR), Citation k-NN, and Multi-Instance Support Vector Machine (MISVM). The effectiveness of the proposed approach has also been analyzed by using Friedman average rank hypothesis tests [32, 33]. The statistical inferences are made from the observed differences in predictive accuracy. A modified version of the Demšar significance diagram has been used to display the output of the Friedman test.
The rest of the paper is organized as follows. Section 2 provides a brief overview of multiple instance learning approaches and their applications in the real world. Section 3 includes the formulation of Twin Support Vector Machine classifier. Section 4 describes different approaches used to measure the dissimilarity between bags. The experimental results are discussed in Section 5 and finally the conclusion is drawn in Section 6.
2. Overview of Multiple Instance Learning Approaches and Their Applications
In Multiple Instance Learning (MIL), a bag is used to represent an object as follows:
$$B_i = \{x_{i1}, x_{i2}, \ldots, x_{in_i}\}, \quad x_{ij} \in \mathbb{R}^d, \tag{1}$$
where $n_i$ is the number of instances or feature vectors in bag $B_i$ and $\mathbb{R}^d$ is the $d$-dimensional feature space. Consider that the training dataset contains $N$ bags. The training dataset for MIL is represented as
$$D = \{(B_1, y_1), (B_2, y_2), \ldots, (B_N, y_N)\}, \tag{2}$$
where $y_i \in \{+1, -1\}$ is the class label corresponding to each bag in the dataset. A bag is labeled with class $+1$ if and only if it contains at least one positive instance; otherwise it is considered a negatively labeled bag. The objective of the MIL problem is to learn a model which can determine the class label of an unseen bag. MIL approaches have been widely used to solve many real world problems. Drug activity recognition is one of the most popular applications. In this problem, for a given chemical molecule, the system must decide whether it is useful for drug design or not. A good drug has the characteristic that it binds strongly to a target “binding site.” A molecule can adopt multiple conformations or shapes, and only one or a few of them bind well with the target protein or binding site. Dietterich et al. modeled the MIL framework to deal with drug activity recognition. They predicted whether a new molecule was suitable for drug design or not by analyzing a set of known molecules. They developed axis-parallel rectangle (APR) algorithms in which combinations of the extracted molecule features were used to determine the axis-parallel rectangles. Zhao et al. proposed an MIL approach based on joint instance and feature selection for drug activity prediction. They focused on the reduction of irrelevant and redundant features in order to improve the interpretability of the drug activity recognition model. Diverse Density (DD) has been proposed to measure a region in the feature space which contains at least one instance from each positive bag and no instances from negative bags.
The DD of a given point $t$ is defined as the ratio of the number of positive bags which have instances near this point to the sum of the distances of negative instances from $t$. The point at which DD is maximized corresponds to the target concept. DD suffers from local optima, and the best solution may only be reached after many restarts. The Expectation-Maximization Diverse Density (EMDD) algorithm has been proposed by Zhang and Goldman to address this problem. They combined the expectation-maximization approach with DD to deal with local optima by iteratively updating the previous target point. The expectation step finds the most positive instance from each bag, given an initial guess for the target point $t$. Then, the maximization step searches for a new point by maximizing DD on the selected most positive instances. These steps are repeated until the algorithm converges. EMDD performs well on a variety of MIL problems, but it is also very computationally intensive. Several regular supervised classifiers have also been extended to MIL scenarios, for example, Citation k-Nearest Neighbor (Citation k-NN), Bayesian k-NN, ID3-MI, and multiple instance learning SVM. Citation k-NN is an extension of k-NN in which a bag is labeled by analyzing its neighboring bags and the bags that consider the concerned bag as a neighbor. Citation k-NN uses a different distance metric (the minimal Hausdorff distance) in which the focus is shifted from instances to bags; that is, the distance is measured between different bags instead of different instances (see Figure 2). Citation k-NN has shown better performance on the Musk datasets. Recently, a variant of Citation k-NN has been proposed by Zhou et al. for a web mining task, in which the minimum Hausdorff distance has been modified for text features. ID3-MI is a decision tree algorithm which uses a multi-instance entropy criterion to split the tree nodes.
SVM has been extended to MIL scenario by Andrews et al. . They have used two approaches for the extension of SVM. In the first approach, traditional SVM has been extended to MIL scenarios in which hidden labels of instances are decided under constraints posed by the class labels of bags. In the second approach, the objective was to maximize the bag margin directly. MIL Boost is another example of MIL approach where the weights of instances are updated in each of the boosting rounds . In this approach, Noisy-OR rule is used to determine the bag labels from given instance labels. Logistic Regression and Neural Network are also extended to the MIL framework. Logistic Regression is a popular probabilistic supervised learning approach which has been extended to MIL framework. Fu and Robles-Kelly extended Logistic Regression (LR) to MIL problem domains and combine and regularization methods . Xu and Frank also upgraded single instance LR to multi-instance data and showed its effectiveness on artificial and Musk drug activity prediction dataset . They have followed several assumptions to form the bag-level probability from instance-level class probabilities. Ramon and De Raedt have also explored the utility of NN for MIL due to its ability of automatic learning from examples . In another research work, Zhou and Zhang proposed BP-MLP which is the extension of NN to MIL . In this approach, traditional BP algorithm has been extended using a global error function which is defined at bag level rather than at instance level. Image classification and retrieval is another significant application of MIL in which a given image is to be classified into a target category on the basis of its visual content [40–42]. In MIL, an image can be viewed as a bag of local image patches and can be labeled as positive or negative. 
A positive label image contains a set of image patches or instances in which at least one patch is conceptual to the user while if all the patches are not conceptual to the user then image is considered as negative label image. For example, in beach scene classification, the target class is beach and by using different regions or contents of the scene image (see Figure 3), the objective is to recognize whether it belongs to beach scene or not. In this case, any visual content that displays a beach may be considered as positive image while negative images show another different visual content.
Xu proposed an MIL extension of Neural Network for the retrieval and classification of images . Maron and Ratan applied DD MIL approach for the classification of natural image scene by using different kinds of bag generators . Bag generators consider each image as a bag and various subregions in the image as instances. They have performed experiment on COREL photo library. Cheng et al. proposed BP-MLP and BP-SVM approaches for automatic image categorization in multiple instance learning scenario and performed experiment on 2000 images obtained from COREL repository . They have extracted frequent patterns from each image category and embedded an image bag into a multidimensional data point which is useful to characterize the similarity between the image and every common pattern of an image category. Gondra and Xu proposed a Relevance Feedback (RF) learning based Content Based Image Retrieval (CBIR) framework in multiple instance learning scenario . In another research work, Pao et al. proposed an EMDD based MIL method for image classification . Sener and Ikizler-Cinbis proposed an ensemble of multiple instance learning approaches for the problem of image reranking . They have constructed bags by using three different approaches: sliding window and dynamic and dynamic-sliding methods. Then the constructed bags have been used to develop multi-instance classifiers. They have used multiple instance learning with instance selection (MILES) algorithm as MIL-classifiers. Rank was assigned to an image by combining the decision score of MIL-classifiers. Li and Liu used graph based MIL with instance weighting for the retrieval of images . Different weights were assigned to each region in positive images on the basis of learning results and then rank was calculated for each image. Feng et al. proposed a multi-instance semisupervised learning approach on the basis of hierarchical sparse representation for the categorization of images . 
They have solved the instance confidence value identification problem under the framework of instance-level sparse representation. Several other research works have also focused on the image categorization problem in MIL framework. Xu et al. utilized the concept of deep learning of feature representation with multiple instance learning for colon cancer classification based on histopathology images . The deep learning network is a process of obtaining high level features from low level features. They have proposed a system based on deep learning having a set of linear filters in encoder and decoder and used the last hidden layer of deep learning as fully supervised feature learning, as it represents intrinsical features compared to lower level features. Wu et al. extended deep learning to multiple instance learning framework for image annotation . They have used deep convolutional neural network which contains five convolutional layers, followed by a pooling layer and three fully connected layers for learning visual representation with multiple instance learning. The last hidden layer was redesigned for multiple instance learning. Kotzias et al. have also combined the concept of deep learning and multi-instance learning for knowledge transfer . MIL has also been utilized in disease diagnosis by analyzing medical images. Ding et al. considered the breast ultrasound image classification task as multiple instance learning tasks and proposed an MIL method based on SVM which classifies the tumors into benign and malignant . They have used self-organizing map (SOM) to map the instance space into concept-space and constructed the bag feature vector by using the distribution of the instances of each bag. Li et al. proposed a novel computer aided diagnosis scheme for the recognition of tumor invasion depth of gastric cancer . 
They have extracted both bag-level and instance-level features and applied an improved citation k-NN algorithm for the identification of gastric tumor invasion depth. Tong et al. used MIL method for the detection of Alzheimer’s disease (AD) and its prodromal stage mild cognitive impairment (MCI) . They have built a graph for each image to identify the relationships among the patches and performed experiment on 834 MRI images taken from ADNI study. In another research work, Quellec et al. proposed an MIL framework for diabetic retinopathy screening . Text categorization is another popular application of MIL. Wang et al.  proposed a novel instance specific distance method for the application of MIL text categorization. They have derived this data from Reuters-21578 collection having 2000 bags with 243 features. He and Wang investigated the problem of text categorization from multiple instance view in which each text is considered as a bag and each of its sentences as instance . They have developed an MIL approach for Chinese text classification using k-NN. MIL is also used for web mining or web index recommendation problem in which each web page is considered as a bag and each of its linked pages is considered as bag instances. Viola et al.  proposed an algorithm, known as Fretcit k-NN based on minimum Hausdorff dissimilarity measure, and determined the class label of unseen bag by utilizing both references and citers. MIL approaches also have significant contribution to visual tracking [15, 17, 58] and real time video event detection areas .
3. Twin Support Vector Machine
Twin Support Vector Machine is a binary classification technique that classifies data instances by constructing two nonparallel hyperplanes instead of a single hyperplane as in the case of the traditional Support Vector Machine. It obtains the two nonparallel hyperplanes by solving two QPPs of smaller size, as compared to the single complex QPP solved by traditional SVM. TWSVM generates a hyperplane for each class in such a way that the data instances of each class lie in close affinity to its corresponding hyperplane and as far as possible from the other hyperplane. The effectiveness of TWSVM over other existing classification approaches has been validated on various benchmark datasets. TWSVM has better generalization ability and faster computational speed, due to which it has been applied to several real life applications such as intrusion detection [60, 61], activity recognition, image denoising, emotion recognition, text classification, defect prediction [66, 67], disease diagnosis [68, 69], and speaker identification. Consider a binary classification problem of size $m \times n$. The training dataset for such a problem can be represented as
$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}, \tag{3}$$
where $x_i \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, represents the input data instances in the $n$-dimensional feature space and $y_i \in \{+1, -1\}$ indicates the corresponding class label. Consider two matrices $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$ comprising the data instances of class $+1$ and class $-1$, respectively. TWSVM solves the following two QPPs:
$$\min_{w_1, b_1, \xi} \ \frac{1}{2}\|A w_1 + e_1 b_1\|^2 + c_1 e_2^T \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi \ge e_2, \ \xi \ge 0, \tag{4}$$
$$\min_{w_2, b_2, \eta} \ \frac{1}{2}\|B w_2 + e_2 b_2\|^2 + c_2 e_1^T \eta \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \eta \ge e_1, \ \eta \ge 0, \tag{5}$$
and seeks the following two nonparallel hyperplanes in $\mathbb{R}^n$:
$$x^T w_1 + b_1 = 0, \qquad x^T w_2 + b_2 = 0. \tag{6}$$
Here, $w_1$ and $w_2$ are the normal vectors to the hyperplanes; $b_1$ and $b_2$ represent the bias terms. $e_1$ and $e_2$ are two vectors of 1’s of appropriate dimensions. $c_1$ and $c_2$ are two positive trade-off constants. $\xi$ and $\eta$ are the slack variables of (4) and (5), associated with the data instances of class $-1$ and class $+1$, respectively. The first term of (4) or (5) is the sum of squared distances of the data instances from their corresponding hyperplane. Minimization of this term keeps the hyperplane closest to the data instances of class $+1$ or class $-1$.
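The second term of (4) or (5) assigns a penalty to the data instances of the other class which are misclassified. The constraints require the hyperplane to be at a distance of at least 1 from the data instances of the other class. The slack variable measures the error wherever the hyperplane is closer than this distance of 1. In this way, the hyperplane is kept close to the data instances of its respective class and as far as possible from the data instances of the other class. The Lagrangian corresponding to (4) is given as follows:
$$L(w_1, b_1, \xi, \alpha, \beta) = \frac{1}{2}(A w_1 + e_1 b_1)^T (A w_1 + e_1 b_1) + c_1 e_2^T \xi - \alpha^T\left(-(B w_1 + e_2 b_1) + \xi - e_2\right) - \beta^T \xi, \tag{7}$$
where $\alpha$ and $\beta$ are two vectors of Lagrange multipliers. The Karush-Kuhn-Tucker (KKT) conditions are given by
$$A^T (A w_1 + e_1 b_1) + B^T \alpha = 0, \tag{8}$$
$$e_1^T (A w_1 + e_1 b_1) + e_2^T \alpha = 0, \tag{9}$$
$$c_1 e_2 - \alpha - \beta = 0, \tag{10}$$
$$-(B w_1 + e_2 b_1) + \xi \ge e_2, \quad \xi \ge 0, \tag{11}$$
$$\alpha^T\left(-(B w_1 + e_2 b_1) + \xi - e_2\right) = 0, \quad \beta^T \xi = 0, \tag{12}$$
$$\alpha \ge 0, \quad \beta \ge 0. \tag{13}$$
Since $\beta \ge 0$, from (10), we can determine
$$0 \le \alpha \le c_1 e_2. \tag{14}$$
Equations (8) and (9) lead to
$$\begin{bmatrix} A^T \\ e_1^T \end{bmatrix} \begin{bmatrix} A & e_1 \end{bmatrix} \begin{bmatrix} w_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^T \\ e_2^T \end{bmatrix} \alpha = 0. \tag{15}$$
Let $H = [A \ \ e_1]$, $G = [B \ \ e_2]$, and $u = [w_1^T \ \ b_1]^T$. The above equation becomes
$$H^T H u + G^T \alpha = 0, \quad \text{that is,} \quad u = -(H^T H)^{-1} G^T \alpha. \tag{16}$$
In similar manner, with $v = [w_2^T \ \ b_2]^T$ and the Lagrange multiplier vector $\gamma$ of (5),
$$v = (G^T G)^{-1} H^T \gamma. \tag{17}$$
From (16) and (17), it is clear that the solution of the hyperplane parameters requires the inverses of the matrices $H^T H$ and $G^T G$. Sometimes, these matrices may be ill-conditioned, due to which it is difficult to calculate their inverses. To avoid this situation, regularization terms $\epsilon_1 I$ and $\epsilon_2 I$ are added to the above-mentioned equations as follows:
$$u = -(H^T H + \epsilon_1 I)^{-1} G^T \alpha, \qquad v = (G^T G + \epsilon_2 I)^{-1} H^T \gamma, \tag{18}$$
where $\epsilon_1, \epsilon_2 > 0$ are user defined parameters having small values and $I$ is an identity matrix of appropriate dimension. The Wolfe duals of (4) and (5) can be defined as
$$\max_{\alpha} \ e_2^T \alpha - \frac{1}{2} \alpha^T G (H^T H)^{-1} G^T \alpha \quad \text{s.t.} \quad 0 \le \alpha \le c_1 e_2,$$
$$\max_{\gamma} \ e_1^T \gamma - \frac{1}{2} \gamma^T H (G^T G)^{-1} H^T \gamma \quad \text{s.t.} \quad 0 \le \gamma \le c_2 e_1. \tag{19}$$
Using these equations, we can determine the Lagrange multipliers, which are then used to obtain the hyperplane parameters. In this way, a hyperplane is constructed for each class using (6). Class $+1$ or $-1$ is assigned to a new data instance depending upon its closeness to the two nonparallel hyperplanes. TWSVM assigns a class label to an instance $x$ by using the following decision function:
$$\text{class}(x) = \arg\min_{i = 1, 2} \frac{|x^T w_i + b_i|}{\|w_i\|}, \tag{20}$$
where $|\cdot|$ is the absolute value, $i = 1$ corresponds to class $+1$, and $i = 2$ corresponds to class $-1$. TWSVM has also been extended to the nonlinear case, where data instances are not separable by linear class boundaries. For this purpose, it uses the kernel trick to transform the data instances into a higher-dimensional feature space. Nonlinear TWSVM seeks the following two kernel surfaces instead of planes:
$$K(x^T, C^T)\mu_1 + b_1 = 0, \qquad K(x^T, C^T)\mu_2 + b_2 = 0, \tag{21}$$
where $K$ is any arbitrary kernel function and $C = [A^T \ \ B^T]^T$.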
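The primal QPPs of nonlinear TWSVM corresponding to the kernel-generated surfaces (21) are given below:
$$\min_{\mu_1, b_1, \xi} \ \frac{1}{2}\|K(A, C^T)\mu_1 + e_1 b_1\|^2 + c_1 e_2^T \xi \quad \text{s.t.} \quad -(K(B, C^T)\mu_1 + e_2 b_1) + \xi \ge e_2, \ \xi \ge 0,$$
$$\min_{\mu_2, b_2, \eta} \ \frac{1}{2}\|K(B, C^T)\mu_2 + e_2 b_2\|^2 + c_2 e_1^T \eta \quad \text{s.t.} \quad (K(A, C^T)\mu_2 + e_1 b_2) + \eta \ge e_1, \ \eta \ge 0. \tag{22}$$
Similar to (16) and (17), the kernel-generated surface parameters can be determined as
$$\begin{bmatrix} \mu_1 \\ b_1 \end{bmatrix} = -(S^T S)^{-1} R^T \alpha, \qquad \begin{bmatrix} \mu_2 \\ b_2 \end{bmatrix} = (R^T R)^{-1} S^T \gamma. \tag{23}$$
Here, $S = [K(A, C^T) \ \ e_1]$ and $R = [K(B, C^T) \ \ e_2]$. Similar to the linear case, regularization terms $\epsilon_1 I$ and $\epsilon_2 I$ are added to (23) to avoid ill-conditioned matrices. A new data instance is labeled with class $+1$ or $-1$ in a similar manner to the linear case.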
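The linear formulation of this section can be illustrated with a short NumPy sketch. It is not the implementation used in the experiments (which were run in MATLAB) and rests on several assumptions: the box-constrained Wolfe duals are solved with a simple projected gradient ascent rather than a dedicated QP solver, the regularized inverses are used inside the duals, the decision rule divides by the norm of the normal vector, and all names (`solve_box_qp`, `fit_linear_twsvm`, `predict_twsvm`, `eps1`, `eps2`) as well as the toy data are hypothetical.

```python
import numpy as np


def solve_box_qp(Q, c, n_iter=5000):
    """Maximise e^T a - 0.5 a^T Q a subject to 0 <= a <= c.

    Projected gradient ascent with step 1/L, where L is the largest
    eigenvalue of the symmetric positive semidefinite matrix Q.
    """
    a = np.zeros(Q.shape[0])
    step = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)
    for _ in range(n_iter):
        grad = 1.0 - Q @ a                      # gradient of the dual objective
        a = np.clip(a + step * grad, 0.0, c)    # projection onto the box [0, c]
    return a


def fit_linear_twsvm(A, B, c1=1.0, c2=1.0, eps1=1e-4, eps2=1e-4):
    """Return ((w1, b1), (w2, b2)) for class +1 (rows of A) and class -1 (rows of B)."""
    H = np.hstack([A, np.ones((A.shape[0], 1))])    # H = [A e1]
    G = np.hstack([B, np.ones((B.shape[0], 1))])    # G = [B e2]
    identity = np.eye(H.shape[1])
    HtH_inv = np.linalg.inv(H.T @ H + eps1 * identity)  # regularised inverse
    GtG_inv = np.linalg.inv(G.T @ G + eps2 * identity)
    alpha = solve_box_qp(G @ HtH_inv @ G.T, c1)     # dual of the first QPP
    gamma = solve_box_qp(H @ GtG_inv @ H.T, c2)     # dual of the second QPP
    u = -HtH_inv @ G.T @ alpha                      # u = [w1; b1]
    v = GtG_inv @ H.T @ gamma                       # v = [w2; b2]
    return (u[:-1], u[-1]), (v[:-1], v[-1])


def predict_twsvm(X, plane1, plane2):
    """Assign +1 if a row of X is closer to the first hyperplane, else -1."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)


# Hypothetical toy data: two small, mirror-symmetric clusters.
A_toy = np.array([[2.0, 1.0], [2.0, -1.0], [3.0, 0.0], [1.0, 0.0]])  # class +1
B_toy = -A_toy                                                        # class -1
plane_pos, plane_neg = fit_linear_twsvm(A_toy, B_toy)
print(predict_twsvm(np.vstack([A_toy, B_toy]), plane_pos, plane_neg))
```

Each dual has only as many variables as the opposite class has instances, which is the source of the speed advantage discussed above. For the nonlinear case the same routine could be reused by replacing `A` and `B` with the kernel matrices $K(A, C^T)$ and $K(B, C^T)$.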
4. Bag Dissimilarity Representation
In the multiple instance learning case, a classifier works at the bag level rather than the instance level and takes a bag as its input. Therefore, the objective of MIL is to develop a classifier which generates a decision function for the bag. In the proposed approach, each bag is represented by a vector of its dissimilarities to the other bags in the training set. The dissimilarities of a bag to all the other bags form its feature vector. If there are $N$ bags in the training set and the $i$th bag contains $n_i$ instances, then the $i$th bag can be represented as
$$\mathbf{d}(B_i) = [d(B_i, B_1), d(B_i, B_2), \ldots, d(B_i, B_N)]. \tag{24}$$
Thus, each bag has a single feature vector or instance representation, and the MIL problem can be considered as a regular supervised learning problem. The dissimilarity $d(B_i, B_j)$ between two bags $B_i$ and $B_j$ is measured in different ways, which are classified into two main categories on the basis of the bag representation. If a bag is represented as a point set in the high-dimensional feature space, then the dissimilarity between two bags can be measured using a set distance. The following distance metrics have been used to calculate the dissimilarity between bags.
(a) Hausdorff Distance. The Hausdorff distance is one of the most popular distance metrics used in object recognition in the field of computer vision. Two bags $B_i$ and $B_j$ are said to be close to each other if every instance of bag $B_i$ is close to an instance in bag $B_j$. The dissimilarity between two bags $B_i$ and $B_j$ is defined as
$$d_H(B_i, B_j) = \max\{\vec{d}(B_i, B_j), \vec{d}(B_j, B_i)\}. \tag{25}$$
Here, $\vec{d}(B_i, B_j)$ represents the directed distance between two bags $B_i$ and $B_j$. In detail, given two bags $B_i$ and $B_j$, the directed distance between $B_i$ and $B_j$ is calculated as
$$\vec{d}(B_i, B_j) = \max_{x_{ik} \in B_i} \ \min_{x_{jl} \in B_j} \|x_{ik} - x_{jl}\|. \tag{26}$$
Simply, $\|x_{ik} - x_{jl}\|$ measures the Euclidean distance between the instances $x_{ik}$ and $x_{jl}$. The $N$-dimensional representation of the $i$th bag, $\mathbf{d}(B_i)$, is formed as a vector of such dissimilarities between the $i$th bag and all the other bags in the training set. The final Hausdorff distance between two bags is symmetrized by taking the maximum of the directed distances between them, as $\vec{d}$ is not symmetric. The dissimilarity between two bags can also be defined by taking the minimum and the average of the squared Euclidean distances as follows:
$$d_{\min}(B_i, B_j) = \min_{k} \ \min_{l} \|x_{ik} - x_{jl}\|^2, \tag{27}$$
$$d_{\text{avg}}(B_i, B_j) = \frac{1}{n_i} \sum_{k=1}^{n_i} \min_{l} \|x_{ik} - x_{jl}\|^2. \tag{28}$$
Figure 4 shows the minimum Euclidean distances between the instances of two bags. According to Figure 4(a), all the instances in bag $B_i$ have the same closest instance in bag $B_j$, while the instances in bag $B_j$ have two different closest instances in bag $B_i$. As a result, the minimum distances between the instances of two bags are asymmetric. This distance can be symmetrized by again taking the minimum of these distances; that is, $d_{\min}(B_i, B_j) = d_{\min}(B_j, B_i)$. For the case of the average distance as given by (28), in general $d_{\text{avg}}(B_i, B_j) \ne d_{\text{avg}}(B_j, B_i)$.
(b) City Block Distance. The dissimilarity between two bags can be defined by using City Block distance metric as follows:
(c) Chi-Squared Distance. Chi-squared distance is the weighted Euclidean distance which measures the dissimilarity between two bags as follows:

If a bag can be viewed as a probability distribution in the instance space, then the dissimilarity between two bags can be defined through a distribution distance. It is not only difficult to determine the probability density function in a higher-dimensional feature space but also very computationally expensive to estimate the true distributions of instances. Therefore, the instance distributions are approximated and the distance is measured between the approximated distributions by using the following two distance metrics.
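(a) Earth Mover’s Distance (EMD). Earth Mover’s distance measures the minimum amount of work needed to transform one probability distribution into another probability distribution. Consider that each instance has $1/n_i$ of the total probability mass in bag $B_i$ of size $n_i$. The Earth Mover’s distance between two bags is computed as
$$d_{\text{EMD}}(B_i, B_j) = \min_{f} \ \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} f_{kl} \, d(x_{ik}, x_{jl}), \tag{31}$$
where $d(x_{ik}, x_{jl})$ is the Euclidean distance and $f_{kl}$ is the flow between instances “$k$” and “$l$” associated with additional constraints: $f_{kl} \ge 0$, $\sum_{l} f_{kl} \le 1/n_i$, $\sum_{k} f_{kl} \le 1/n_j$, and $\sum_{k} \sum_{l} f_{kl} = 1$.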
(b) Mahalanobis Distance. Each bag $B_i$ is approximated by a single Gaussian distribution with mean $\mu_i$ and covariance matrix $\Sigma_i$ as parameters. The bag dissimilarity between two bags $B_i$ and $B_j$ through the Mahalanobis distance is defined as
$$d_{\text{Mah}}(B_i, B_j) = (\mu_i - \mu_j)^T \left(\tfrac{1}{2}\Sigma_i + \tfrac{1}{2}\Sigma_j\right)^{-1} (\mu_i - \mu_j). \tag{32}$$
In this way, we can calculate the dissimilarity score of a bag with respect to the rest of the bags. The new vector representation of each bag acts as an input to the TWSVM classifier, which now works at the bag level. The problem has thus been converted into a single instance binary classification problem in which a bag has either the $+1$ or the $-1$ class label. Figure 5 depicts an example of multiple instance learning data having four bags. Each bag contains a different number of instances: bag 1 and bag 3 have three instances, while bag 2 contains two instances and bag 4 consists of four instances. A class label is associated with each bag instead of with individual instances. Traditional supervised learning approaches are not designed for this type of problem. Thus a bag-level MIL-TWSVM classifier is trained with this summarized data. During the testing phase, a similar representation is obtained for the query bag, and the proposed classifier takes a decision for the bag on the basis of the minimum distance criterion.
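The bag dissimilarity representation of this section can be sketched in NumPy as follows. The sketch covers only the point-set distances (maximum Hausdorff, minimum, and average of minimum squared distances) and makes a few assumptions that are not stated in the text: each bag is stored as an array with one instance per row, the representation of a training bag includes its zero dissimilarity to itself, and all function names and toy bags are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(Bi, Bj):
    """Squared Euclidean distance between every instance of Bi and every instance of Bj."""
    diff = Bi[:, None, :] - Bj[None, :, :]
    return np.sum(diff ** 2, axis=2)


def directed_hausdorff(Bi, Bj):
    """Largest distance from an instance of Bi to its closest instance in Bj."""
    return np.sqrt(pairwise_sq_dists(Bi, Bj).min(axis=1).max())


def max_hausdorff(Bi, Bj):
    """Symmetrised Hausdorff distance: maximum of the two directed distances."""
    return max(directed_hausdorff(Bi, Bj), directed_hausdorff(Bj, Bi))


def min_min(Bi, Bj):
    """Smallest squared distance between any instance of Bi and any instance of Bj."""
    return pairwise_sq_dists(Bi, Bj).min()


def mean_min(Bi, Bj):
    """Average over Bi of the squared distance to the closest instance in Bj (asymmetric)."""
    return pairwise_sq_dists(Bi, Bj).min(axis=1).mean()


def dissimilarity_matrix(query_bags, train_bags, dist=max_hausdorff):
    """Row q holds the dissimilarities of query bag q to every training bag."""
    return np.array([[dist(q, t) for t in train_bags] for q in query_bags])


# Hypothetical toy bags with different numbers of two-dimensional instances.
bag_a = np.array([[0.0, 0.0], [1.0, 0.0]])
bag_b = np.array([[0.0, 0.0], [0.0, 3.0]])
bag_c = np.array([[4.0, 0.0]])
train_bags = [bag_a, bag_b, bag_c]

D_train = dissimilarity_matrix(train_bags, train_bags)
print(D_train)
```

Each row of `D_train` is the single feature vector that replaces a bag, so a single instance classifier can be trained on the rows together with the bag labels. For an unseen bag the same function would be called with the query bag and the training bags, which yields a vector of the same length.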
5. Numerical Experiments
This section presents the experimental results of our proposed MIL-TWSVM classifier on ten benchmark MIL datasets. We have analyzed the performance of proposed MIL-TWSVM classifier with different dissimilarity metrics. The results of MIL-TWSVM have been compared with several existing MIL approaches such as Diverse Density (DD), Expectation-Maximization Diverse Density (EMDD), Multi-Instance Logistic Regression (MILR), Citation k-NN, and Multi-Instance Support Vector Machine (MISVM). All these classifiers have been implemented in MATLAB 2012a on Windows 7 operating system with Intel core i-7 processor with 12 GB RAM. This section has been divided into four subsections. The first subsection includes the description of benchmark MIL datasets used in this study. The second subsection analyzes the impact of parameters on the performance of proposed classifier. Experimental results are discussed and analyzed in subsections three and four, respectively.
5.1. Dataset Description
In this study, the experiment has been performed on ten MIL benchmark datasets: Musk 1, Musk 2, Mutagenesis-atoms, Winter Wren, Brown Creeper, Elephant, Fox, Tiger, eastWest, and westEast datasets. These datasets are available online at http://www.miproblems.org/. The detailed description of these datasets is shown in Table 1. These datasets are widely adopted for the performance evaluation of new MIL approaches. These datasets represent four different categories of MIL example.
Musk 1 and Musk 2 are two standard drug activity prediction benchmark datasets in which a bag represents a molecule and the different conformations (shapes) of that molecule are the instances of the bag. In these datasets, a human expert assigns each bag the class label “musk” or “non-musk.” The Musk 1 dataset contains 92 bags while the Musk 2 dataset contains 102 bags. Musk 2 contains more instances (molecule conformations) than Musk 1. The objective of MIL is to predict whether a new molecule is “musk” or “non-musk.” Another dataset that belongs to the category of drug activity prediction is Mutagenesis-atoms, which contains 125 positive and 63 negative bags.
Brown Creeper and Winter Wren are two audio MIL datasets that contain recordings of the songs of different bird species. A bag represents an audio fragment and is labeled as positive if the particular species is heard in that fragment. Because birds of the same species have similar songs, each species corresponds to a different concept. It is also possible that some species are heard together more often. In this case, audio that is negative for one bird species can still be useful for determining whether a fragment contains that species.
Content-Based Image Retrieval (CBIR) is another widely recognized application of MIL, in which the objective is to determine whether a given image is of interest to the user. An image is represented by a set of regions or image patches: the image corresponds to a bag and its regions are the instances of that bag. The class labels of individual instances are unknown. In this study, we have used three image datasets: Elephant, Fox, and Tiger.
The eastWest and westEast datasets belong to an inductive logic programming (ILP) problem and have been collected from the eastWest challenge. The objective of this challenge is to predict whether a train is eastbound or westbound. In both datasets, a bag represents a train, which contains various cars (instances) of different shapes and sizes. The attributes of each car, such as the load it carries, are the instance-level attributes. The challenge yields two MI datasets, eastWest and westEast, because it is not clear whether eastbound or westbound trains should be treated as the positive class. In the eastWest dataset, eastbound trains are regarded as positive examples; in the westEast dataset, westbound trains are regarded as positive examples.
5.2. Parameters Selection
This study uses the Gaussian kernel function for the nonlinear case. The MIL-TWSVM classifier has four penalty parameters and an additional kernel parameter sigma. The predictive performance of the classifier is affected by the choice of these parameters. This study uses grid search, which is one of the most widely used approaches for parameter selection [27, 65, 72–74]. The penalty and kernel parameters are each selected from a predefined set of candidate values. The experiment has been conducted using 10-fold cross-validation. Grid search trains the proposed classifier with each pair (penalty, sigma) in the Cartesian product of these two sets and evaluates its performance by internal cross-validation on the training set, so multiple MIL-TWSVMs are trained per pair. Finally, it outputs the setting that achieved the highest score in the validation procedure. We have analyzed the influence of these parameters on the performance of MIL-TWSVM on three datasets, Tiger, Fox, and Mutagenesis-atoms, as shown in Figures 6, 7, and 8. For the linear case, we constrain the penalty parameters so that only two values need to be tuned, which reduces the computational complexity, and analyze their influence on the predictive performance of the linear classifier. For the nonlinear case, we constrain the penalty parameters to a single value for the same reason and analyze the influence of this penalty value and sigma on the predictive performance of the nonlinear MIL-TWSVM classifier. For the Tiger dataset, the impact of the parameters has been analyzed using the Max-Hausdorff dissimilarity measure, as shown in Figure 6. The figure shows that the proposed linear MIL-TWSVM classifier performs better with a low value of one penalty parameter and a high value of the other. The performance of MIL-TWSVM degrades abruptly for low values of one of these parameters. For the nonlinear case, MIL-TWSVM performs better with a high value of sigma and a low value of the penalty parameter. The Max-Hausdorff based MIL-TWSVM classifier performs better for different combinations of the penalty and sigma parameters on the other datasets.
The impact of these parameters on the Fox and Mutagenesis-atoms datasets has been analyzed using the Min-Hausdorff and EMD dissimilarity measures, respectively. On the Fox dataset, the Min-Hausdorff based linear MIL-TWSVM performs better with a low value of one penalty parameter and a high value of the other. The nonlinear MIL-TWSVM classifier achieves better predictive accuracy on the Fox dataset with a high value of sigma and a low value of the penalty parameter, as shown in Figure 7. For Mutagenesis-atoms, the EMD based linear MIL-TWSVM obtains its highest accuracy for low values of both penalty parameters. The nonlinear MIL-TWSVM obtains its highest accuracy on the Mutagenesis-atoms dataset for low values of sigma and the penalty parameter, as shown in Figure 8. For every combination of these parameters (penalty and sigma) and dissimilarity measures, the proposed MIL-TWSVM classifier behaves differently on different datasets. Therefore, appropriate selection of these parameters is essential to obtain good performance from the MIL-TWSVM classifier.
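The selection procedure described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the candidate values in the grid, and the toy data are all assumptions, and the classifier is a simple kernel-based stand-in that is not MIL-TWSVM. It only shows how each (penalty, sigma) pair from the Cartesian product is scored by 10-fold cross-validation and how the best pair is kept.

```python
import itertools
import math
import random
from statistics import mean

def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_val_accuracy(fit, predict, X, y, params, n_folds=10):
    """Mean validation accuracy of one parameter setting over n_folds splits."""
    scores = []
    for held_out in k_fold_indices(len(X), n_folds):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return mean(scores)

def grid_search(fit, predict, X, y, grid, n_folds=10):
    """Score every point in the Cartesian product of the grid; keep the best."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, predict, X, y, params, n_folds)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in classifier (NOT MIL-TWSVM): compares Gaussian-kernel
# similarity to each class, with c acting as a weight on the positive class.
def fit_toy(X, y, c, sigma):
    return {"X": X, "y": y, "c": c, "sigma": sigma}

def predict_toy(model, x):
    def mean_kernel(label):
        pts = [p for p, t in zip(model["X"], model["y"]) if t == label]
        return mean(
            math.exp(-sum((a - b) ** 2 for a, b in zip(x, p))
                     / (2 * model["sigma"] ** 2))
            for p in pts
        )
    return 1 if model["c"] * mean_kernel(1) > mean_kernel(-1) else -1

# Hypothetical 1-D data: 20 positive and 20 negative points, well separated.
X = [(2.0 + 0.05 * i,) for i in range(20)] + [(-2.0 - 0.05 * i,) for i in range(20)]
y = [1] * 20 + [-1] * 20

# Hypothetical candidate values; the ranges used in the paper are not assumed.
grid = {"c": [0.1, 1.0, 10.0], "sigma": [0.1, 1.0, 10.0]}
best_params, best_score = grid_search(fit_toy, predict_toy, X, y, grid)
print(best_params, best_score)
```

In practice the toy `fit_toy` and `predict_toy` functions would be replaced by training and prediction routines for the actual classifier, operating on the dissimilarity vectors of the bags.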
5.3. Results and Discussion
The results report the average and standard deviation of the classification accuracies obtained with 10-fold cross-validation. Bold values indicate the better predictive accuracy. For the linear case, the Min-Hausdorff dissimilarity based MIL-TWSVM obtains the highest accuracy on the Musk 1, Winter Wren, Elephant, Fox, Tiger, and westEast datasets. The Max-Hausdorff based linear MIL-TWSVM classifier obtains the highest accuracy on the Musk 2 and Brown Creeper datasets. The EMD based linear MIL-TWSVM achieves the highest accuracy on the Mutagenesis-atoms dataset. MIL-TWSVM classifiers based on the other bag dissimilarity measures perform poorly on all types of datasets. Similarly, for the nonlinear case, the Min-Hausdorff and Max-Hausdorff dissimilarity based MIL-TWSVM classifiers perform better on the Winter Wren, Brown Creeper, Elephant, Tiger, Fox, eastWest, and westEast datasets. The EMD based nonlinear MIL-TWSVM classifier achieves the highest predictive accuracy on the Mutagenesis-atoms dataset. Therefore, we can conclude that MIL-TWSVM performs best with the Min-Hausdorff and Max-Hausdorff dissimilarity scores. Further, we have compared the performance of the Min-Hausdorff based MIL-TWSVM classifier with existing MIL approaches: Expectation-Maximization Diverse Density (EMDD), Diverse Density (DD), Multi-Instance Logistic Regression (MILR), Citation k-NN, and Multi-Instance Support Vector Machine (MISVM), as shown in Table 4.
From Table 4, it is observed that the proposed MIL-TWSVM classifier achieves the highest predictive accuracy on all ten benchmark datasets and thus performs better than the other existing MIL approaches.
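To make the two best performing dissimilarity scores concrete, the following Python sketch computes Min-Hausdorff and Max-Hausdorff dissimilarities between bags and builds the dissimilarity vector of each bag. It assumes the commonly used definitions (the minimum distance over all instance pairs for Min-Hausdorff, and the classical symmetric Hausdorff distance for Max-Hausdorff) with Euclidean distance between instances; the function names and toy bags are hypothetical and the paper's exact formulation may differ.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two instances given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_hausdorff(bag_a, bag_b):
    """Smallest distance between any instance of bag_a and any of bag_b."""
    return min(euclidean(a, b) for a in bag_a for b in bag_b)

def directed_hausdorff(bag_a, bag_b):
    """Largest distance from an instance of bag_a to its nearest one in bag_b."""
    return max(min(euclidean(a, b) for b in bag_b) for a in bag_a)

def max_hausdorff(bag_a, bag_b):
    """Classical (symmetric) Hausdorff distance between two bags."""
    return max(directed_hausdorff(bag_a, bag_b),
               directed_hausdorff(bag_b, bag_a))

def dissimilarity_representation(bags, reference_bags, measure):
    """Represent each bag by its dissimilarities to every reference bag."""
    return [[measure(bag, ref) for ref in reference_bags] for bag in bags]

# Hypothetical toy bags with two-dimensional instances.
bags = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 3.0), (6.0, 0.0)],
    [(0.0, 0.0)],
]

print(min_hausdorff(bags[0], bags[1]))   # closest pair of instances
print(max_hausdorff(bags[0], bags[1]))   # worst-case nearest-neighbour gap
D = dissimilarity_representation(bags, bags, min_hausdorff)
print(D)  # one row per bag; row i is the feature vector of bag i
```

Each row of `D` is a fixed-length vector, so a standard vector-based classifier can be trained on bags regardless of how many instances each bag contains.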
5.4. Statistical Comparison
The Friedman test [32, 33] ranks the classifiers on each dataset independently according to their performance: the best performing classifier receives the first rank, the second best receives the second rank, and so on. Classifiers that show the same performance receive the average of the tied ranks. Let $r_i^j$ be the rank of the $j$th classifier on the $i$th dataset. The Friedman test statistic is calculated as
$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right]. \tag{31}$$
Here, $N$ represents the number of datasets used in this study for comparison purposes, $k$ denotes the number of classifiers, and $R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j$ is the average rank of the $j$th classifier. The Friedman test statistic follows a chi-squared distribution with $k-1$ degrees of freedom. The null hypothesis, which states that there is no difference between the classifiers, is rejected if the value of the Friedman test statistic exceeds the critical value corresponding to $k-1$ degrees of freedom; otherwise it cannot be rejected. The Nemenyi post hoc test reports the significant differences between individual classifiers. According to this test, two classifiers are significantly different if their average ranks differ by at least the critical difference (CD), which is obtained as
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}, \tag{34}$$
where the critical value $q_\alpha$ is calculated on the basis of the studentized range statistic. The results of the Friedman test are plotted using a modified Demšar significance diagram. We have calculated the average rank of each MIL approach on the basis of its performance on each dataset (see Table 4). Then, the Friedman test statistic is calculated according to (31). From Table 4, it is observed that the Min-Hausdorff dissimilarity based MIL-TWSVM classifier achieves the best average rank among all MIL approaches, and the Max-Hausdorff based MIL-TWSVM classifier achieves the second best average rank. Consider $\alpha = 0.05$; then the critical value for 8 degrees of freedom from the chi-squared table is 15.507. The obtained Friedman test statistic is 36.83, which is much higher than this critical value. Hence, we reject the null hypothesis that there is no difference between the classifiers. The critical value $q_{0.05}$ for nine classifiers is 3.102. The critical difference for $\alpha = 0.05$ is determined using (34) as follows:
$$CD = 3.102\sqrt{\frac{9 \times 10}{6 \times 10}} \approx 3.80.$$
Figure 9 depicts the Demšar significance diagram, in which the MIL approaches are arranged in ascending order of average rank along one axis and their corresponding ranks are marked on the other axis.
The critical difference has been added to the average rank of each MIL approach in order to analyze whether the proposed approach is significantly better than the other MIL approaches. Two vertical red lines mark the gap between the end of the best performing MIL approach's tail and the start of the next significantly different MIL approach. From the figure, it is clear that the other existing MIL approaches, such as EMDD, DD, MILR, Citation k-NN, and MISVM, perform significantly worse than the best performing approach, which is the Min-Hausdorff based MIL-TWSVM. Thus, we can conclude that the proposed MIL-TWSVM is a suitable choice for multiple instance learning problems.
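The statistical procedure above can be reproduced with a short Python sketch. It assumes the standard chi-squared form of the Friedman statistic and the usual Nemenyi critical difference, written in terms of the number of datasets, the number of classifiers, and the studentized-range critical value. The function names and the small accuracy table are hypothetical (the values of Table 4 are not reproduced here); only the values 3.102, nine classifiers, and ten datasets are taken from the text.

```python
import math

def rank_with_ties(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and scores[order[end + 1]] == scores[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for j in order[pos:end + 1]:
            ranks[j] = shared
        pos = end + 1
    return ranks

def average_ranks(accuracy_table):
    """Average rank of each classifier (column) over all datasets (rows)."""
    n_datasets = len(accuracy_table)
    n_classifiers = len(accuracy_table[0])
    per_dataset = [rank_with_ties(row) for row in accuracy_table]
    return [sum(r[j] for r in per_dataset) / n_datasets
            for j in range(n_classifiers)]

def friedman_chi2(avg_ranks, n_datasets):
    """Chi-squared form of the Friedman statistic from average ranks."""
    k = len(avg_ranks)
    return (12 * n_datasets / (k * (k + 1))
            * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4))

def nemenyi_cd(q_alpha, n_classifiers, n_datasets):
    """Critical difference of the Nemenyi post hoc test."""
    return q_alpha * math.sqrt(n_classifiers * (n_classifiers + 1)
                               / (6 * n_datasets))

# Hypothetical accuracies: 3 datasets (rows) x 3 classifiers (columns).
toy_table = [
    [0.90, 0.80, 0.70],
    [0.85, 0.85, 0.60],
    [0.70, 0.90, 0.80],
]
toy_ranks = average_ranks(toy_table)
print(toy_ranks, friedman_chi2(toy_ranks, n_datasets=3))

# Critical difference for the setting reported in the text:
# q = 3.102, nine classifiers, ten datasets.
cd = nemenyi_cd(3.102, 9, 10)
print(round(cd, 3))
```

Two classifiers are then declared significantly different when the absolute difference of their average ranks is at least `cd`.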
This study has focused on multiple instance learning, in which a classifier learns from a set of feature vectors (a bag) instead of a single feature vector, and has proposed an MIL approach based on TWSVM, termed MIL-TWSVM. Each bag is represented by a vector of its dissimilarities to the other bags in the training set, and the proposed classifier is trained on this summarized information. Initially, the performance of the proposed MIL-TWSVM classifier has been compared across different dissimilarity scores on ten benchmark MIL datasets. We have also compared the performance of MIL-TWSVM with six existing MIL approaches. Experimental results demonstrate that the proposed approach achieves the highest predictive accuracy among the compared MIL approaches on all ten datasets, which supports the suitability of MIL-TWSVM in multiple instance learning scenarios. These findings are also supported by the statistical analysis performed using the Friedman test, which shows that MIL-TWSVM is significantly better than EMDD, DD, MILR, Citation k-NN, and MISVM. In the future, we are interested in extending MIL-TWSVM to the multi-instance multilabel scenario.
The authors declare that they have no conflicts of interest regarding the publication of this paper.
O. Maron and T. Lozano-Pérez, “A framework for multiple-instance learning,” in Advances in Neural Information Processing Systems (NIPS), vol. 10, MIT Press, 1998.
Z. H. Zhou, “Multi-instance learning: a survey,” Tech. Rep., Department of Computer Science and Technology, Nanjing University, 2004.
S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support vector machines for multiple-instance learning,” in Advances in Neural Information Processing Systems, vol. 15, pp. 561–568, MIT Press, Cambridge, Mass, USA, 2003.
Q. Zhang, S. A. Goldman, W. Yu, and J. E. Fritts, “Content-based image retrieval using multiple-instance learning,” in Proceedings of the International Conference on Machine Learning (ICML '02), vol. 2, pp. 682–689, 2002.
Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li, “Multi-instance learning by treating instances as non-I.I.D. samples,” in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 1249–1256, Montreal, Canada, June 2009.
H. Wang, F. Nie, and H. Huang, “Learning instance specific distance for multi-instance classification,” in Proceedings of the 25th AAAI Conference on Artificial Intelligence, pp. 507–512, San Francisco, Calif, USA, August 2011.
P. Viola, J. C. Platt, and C. Zhang, “Multiple Instance boosting for object detection,” in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS '05), vol. 18, pp. 1417–1424, December 2005.
B. Babenko, N. Verma, P. Dollár, and S. J. Belongie, “Multiple instance learning with manifold bags,” in Proceedings of the 28th International Conference on Machine Learning (ICML '11), pp. 81–88, 2011.
C. Leistner, A. Saffari, and H. Bischof, “Miforests: multiple-instance learning with randomized trees,” in Proceedings of the European Conference on Computer Vision (ECCV '10), pp. 29–42, Springer, Berlin, Germany, 2010.
B. Zeisl, C. Leistner, A. Saffari, and H. Bischof, “On-line semi-supervised multiple-instance boosting,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), p. 1879, IEEE, San Francisco, Calif, USA, June 2010.
Z. Jorgensen, Y. Zhou, and M. Inge, “A multiple instance learning strategy for combating good word attacks on spam filters,” The Journal of Machine Learning Research, vol. 9, pp. 1115–1146, 2008.
L. Sørensen, M. Loog, D. M. J. Tax, W. J. Lee, M. De Bruijne, and R. P. W. Duin, “Dissimilarity-based multiple instance learning,” in Structural, Syntactic, and Statistical Pattern Recognition, vol. 6218, pp. 129–138, Springer, 2010.
J. Wang and J. D. Zucker, “Solving multiple-instance problem: a lazy learning approach,” in Proceedings of the 17th International Conference on Machine Learning, pp. 1119–1125, San Francisco, Calif, USA, 2000.
T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola, “Multi-instance kernels,” in Proceedings of the International Conference on Machine Learning (ICML '02), vol. 2, pp. 179–186, 2002.
H.-Y. Wang, Q. Yang, and H. Zha, “Adaptive p-posterior mixture-model kernels for multiple instance learning,” in Proceedings of the 25th International Conference on Machine Learning, pp. 1136–1143, Helsinki, Finland, July 2008.
D. Zhang, Y. Liu, L. Si, J. Zhang, and R. D. Lawrence, “Multiple instance learning on structured data,” in Advances in Neural Information Processing Systems (NIPS), vol. 24, pp. 145–153, 2011.
Q. Zhang and S. A. Goldman, “EM-DD: an improved multiple-instance learning technique,” in Advances in Neural Information Processing Systems (NIPS), vol. 14, pp. 1073–1080, MIT Press, 2001.
Y. Chevaleyre and J. D. Zucker, “Solving multiple-instance and multiple-part learning problems with decision trees and rule sets. Application to the mutagenesis problem,” in Advances in Artificial Intelligence: 14th Biennial Conference of the Canadian Society for Computational Studies of Intelligence, AI 2001 Ottawa, Canada, June 7–9, 2001 Proceedings, vol. 2056 of Lecture Notes in Computer Science, pp. 204–214, Springer, Berlin, Germany, 2001.
Z. Fu and A. Robles-Kelly, “Fast multiple instance learning via L1,2 logistic regression,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, IEEE, Tampa, Fla, USA, December 2008.
J. Ramon and L. De Raedt, “Multi instance neural networks,” in Proceedings of the ICML-2000 Workshop on Attribute-Value and Relational Learning, pp. 53–60, 2000.
Z. H. Zhou and M. L. Zhang, “Neural networks for multi-instance learning,” in Proceedings of the International Conference on Intelligent Information Technology, pp. 455–459, Beijing, China, 2002.
O. Maron and A. L. Ratan, “Multiple-instance learning for natural scene classification,” in Proceedings of the International Conference on Machine Learning (ICML '98), pp. 341–349, Madison, Wis, USA, 1998.
Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, and E. I.-C. Chang, “Deep learning of feature representation with multiple instance learning for medical image analysis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '14), pp. 1626–1630, Florence, France, May 2014.
J. Wu, Y. Yinan, C. Huang, and Y. Kai, “Deep multiple instance learning for image classification and auto-annotation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '15), pp. 3460–3469, IEEE, Boston, Mass, USA, June 2015.
X. Ding, G. Zhang, Y. Ke, B. Ma, and Z. Li, “High efficient intrusion detection methodology with twin support vector machines,” in Proceedings of the International Symposium on Information Science and Engineering (ISISE '08), vol. 1, pp. 560–564, Shanghai, China, December 2008.
D. Tomar, B. R. Prasad, and S. Agarwal, “An efficient Parkinson disease diagnosis system based on least squares twin support vector machine and particle swarm optimization,” in Proceedings of the 9th IEEE International Conference on Industrial and Information Systems (ICIIS '14), pp. 1–6, IEEE, Gwalior, India, December 2014.
Z. Wu and C. Yang, “Study to multi-twin support vector machines and its applications in speaker recognition,” in Proceedings of the International Conference on Computational Intelligence and Software Engineering (CiSE '09), pp. 1–4, Wuhan, China, December 2009.
C. W. Hsu, C. C. Chang, and C. J. Lin, “A practical guide to support vector classification,” Tech. Rep., Department of Computer Science, National Taiwan University, Taipei, Taiwan, 2013.
P. Nemenyi, Distribution-free multiple comparisons [Ph.D. thesis], Princeton University, 1963.