#### Abstract

This paper aims at developing new theory-driven biomarkers by implementing and evaluating novel techniques from resting-state scans that can be used in relapse prediction for nicotine-dependent patients and future treatment efficacy. Two classes of patients were studied. One class took the drug N-acetylcysteine and the other class took a placebo. Then, the patients underwent a double-blind smoking cessation treatment and the resting-state fMRI scans of their brains before and after treatment were recorded. The scientific research goal of this study was to interpret the fMRI connectivity maps based on machine learning algorithms to predict the patient who will relapse and the one who will not. In this regard, the feature matrix was extracted from the image slices of brain employing voxel selection schemes and data reduction algorithms. Then, the feature matrix was fed into the machine learning classifiers including optimized CART decision tree and Naive-Bayes classifier with standard and optimized implementation employing 10-fold cross-validation. Out of all the data reduction techniques and the machine learning algorithms employed, the best accuracy was obtained using the singular value decomposition along with the optimized Naive-Bayes classifier. This gave an accuracy of 93% with sensitivity-specificity of 99% which suggests that the relapse in nicotine-dependent patients can be predicted based on the resting-state fMRI images. The use of these approaches may result in clinical applications in the future.

#### 1. Introduction

Smoking cigarettes is the leading cause of preventable mortality in the United States, with around 50% of lifelong smokers dying from illnesses such as heart disease, stroke, and cancer [1]. In addition to this, insomnia, tremors and quivering, lightheadedness, high blood pressure, heart attack, and decreasing bone density are just a few symptoms that nicotine could cause. Nicotine has been shown to have addictive potential [2, 3]. Moreover, nicotine has been shown to activate the mesolimbic dopamine system, specifically, the ventral tegmental area, which reinforces the effects of nicotine [4]. Developing a cessation treatment with a compound that will reduce a patient’s dependence on nicotine, as well as the effects of withdrawal, could help millions of people [5]. One of these new, potentially effective compounds is N-acetylcysteine (NAC) [6]. NAC (, mw: ) is a derivative of the amino acid cysteine prodrug which is approved as a mucolytic agent and an acetaminophen antidote. NAC restores the basal level of glutamate in the accumbens which may reduce the drug seeking behavior [7].

In a pilot study, a trend was observed for fewer withdrawal symptoms after smoking cessation for subjects taking NAC versus subjects taking the placebo [8]. The goal of this paper is to show that NAC affects brain functions related to addiction. NAC has been shown to normalize glutamate levels in cocaine-dependent patients [6], which indicates that it could also affect smoking behavior by changing glutamate levels. Although there have been negative trials of NAC on cessation of cocaine use in cocaine dependence, some trials do show that in subjects who were already abstinent at the start of the trial, NAC may be useful in reducing relapse in these subjects [7, 9, 10]. Functional magnetic resonance imaging (fMRI) is a noninvasive technique used for functional brain mapping [11, 12]. Interpreting fMRI connectivity maps based on machine learning algorithms represents an important step towards a computational classification model [13–15]. Employing machine learning algorithms and statistical inference methods are commonly used to implement computational models to find relations between fMRI data and related tasks. Machine learning may also result in clinical applications in the future. For instance, in mental health disorders, no clinical factors are known that can predict whether patients will respond to a specific treatment. Being able to identify which patients will respond to a treatment, using baseline neuroimaging data, could result in improved treatment outcomes, for instance, by early identification through fMRI. Thus, personalized medicine could in the future result in targeted interventions [16–20].

Previously, Smitha et al. [21] have investigated the functional connectivity applied to region of interests (ROIs) extracted by Brodmann template using CONN toolbox. The classification accuracy was less than 50% using linear support vector machines (Linear SVM). Since Linear SVM results in low accuracy, it is important to employ new methods for extracting ROIs and multivariate machine learning algorithms for classifications. In addition, the use of decision trees results in even lower accuracy [22]. Moreover, the suitability of the random subspace ensemble method for fMRI classification has been investigated [23]. It has previously been shown that decision trees were successfully used for variable selection and classification in fMRI brain activities [24–27].

The aim of this study was to be able to develop a way to use fMRI data to predict which patients will relapse and which will not. This classification is normally done after 6 months of treatment or 12 months past the start of addiction treatment [28, 29]. It is based on self-reports by the addicted persons. When it is possible to classify responders to NAC and placebo based on the fMRI data using machine learning algorithms, there would be a tool which could be used in a future sample, to predict who will relapse after a mixed intervention and who needs additional treatment when trying to stop. This would be useful in clinical practice to make earlier decisions about when to start NAC treatment (based on their fMRI classification): at the start of treatment, instead of having to wait for 6 or 12 months to see whether NAC has effects.

In this study, four different machine learning classifiers along with features based on high activity areas in the brain extracted by a novel scheme for voxel selection were employed. Areas of high activity are defined to be those where more oxygen-rich blood is flowing and fMRI is able to map these areas. In addition to this, the accuracy of classification will rely heavily on how the data is reduced. Thus, three different data reduction techniques were employed to perform this study. Tahmassebi et al. [30] have recently shown that components could be employed as the optimal number of features extracted using their novel voxel selection scheme in the fMRI smoking cessation study. First, validation of the voxel selection schemes as well as the machine learning classifiers is required. Therefore, the extracted features from fMRI brain scans were employed to predict who received the drug NAC and who received a placebo. Although, this prediction will not have any clinical utility, it would guarantee that the machine learning models are valid and ready to be applied for prediction relapse versus nonrelapse.

#### 2. Data Acquisition

The main goal of this study was to determine whether or not the drug N-acetylcysteine (NAC) would decrease nicotine dependency. NAC may have an effect on relapse in smoking cessation [8, 31]. In this regard, regular smokers participated in this treatment study at the Spinoza Center of the University of Amsterdam. heavy smokers who wanted to quit took the drug NAC (class ) and the other subjects took a placebo (class ) for two weeks. Anatomical and functional scans of their brains were taken at baseline and after two weeks of NAC treatment. Then, the relapse data were assessed at six months after NAC treatment [32]. The Spinoza Center of University of Amsterdam is equipped with a T Intera MRI scanner (Philips Health care, Best, The Netherlands) with a 32-channel SENSE head coil to obtain MRI data. The subjects were asked to keep their eyes closed, stay relaxed, and stay awake during the scan (resting state). Two hundred 3-dimensional functional images of the subjects’ brains of size with a voxel size of with seconds as repetition time were taken due to the sensitivity to blood oxygen level-dependent (BOLD) contrast by the gradient-echo planar sequence. In addition to this, the 3-dimensional anatomical data of size with a voxel size of have been acquired. Figure 1 shows slices of the brain from one patient in all three axes for both the anatomical (Figure 1(a)) and the functional (Figure 1(b)) representations.

**(a) Anatomical**

**(b) Functional**

Magnetic fields of the nuclei in oxygen-rich blood are flipped due to the combination of a strong magnetic field and radio waves. This produces a detailed map of the regions where the ratio of flow of oxygen-rich blood to the brain is high which explains the high activity areas of the brain. This is called BOLD signal which is also studied in this paper. The BOLD signal is generally modeled as the convolution of the stimulus function with Hemodynamic Response Function (HRF) [33–36]. The energy due to an influx of oxygenated blood to a local area of neuronal activity produces the BOLD signal. Oxygenated hemoglobin has a diamagnetic effect. However, hemoglobin would show paramagnetic characteristics once it is deoxygenated. The MRI machine produces a magnetic field which aligns the randomly oriented atomic nuclei within the direction of the magnetic field [21, 37].

#### 3. Data Preprocessing

We were given the fMRI data in 4-dimensional spatiotemporal NIFTI (Neuroimaging Informatics Technology Initiative) format. The data contains subject-dependent artifacts due to the long process of the scans, possible movements of the subjects, and physiological noise [38]. The fMRI data analysis pipeline is made using a combination of Statistical Parametric Mapping (SPM12) and FMRIB Software Library (FSL) to increase the BOLD contrast to noise ratio.

The preprocessing stage [39] includes the following. Motion correction to shift the voxels location which would not match the anatomical location. 1% variation of the signal due to the movement of the subject can be greater than the BOLD signal which are extracted as features. Segmentation and realignment to correct in-plane rotations and translations of the head within the image by selecting the first image as the reference. Temporal slice timing to correct and shift in order the interleaved fashion slices. Smoothing to increase the signal-to-noise (SNR) ratio using a general linear model along with a Gaussian to approximate the haemodynamic response function (HRF) [40, 41]. Normalization using a Gaussian full width half maximum (FWHM) kernel of mm for each voxel. Coregistration to map functional information into anatomical space using Talairach standard coordinates [42] and the Montreal Neurological Institute (MNI) [43] brain templates. Figure 2 demonstrates the correlation matrix of raw (Figure 2(a)) and preprocessed (Figure 2(b)) data. It should be noted that Figure 2 is reproduced from Tahmassebi et al. (2017) [44] [under the Creative Commons Attribution License/public domain].

**(a) Raw**

**(b) Preprocessed**

To extract high activity features from the big data, a novel voxel selection scheme (mask) [30, 45] was applied on 94,720,000 features of the preprocessed data for each subject. A mask is a 3-dimensional array of 0 s and 1 s, where 1 signifies keeping the voxel in that position and 0 indicates ignoring the voxel in the data. This makes the problem more feasible to feed the machine learning algorithms with 94,720 features which has share of the given big data.

#### 4. Features Extraction

Dealing with a size of 94,720,000 as the feature matrix with available computational equipment would be computationally expensive and inaccurate. After applying voxel selection scheme, the size of the feature matrix was reduced to 94,720, which is still a large number for a feature vector for classification. Here, three robust algorithms were employed for data reduction to find the feature matrix.

##### 4.1. Independent Component Analysis (ICA)

ICA [46] is a technique to separate a multivariate signal into multiple independent non-Gaussian signals. It extracts the hidden spatiotemporal structure in neuroimaging. Maximum independence of underlying signals is the essential assumption of ICA. In principle, independence of two random variables would lead to uncorrelation between variables [47]. ICA demixes the recorded brain signals at each voxel and extracts the relevant components along with the artifacts [48]. There are some algorithms such as infomax, JADE, and FastICA to employ this approach [49]. In this paper, the FastICA algorithm was used. The FastICA is a hierarchical and symmetric approach which minimizes the mutual information by employing non-Gaussianity and measurement of negentropy [48]. ICA tries to extract a feature matrix like from the full rank matrix . With having patients and features for each patient, the feature matrix size would be . It is desired to find a good approximation with respect to the source, , which is an unmixing matrix providing a linear decomposition of . Thus, for extracting features out of features, we have

Feeding vector into various classifiers would be the next level. We hope that ICA will tell us what regions of the brain share similar brain activities.

##### 4.2. Principal Component Analysis (PCA)

PCA transforms features of the original input data orthogonally to a new space to reduce redundancy and reach high information density. These variables in the new space are principal components which are linearly uncorrelated. PCA is also referred to as Karhunen-Loeve transformation or the Hotelling transform [50]. PCA normalizes the input data within the unit interval and chosen based variance. It was shown that better discriminatory results were found by choosing a larger variance. The first principal component has the largest variance within the data set and the last one has the smallest variance. Orthogonality of all principal components to each other gives us an orthogonal basis set. Let be the correlation matrix and the corresponding th eigenvalue of the matrix and the eigenvector matrix with columns. It should be noted that is an orthogonal matrix . By spectral theorem we have

We could create a new basis by choosing eigenvectors to rewrite the original data again using as the coefficient vector, which is the projection of onto the principal directions:

To demonstrate the importance of PCA, which is reducing the dimension of the data, we could have a rank approximation of the original data:

Employing all the above-mentioned equations leads us to a strategy such as subspace decomposition to find the largest eigenvalue and project the original data orthogonally onto its subspace. Thus, the eigenvalues closer to zero which contain the redundant information will be discarded [50].

##### 4.3. Singular Value Decomposition (SVD)

SVD is the factorization of matrix to the form , where is an unitary matrix, is an diagonal matrix, and is an unitary matrix. The diagonal values of are the singular values of the original matrix and the columns of and are the left and right singular values of the original matrix, respectively. In this paper, the diagonals of the matrix are used since this tells us the properties of the matrix. It can be used to compare with the other matrices and reduces the dimension of the matrix

The performance of different data reduction algorithms with different numbers of components has been previously investigated, which suggests that 10 components would be ideal for this analysis [30, 44, 51]. Figure 3 depicts the correlation matrices produced by getting the correlation of the independent components (Figure 3(a)), principal components (Figure 3(b)) and singular values (Figure 3(c)).

**(a) ICA**

**(b) PCA**

**(c) SVD**

#### 5. Classification

There are two classification strategies for medical images. For the first strategy, measurements of a set of features from a region in an image would be employed as the feature vector. This is called region-based classification. For the second strategy, voxel-based classification is used, where contextual or noncontextual information about every single voxel is used as feature vector to feed into the classifier [50]. In this paper, multivariate voxel-based analyses along with various classifiers such as decision tree and Gaussian Naive-Bayes with two implementations were employed [52].

##### 5.1. Decision Tree Algorithm

Decision tree is a machine learning algorithm which partitions a set of input data recursively. A decision tree structure is made of a root node, which has no incoming edges and zero or more outgoing edges, internal nodes, each of which has one incoming edge and more outgoing, and leaf nodes, each of which has one incoming and no outgoing edges [53]. Based on the volume of the data and available memory resources, decision tree algorithms can be implemented in a serial fashion such as CART (classification and regression tree) [54], C4.5 [55], and IDE3 (iterative dichotomizer 3) [56, 57] or parallel fashion such as SLIQ (supervised learning in ques) [58] and SPRINT (scalable parallelizable induction of decision trees) [59]. Dealing with a feature matrix of size after data reduction and voxel selection scheme motivates us to employ a serial algorithm to implement the decision tree. In addition to this, decision trees are easy to interpret by Boolean logic and can also be visualized.

###### 5.1.1. CART

The CART, by Breiman et al. [54], builds both classification and regression trees based on binary splitting of the features selected based on Langs et al. [60] index splitting measure. In principle, it follows Hunt et al.’s algorithm [61]. It yields the largest information gain at each node. For a given training data set and label vector , CART partitions the space recursively such that the matched instances and labels are clustered together [62]. For the amount of data at node , CART partitions the data into and subsets based on the splitting threshold and feature. The impurity at node is calculated through the impurity measure Gini as with as the proportion of class observations in node [62]. For a data set of size , the run time cost order to build the tree is and the query time is .

Based on the CART algorithm of how to build trees, if the splitting process continues to the point that there are few samples in each leaf of the tree, it is likely to overfit the data. On the other hand, a small tree also might not capture the important structural information about the sample space. This problem is known as the horizon effect. Therefore, the complexity of the tree in such a way that the estimated true error is low is desired. In this regard, a reduced error pruning algorithm, which is a bottom-up fashion pruning method, was employed. This improves predictive accuracy by starting at the leaves and replacing each node with its most popular class to reduce overfitting and increase the simplicity of the tree and speed of the process. This process continues until the prediction accuracy is not affected. The optimization part was repeated 51 times for each of the data reduction methods to reach the most efficient result.

##### 5.2. Naive-Bayes Algorithm

Naive-Bayes is a classification technique based on Bayes Theorem with an assumption of independence among predictors to model probabilistic relationships between the feature matrix and the class labels [53]. In simple terms, a Naive-Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Bayes Theorem combines prior knowledge of the classes with new evidence gathered from training data [53]. First, the Naive-Bayes model builds the frequency table of the training data set. Then, it creates the likelihood table by calculating the probabilities. Finally, it calculates the posterior probability for each class and the class with maximum posterior probability is the result of the prediction. Naive-Bayes classifier is easy to implement, useful for big data problems, and known to outperform even highly sophisticated classifiers. In this paper, a standard algorithm for Gaussian Naive-Bayes and an optimized version of Naive-Bayes were employed.

###### 5.2.1. Gaussian Naive-Bayes (GNB)

The essential principle in Bayes method is assuming a known a priori and then minimization of the classification error probability, respectively. The class-conditional density function could be known or estimated from the available training dataset. During Bayesian estimation, the training set conditioned density function is getting updated by the training set which acts as observations to allow the conversion of the a priori information into an a posteriori density [50].

A simple introduction can be given by considering two pattern classes: NAC and Placebo. To make the mathematical notations easier, we call them and , respectively. Recalling the Bayes rule, we have

For the two-class patterns and , we have two Bayes classification rules as follows:

Considering the Gaussian probability distribution function with as the mean value and as the covariance matrix for discriminant functions makes it more feasible to be solved.

By choosing a monotonic logarithmic discriminant function we have

Now, by calculating the mean vector and covariance matrices of the discriminant function for each class from the training data, the data can be separated by a hyperplane (if they have an equal covariance matrix) or hyperquadrics (if they have an unequal covariance matrix).

###### 5.2.2. Optimized Naive-Bayes (ONB)

To optimize the Naive-Bayes standard algorithm, the bag-of-token model was employed [63]. In this way, the value of each feature is calculated based on the nonnegative number of occurrences of token in the observation. For estimated probability we havewhere is the weighted number of occurrences of token in class , is the total weighted number of occurrences of all tokens in class , and is the number of instances in the training set. Then, the classifier predicts the class label for each observation based on the estimated posterior probability presented in (10). In this way, each observation is assigned to the class with the maximum posterior probability [64].

As discussed, the participants were indeed randomized at the first session to receive NAC or placebo. In addition to this, we first learned the two classes (relapse versus nonrelapse) and then tested them in a randomized way choosing randomly participants from the two groups. Moreover, due to the low numbers of participants, -fold cross-validation was used to overcome overfitting. In principle, in an overfitted model, the underlying relationships of the entire database are not learned, which might give falsified information about the process. In this paper, 10-fold cross-validation was used to test the accuracy of the classification. In this procedure, our testing set is provided by leaving subjects out of the data set and the training set is provided by aggregating the other 9 folds (35 subjects) to use in the predictive process. Each classifier’s discriminant function was fitted on the training data set for each fold. This was done times in order to cover all the subjects in the testing set. The classification accuracy in each fold is the number of times that the model predicted correctly divided by the number of subjects in the testing fold, (4). The total classification accuracy is the average of the classification accuracies in all 10-folds.

#### 6. Results and Discussion

Two classification tasks with three major data reduction methods, ICA, PCA, and SVD, were developed in Python [62, 65] and MATLAB [64]. Based on the previous studies using this database, the data reduction was done by keeping 10 components [30, 44, 51]. The classification tasks were done in two different phases: validation and relapse. For the model validation as discussed in Section 6.1, the classification was done based on two classes NAC (class ) and placebo (class ). This will not have any clinical benefits, but it would guarantee that the voxel selection schemes which were employed to extract features from fMRI brain scans were valid. In addition to this, due to the equal number of subjects for each class (19 NAC, 20 placebo), the classification task is unbiased. This would present the performance of the proposed model without overfitting. Next, classification was done to separate subjects into two groups of subjects (relapse (class ) and nonrelapse (class )). In this case, class labels were made based on the number of cigarettes that each subject smoked after the treatment for six month. In this paper, people who smoked less than cigarettes during the six months past the treatment were chosen for nonrelapse class ( subjects) and the rest of the subjects were assigned to the relapse class ( subjects). In this section, the results regarding the classification tasks for the model validation and the relapse are presented and discussed.

##### 6.1. Validation

As discussed, the pruning process of the cart using the reduced error pruning algorithm was repeated times per data reduction algorithm. Table 1 presents the statistical parameters of the pruning process. In addition to this, the optimization process of the CART misclassification error is shown in Figure 4 for different numbers of terminal nodes. For each of the reduction techniques, cross-validation (solid purple line) and resubstitution (dashed pink line) were done and the best choice (magenta circle) based on the cross-validation accuracy was reported. It is preferred to use a simpler tree if it is roughly as good as a more complex tree according to Occam’s Razor Principle. Therefore, a cut-off value equal to the minimum cost plus one standard error (dotted black line) was employed.

**(a) ICA**

**(b) PCA**

**(c) SVD**

For the learning phase as Figures 4(a), 4(b), and 4(c) picture, the resubstitution error decreased as the number of terminal nodes increased. It is assured that the classifier can predict the training dataset very well. The most complex tree was made by SVD with terminal nodes. This pattern normally was expected to be repeated. However, for the cross-validation error in Figure 4(c) until the second terminal nodes of the tree, the classifier did not predict the testing samples very well. This might be the reason why the model finished with terminal modes. The lowest misclassification error for 10-fold cross-validation was presented in Figure 4(a) and also in Table 1 with for ICA.

To summarize the number of samples predicted correctly (true positive (TP) for positive class, and true negative (TN) for negative class) and incorrectly (false positive (FP) for positive class and false negative (FN) for negative class) they were calculated as a confusion matrix [53]. The fraction of positive instances predicted correctly defines true positive rate (TPR) or sensitivity. In addition to this, true negative rate or specificity is defined as the fraction of correct prediction of negative instances. Figure 5 shows the confusion matrices of classification using the CART with different data reduction methods.

**(a) ICA**

**(b) PCA**

**(c) SVD**

As shown in Figure 5, the rows of the confusion matrix plot correspond to the predicted class (Output Class), and the true class (Target Class) is shown as the columns of the confusion matrix plot. In addition to this, the prediction accuracy of the classes of observations by the trained classifier is depicted on the diagonal cells while the off diagonal cells depict where the classifier failed at predictions of the classes of observations. The column on the far right of the plot shows the accuracy for each predicted class, while the row at the bottom of the plot shows the accuracy for each true class. The overall accuracy is shown in the bottom right cell of the plot [64].

As seen in Figure 5, the best classification accuracy was reported for SVD (Figure 5(c)). ICA and SVD feature extraction schemes performed well on placebo predictions since the CART predicted all the subjects in placebo class correctly (). Employing SVD along with the CART, out of placebo predictions, were correct and were incorrect. Out of NAC predictions, were correct, and out of subjects in the placebo class, were correctly predicted as placebo with zero error. Out of the subjects in the NAC class, were correctly classified as NAC and were classified as placebo. Overall, of the predictions were correct and were incorrect classifications based on the cross-validation. The overall accuracy for ICA and PCA was and , respectively. It should also be noted that the model with SVD being more complex was already discussed. This complexity becomes more clear in Figure 6(c). As we compare the tree evolution of the models, it can be seen that the ICA model can improve with reduced complexity and high accuracy. Moreover, the contour plots of the probability surface for classification using the CART were presented in Figure 7. In a statistical-classification, a decision boundary or decision surface is a hypersurface that partitions the underlying feature space into two sets, one for each class. The classifier classifies all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class. Decision boundaries are not always clear-cut. In fact, they are the transition from one class in the feature space to another, which is not discontinuous, but gradual. In this plot, red color was chosen to demonstrate the highest probability and blue color represents the lowest probability as well. The black scatter points are the samples predicted by the classifier for each class. It can be seen that it is less probable to predict a subject in the surface area with blue color.

**(a) ICA**

**(b) PCA**

**(c) SVD**

**(a) ICA**

**(b) PCA**

**(c) SVD**

In classification with GNB, as shown in Figure 8, the confusion matrices contain high values for FP, which means the classifier failed at the prediction of the subjects in the class NAC. The low specificity can be seen clearly in Figure 8(b) where the GNB just predicted subjects in the NAC class and did not detect the other 15 subjects in that class. That is the reason FP is , which is relatively high for any classification task. Furthermore, this can also be seen clearly in Figure 9, where scatter black points can be found in blue parts of the surface areas, specifically for class prediction. This matches the previous results with confusion matrices and confirms that, overall, the GNB classifier could not predict the subjects in the NAC class for different feature matrices based on ICA, PCA, and SVD very well. This is more obvious in Figure 9 where the classifier has predicted the subjects in the NAC class with probability and the subjects in the placebo class with a probability of more than . Results in Figure 9 confirm the accuracies presented in Figure 8.

**(a) ICA**

**(b) PCA**

**(c) SVD**

**(a) ICA**

**(b) PCA**

**(c) SVD**

Similar to the pruning process of the CART, the ONB algorithm ran for times as well, and the results are presented in Table 2. The performance of the optimization algorithm is obvious by comparing the confusion matrices in Figures 8 and 11. For example, the accuracy for GNB with ICA is improved by . More importantly, the optimization algorithm improved the strength of the classifier in prediction of the subjects in the NAC class. As shown in Figure 8(a), GNB predicted , , and subjects in the NAC class for ICA, PCA, and SVD algorithms, respectively. On the other hand, the correct number of predictions for the NAC class was almost doubled as shown in Figure 11(a). The best result, as presented in Table 2, was achieved using ICA. This is better depicted in Figure 10, which pictures the convergence of the minimum estimated and observed cross-validation error. As seen, the estimated error matches the observed error with ICA. On the other hand, the estimated error started fluctuating after function evaluations for PCA and SVD.

**(a) ICA**

**(b) PCA**

**(c) SVD**

**(a) ICA**

**(b) PCA**

**(c) SVD**

To display the trade-off between sensitivity and specificity of the classifiers, Receiver Operating Curves (ROC) have been employed [53]. The closer the curve follows the left-hand border and the top border of the ROC space, the more accurate the test. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. The ROC curve visualizes the detection performance of the classifier based on TPR and FPR. As the number of features increases, the classifier’s performance increases until we reach the optimal number of features due to the curse of dimensionality. Here, 10 components were chosen as the optimal number. It should be taken into account that increasing the number of features decreases the performance of the classifier. The area under the ROC curve (AUC) for each data reduction method was also calculated. An AUC of confirms the ideal performance of the classifier and an AUC of confirms that the performance of the classifier is not acceptable.

Figure 12 shows the ROC curves for the CART, the GNB, and the ONB classifiers with the three feature selection schemes: ICA, PCA, and SVD, respectively. The performance of the ICA as the feature selection scheme along with the ONB with an AUC of and the CART with an AUC of as classifiers is the best. Moreover, the CART classifier shows an acceptable performance along with all the employed data reduction algorithms. This gives the conclusion that the ICA extracted structural information to be classified by the CART.

**(a) ICA**

**(b) PCA**

**(c) SVD**

##### 6.2. Relapse

Next, the validated model (previously presented in Section 6.1) was employed to predict relapse in nicotine-dependent subjects. Similarly, the CART model underwent an optimization procedure using the reduced error pruning algorithm using 10-fold cross-validation with 51 runs to achieve the best estimation of the error. Table 3 presents the statistical parameters of the estimated error of the CART. As seen, both errors for the original tree and the pruned tree are presented and the best estimated error was achieved using PCA as the feature extraction scheme.

Figure 13 demonstrates the bias-variance trade-off phenomenon [66]. The resubstitution error is the training error of the built tree for prediction of the subjects in terms of relapse and nonrelapse. As shown, the substitution error decreased through the number of terminal nodes for each of the employed data reduction algorithms. The tree was made using , , and terminal nodes for ICA, PCA, and SVD, respectively. Based on Table 3 and Figure 13(b), it can be concluded that the PCA algorithm along with the CART showed the best performance. The cross-validation estimate of the true error rate decreases as we increase the number of terminal nodes.

**(a) ICA**

**(b) PCA**

**(c) SVD**

As shown in Figure 4(a), the cross-validation error rate started increasing right after the first terminal node. The dotted black line is the minimum error plus one standard deviation of the training error was defined as a metric. In this regard, the desired estimated error rate using cross-validation has to be below this line. As shown, the PCA has the largest margin with the defined metric line and the lowest error rate with with respect to ICA and SVD data reduction algorithms.

This result matches the confusion matrices presented in Figure 14. As previously discussed, it is desired to predict the relapse of the subjects in the class . However, based on the confusion matrices, the accuracy of was achieved, but the CART classifier did not predict the subjects in the nonrelapse class very well. Only out of subjects in the nonrelapse class were predicted by the CART along with the PCA feature extraction scheme. This is also shown in Figure 15(c), where the subjects in the class (nonrelapse) were technically predicted on the edge of the probability surfaces and closer to (blue color). This might be due to the nature of the decision trees which are usually suitable for balanced binary-class problems [53].

**(a) ICA**

**(b) PCA**

**(c) SVD**

**(a) ICA**

**(b) PCA**

**(c) SVD**

Alternatively, the Naive-Bayes classifier based on its probabilistic nature solves the prediction classification problems implicitly. This would be an alternative to predict the subjects in the nonrelapse class better. Table 4 illustrates the statistical parameters of the estimated true error rate using 10-fold cross-validation after 51 runs for optimized Naive-Bayes classifier.

As presented, the best error was achieved along with the SVD data reduction technique with a cross-validation error of with a prediction accuracy of as depicted in Figure 16(c). Employing a probabilistic classifier while dealing with a nonbalanced binary-class problem outperformed the CART with an accuracy of and for prediction of the nonrelapse and relapse class, respectively. This result is clearly presented in Figure 17(b), where the GNB predicted subjects out of subjects in the class as shown in the counter plot with probability value of 1 (red color).

**(a) ICA**

**(b) PCA**

**(c) SVD**

**(a) ICA**

**(b) PCA**

**(c) SVD**

By optimizing the Naive-Bayes algorithm after runs, the classifier was able to predict the subjects in the nonrelapse even better. As shown in Figure 18, the solid line is the observed error rate and the dashed line is the cross-validation estimate of the true error rate. After about function evaluations for ICA, the estimated error rate did not match the observed error rate. However, it reached the minimum estimated error of as presented in Table 4. On the other hand, the estimate of the true rate exactly matched the observed error rate employing SVD data reduction scheme and reached the minimum of .

**(a) ICA**

**(b) PCA**

**(c) SVD**

Both ICA and SVD algorithms reached the minimum estimated error, but the overall prediction accuracy of SVD was better than ICA. It is true that the ICA reached an estimated error better than PCA, but the ONB classifier along with the PCA algorithm even predicted the subjects with better accuracy. This could be due to the same nature of the PCA and SVD as previously discussed in Section 4. The ONB along with the ICA algorithm predicted all the subjects in the relapse class correctly. However, it failed at the prediction of of the subjects in the nonrelapse class as shown in Figure 19(a). In conclusion, ICA could not extract enough structural information to be employed in prediction of nonrelapse subjects for this study.

**(a) ICA**

**(b) PCA**

**(c) SVD**

Figure 20 visualizes the performance of the classifiers and the trade-off between sensitivity and specificity in which any increase in sensitivity will be accompanied by a decrease in specificity. As seen, the ONB classifier along with SVD predicted the subjects in relapse and nonrelapse classes with an AUC of , which is an extraordinary result. In addition to this, PCA along with the ONB classifier also showed a reasonable accuracy with an AUC of . This might be due to the similar nature of the principal components and singular values, which showed similar performance for this database. As seen, the ONB classifier that did a phenomenal job separating the relapse and nonrelapse classes has an ROC curve that comes close to the upper left corner of the plot. It should be noted that even ICA, along with the ONB classifier, showed reasonable results in predicting subjects in the relapse class. However, it seemed that ICA could not extract structural features to detect the subjects in the nonrelapse class. This might be the reason that the FN values were high. This is so obvious by considering ICA along with the CART classifier, where the line represents a classifier that did not do better than random prediction.

**(a) ICA**

**(b) PCA**

**(c) SVD**

#### 7. Conclusions and Future Work

The scientific goal of this study was to develop new theory-driven biomarkers by implementing and evaluating novel techniques from resting-state brain scans that can be used in relapse prediction for nicotine-dependent patients and future treatment efficacy. Two classes of patients were studied, one took the drug N-acetylcysteine and the other took a placebo. The patients underwent a double-blind smoking cessation treatment and the resting-state fMRI of their brains before and after treatment was recorded. The high dimensionality of the fMRI data taken from patients made it difficult for the classification tasks to employ the original preprocessed data as input. In this regard, data reduction algorithms including ICA, PCA, and SVD were employed. The CART decision tree and the Naive-Bayes classifier with two different implementations were chosen for the classification tasks. Based on the results, the following conclusions can be drawn:(1)The proposed model, including the features extracted from the resting-state fMRI brain scans, was validated by classifying the subjects into NAC and placebo classes. The optimized Naive-Bayes along with independent component analysis gave an accuracy of with sensitivity-specificity of for the validation set. The validated feature model based on singular value decomposition along with optimized Naive-Bayes classifier gave an accuracy of with sensitivity-specificity of for the relapse set.(2)The validation results indicated that independent component analysis can be employed to extract structural information to be used in a balanced-unbiased classification problem using both explicit and implicit classification algorithms.(3)The relapse results showed that singular value decomposition would extract critical features to be used in an unbalanced-biased classification problem employing implicit classification algorithms.(4)The relapse results showed that interpreting fMRI connectivity maps based on machine learning algorithms might result in developing novel theory-driven biomarkers with clinical applications in future.(5)All the analysis was based on the difference between baseline and follow up scans, which were acquired in the resting-state. Thus, it can be assumed that any difference between the groups is the result of NAC, since this was the only variable that differed between the groups. In addition to this, there were no differences between the groups at baseline, except the motivation to change their behavior.

In future work, deep learning and convolutional neural networks (CNN) can be employed to maximize the information for the training process. As discussed, the fMRI data is given in 4-dimensional NIFTI format (three spatial dimensions, one temporal dimension). In fact, a 3D movie for each subject for 200 snapshots before and 200 snapshots after the treatment is given. In this approach, the CNN model can be trained employing different kernels, subsampling, max-pooling, and padding, along with fully connected networks to maximize the amount of meaningful information in the training process. The trained model can be used in relapse prediction based on fMRI scans without using any voxel selection schemes.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

#### Acknowledgments

This work is partially supported by ONR Grant “Gulf of Mexico Spring School (GMSS) Deep Learning Workshop.” In addition to this, the authors would like to thank Dr. Lauren Bottorf and Eitan Lees for the careful revision of the final version of the manuscript.