Autism spectrum disorder (ASD) is a developmental disorder that affects more than 1.6% of children aged 8 across the United States. It is characterized by impairments in social interaction and communication, as well as by a restricted repertoire of activity and interests. The current standardized clinical diagnosis of ASD remains subjective, relying mainly on behavior-based tests. Moreover, the diagnostic process is both time consuming and costly, placing a tremendous financial burden on patients’ families. Automated diagnosis approaches are therefore an attractive route to earlier identification of ASD. In this work, we set out to develop a deep learning model for automated diagnosis of ASD. Specifically, we propose a multichannel deep attention neural network (DANN) that integrates multiple layers of neural networks, an attention mechanism, and feature fusion to capture the interrelationships in multimodality data. We evaluated the proposed multichannel DANN model on the Autism Brain Imaging Data Exchange (ABIDE) repository with 809 subjects (408 ASD patients and 401 typical development controls). By integrating three scales of brain functional connectomes and personal characteristic data, our model achieved a state-of-the-art accuracy of 0.732 on ASD classification, outperforming multiple peer machine learning models in a k-fold cross validation experiment. Additional k-fold and leave-one-site-out cross validation experiments were conducted to test the generalizability and robustness of the proposed multichannel DANN model. The results show promise for deep learning models to aid the future automated clinical diagnosis of ASD.

1. Introduction

Autism spectrum disorder (ASD) has been estimated to occur in more than 1.6% of children aged 8 across the United States [1]. As a chronic neurological condition, ASD is characterized by impairments in social interaction and communication, as well as by a restricted repertoire of activity and interests [2–5]. Patients with ASD exhibit different levels of impairment, with intellectual ability ranging from above average to intellectual disability. In neuroscience, ASD remains a formidable challenge because of its high prevalence, complexity, and substantial heterogeneity, which require multidisciplinary efforts [6–8]. Although clinical therapies have been developed to treat the symptoms, diagnosing ASD remains a challenging task. Currently, behavior-based testing is the standard clinical method for diagnosing ASD [9]. However, the diagnostic process is not only time consuming but also costly [10], which results in a tremendous financial burden for patients’ families. Meanwhile, because ASD is a lifelong condition, patients may have difficulties in normal socialization and working environments, increasing the overall social costs. Therefore, an automated diagnosis approach is desirable for earlier identification of ASD.

Machine learning is a promising tool for investigating the replicability of patterns across larger, more heterogeneous datasets [11–13]. For automated diagnosis of ASD, personal characteristic (PC) data, such as intelligence quotient (IQ) and Social Responsiveness Scale (SRS) score, have been adopted in several studies [14–16]. In the study of ASD, IQ is a type of standard score derived from several standardized tests designed to assess human intelligence, and the SRS score is obtained from a 65-item standardized questionnaire regarding behaviors that are associated with ASD [17]. ASD is highly associated with intellectual disability, which is mainly measured by IQ. Meanwhile, some studies [18, 19] indicate that IQ discrepancy marks a meaningful phenotype in ASD. IQ is therefore an important biomarker for classifying ASD.

Neuroimaging data have also been investigated to explore ASD biomarkers in recent decades. To facilitate the ASD research community, the Autism Brain Imaging Data Exchange (ABIDE), an international collaborative project, has collected data (e.g., structural MRI (sMRI), resting-state functional MRI (rs-fMRI), and PC data) from over 1,000 subjects and made the whole database publicly available. This provides a common platform to test hypotheses, search for key biomarkers, and develop advanced statistical and machine learning algorithms. For example, Ghiassian et al. [20] proposed an automated classifier that combines the histogram of oriented gradients approach for feature extraction from sMRI and rs-fMRI data with support vector machines (SVMs) for decision making. Their method was tested on the ABIDE dataset and achieved 65.0% accuracy on a hold-out set. More recently, Sen et al. [21] developed the LEFMS learner, which applies a sparse autoencoder to extract features from sMRI and spatial nonstationary independent components on rs-fMRI data. An SVM was then utilized to classify ASD, improving accuracy by 0.042. Katuwal et al. [22] applied a random forest classifier to classify ASD and achieved an AUC of 0.61; adding verbal IQ and age to the morphometric features improved the AUC to 0.68. By introducing a hypergraph learning technique, Zu et al. [23] proposed a novel learning method to discover complex connectivity biomarkers that go beyond the region-to-region connections widely used in conventional brain network analysis.

Deep learning has had a profound impact on many data analytic applications, such as speech recognition, image classification, computer vision, and natural language processing [24]. Based on data-driven feature construction, deep learning provides a new direction for data analytic modelling. Over the past few years, an increasing body of literature has confirmed the success of feature construction using deep learning methods. Deep learning has been demonstrated to outperform traditional machine learning algorithms on numerous recognition and classification tasks [24–29], which has inspired researchers in the ASD community to apply deep learning approaches to ASD classification. Earlier, deep neural networks (DNNs) were applied to identify ASD patients using rs-fMRI [26]. That model achieved 70% accuracy by using the functional connectivity (FC) matrix as features for model training.

Kong et al. [27] constructed individual functional brain networks using the rs-fMRI data of 182 subjects from NYU Langone Medical Center, a data site within the ABIDE repository. FC features were used to represent the networks of all subjects and were further ranked using F-score. Then, a stacked sparse autoencoder-based DNN model was developed. The proposed method achieved a significant performance improvement over two existing algorithms.

More recently, ASD-DiagNet, a joint learning procedure using an autoencoder and a single-layer perceptron, was presented [28]. A data augmentation strategy based on linear interpolation of available feature vectors was also designed for the FC features of functional brain networks to ensure robust training of ASD-DiagNet. Evaluated on 1035 subjects from 17 different sites of the ABIDE repository, ASD-DiagNet achieved 70.1% accuracy, 67.8% sensitivity, and 72.8% specificity in 10-fold cross validation. In the model evaluation on individual data centers, ASD-DiagNet outperformed other state-of-the-art methods, increasing accuracy by up to 20% with a maximum accuracy of 80%.

In this work, we aim to develop a novel deep learning model for automated diagnosis of ASD. Specifically, we propose a multichannel deep attention neural network, called DANN, that integrates multiple layers of neural networks, an attention mechanism, and feature fusion to capture the interrelationships in multimodality data (functional neuroimaging data and PC data) and thereby distinguish ASD patients from typical development controls (TDCs). Attention mechanism-based learning is a type of deep learning and a recent trend for understanding which parts of historical information weigh more in predicting diseases [30, 31]. Taking advantage of the large heterogeneous dataset from ABIDE, multiscale brain functional connectomes and PC data were obtained as features. We systematically evaluated the diagnostic power of our multichannel DANN on ASD classification and compared its performance with that of peer machine learning models.

The rest of the paper is organized as follows. Section 2 describes the ASD data and the multichannel deep attention neural network. The experimental setup is given in Section 3, followed by the experimental results and discussion in Section 4. Finally, Section 5 concludes this work.

2. Materials and Methods

2.1. Subjects

We collected preprocessed rs-fMRI and PC data of 809 subjects from the publicly accessible ABIDE repository, including 408 ASD subjects and 401 TDC subjects. Detailed demographic information of the subjects is listed in Table 1. The incidence of ASD differs significantly between male and female subjects, and thus the majority of the subjects in the ABIDE dataset are male. There is no significant difference in age between the ASD and TDC groups. All three IQ scores differed significantly between the two groups. Later, the variables gender, age, and the three IQ scores were used as PC data in our ASD classification experiments.

2.2. Data Preprocessing

The rs-fMRI data of each subject were preprocessed using the Configurable Pipeline for the Analysis of Connectomes (CPAC), which includes slice timing correction, motion realignment, and intensity normalization. Nuisance variable regression was implemented through bandpass filtering and global signal regression strategies to clean confounding variations introduced by heartbeats and respiration, head motion, and low-frequency scanner drifts. Furthermore, boundary-based rigid body and FMRIB’s linear and nonlinear image registration tools were used to register functional to anatomical images. Then, both functional and anatomical images were normalized to template space (MNI 152). Three scales of brain functional connectomes were extracted in this work. Mean blood oxygen-level dependent (BOLD) time-series signals were calculated for three sets of regions of interest (ROIs), i.e., atlases: the Automated Anatomical Labeling (AAL) atlas, the Harvard-Oxford (HO) atlas, and the Craddock 200 (CC200) atlas. The weights of functional brain connectivity were defined as Pearson’s correlation coefficient between each pair of ROIs. For the AAL atlas, each subject was represented by an FC adjacency matrix, symmetric about the diagonal, in which each entry represents the brain connectivity between a pair of ROIs. Similarly, each rs-fMRI scan was also represented by symmetric FC adjacency matrices using the HO and CC200 atlases, respectively. In addition, from the 809 subjects, we obtained five PC variables: sex, handedness, full-scale IQ (FIQ), verbal IQ (VIQ), and performance IQ (PIQ).
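As a rough illustration of how an FC feature vector can be derived from ROI time series, the sketch below computes Pearson correlations between every pair of ROIs and flattens the strict upper triangle of the symmetric matrix. The function names and the toy time series are hypothetical and are not taken from the CPAC pipeline or from our implementation. The final line only checks that an atlas with 90 ROIs (an assumption inferred from the arithmetic) would yield the 4005 input features quoted for the AAL channel in Section 2.3.2.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def fc_matrix(roi_series):
    """Symmetric ROI-by-ROI connectivity matrix from mean BOLD time series."""
    n = len(roi_series)
    fc = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            fc[i][j] = fc[j][i] = pearson(roi_series[i], roi_series[j])
    return fc


def vectorize_upper(fc):
    """Flatten the strict upper triangle (diagonal excluded) into features."""
    n = len(fc)
    return [fc[i][j] for i in range(n) for j in range(i + 1, n)]


# Toy example: 4 ROIs with 6 time points each (made-up numbers).
toy_series = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0, 6.0],
]
features = vectorize_upper(fc_matrix(toy_series))
n_pairs_90_rois = 90 * 89 // 2  # unique ROI pairs for a 90-ROI atlas
print(len(features), n_pairs_90_rois)
```

Because the matrix is symmetric with a unit diagonal, only the n(n − 1)/2 off-diagonal pairs carry information, which is why the vectorized form is used as the model input.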

2.3. Multichannel Deep Attention Neural Network

2.3.1. Overview Structure

An overview of the multichannel DANN is given in Figure 1. It consists of multichannel input, multilayer perceptron (MLP), self-attention, fusion, and aggregation blocks. The components are described in the following sections.

2.3.2. MLP

The MLP block is composed of five layers: one dropout layer and four dense layers. The details of the block are shown in Figure 2.

A dropout layer, which helps prevent overfitting during model training, is applied to the input data, e.g., AAL FC (input size 4005). The white circles in Figure 2 denote units dropped according to the dropout probability. The dropout layer is followed by four dense layers with 1024, 512, 128, and 32 hidden units and “elu,” “tanh,” “tanh,” and “relu” activation functions, respectively.
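To make the layer ordering concrete, the following pure-Python sketch runs a forward pass through a dropout layer followed by four dense layers with the “elu,” “tanh,” “tanh,” and “relu” activations. It is only an illustration under stated assumptions: the layer sizes are shrunk from 4005 → 1024 → 512 → 128 → 32 so that it runs instantly, and the weight initialization, the dropout probability of 0.5, and all function names are hypothetical rather than taken from our Keras implementation.

```python
import math
import random


def elu(v):
    return v if v > 0 else math.exp(v) - 1.0


def relu(v):
    return max(0.0, v)


ACTIVATIONS = {"elu": elu, "tanh": math.tanh, "relu": relu}


def dense(x, weights, bias, activation):
    """One fully connected layer: activation(W x + b)."""
    act = ACTIVATIONS[activation]
    return [act(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (initialization is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_block(x, layers, rng=None, drop_prob=0.0):
    """Dropout on the input, then the stacked dense layers."""
    if rng is not None and drop_prob > 0.0:
        # Inverted dropout: zero some inputs and rescale the rest (training only).
        x = [0.0 if rng.random() < drop_prob else xi / (1.0 - drop_prob)
             for xi in x]
    for weights, bias, activation in layers:
        x = dense(x, weights, bias, activation)
    return x


rng = random.Random(0)
sizes = [40, 16, 8, 4, 2]  # shrunk stand-ins for 4005, 1024, 512, 128, 32
activations = ["elu", "tanh", "tanh", "relu"]
layers = []
for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations):
    w, b = init_layer(rng, n_in, n_out)
    layers.append((w, b, act))

x = [rng.uniform(-1.0, 1.0) for _ in range(sizes[0])]
out = mlp_block(x, layers)                             # inference: no dropout
out_train = mlp_block(x, layers, rng, drop_prob=0.5)   # training-style pass
print(len(out), len(out_train))
```

Dropout is active only when a random generator and a nonzero probability are passed, mirroring the usual distinction between training and inference.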

2.3.3. Self-Attention

Attention computes an alignment score between elements from two sources [32]. In particular, given an input FC adjacency matrix, which can be transformed into an FC adjacency sequence $x = [x_1, x_2, \ldots, x_n]$, and a representation of a query $q$, attention [33] computes the alignment score between $q$ and each element $x_i$ using a compatibility function $f(x_i, q)$. A softmax function then transforms the alignment scores into a probability distribution $p(z \mid x, q)$, where $z$ is an indicator of which element is important to $q$. That is, a large $p(z = i \mid x, q)$ means that $x_i$ contributes important information to $q$. This attention process can be formalized as

$$p(z = i \mid x, q) = \frac{\exp\left(f(x_i, q)\right)}{\sum_{j=1}^{n} \exp\left(f(x_j, q)\right)}.$$

The output weights each element $x_i$ according to its importance $p(z = i \mid x, q)$.

Additive attention mechanisms [33, 34] are commonly used attention mechanisms in which the compatibility function is parameterized by an MLP, i.e.,

$$f(x_i, q) = w^{\top} \sigma\left(W^{(1)} x_i + W^{(2)} q\right),$$

where $w$, $W^{(1)}$, and $W^{(2)}$ are learnable parameters with sizes determined by $d$, the dimension of $x_i$, and $\sigma(\cdot)$ is an activation function. In contrast to additive attention, multiplicative attention [35, 36] uses cosine similarity or inner product as the compatibility function $f(x_i, q)$, i.e.,

$$f(x_i, q) = \left\langle W^{(1)} x_i, W^{(2)} q \right\rangle.$$

In practice, although additive attention is expensive in time and memory, it usually achieves better empirical performance on downstream tasks.

Self-attention [37, 38] explores the importance of each feature to the entire FC given a specific task. In particular, $q$ is removed from the common compatibility function, which is formally written as

$$f(x_i) = w^{\top} \sigma\left(W^{(1)} x_i\right).$$

The output again weights each element $x_i$ according to its importance $p(z = i \mid x)$.
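The query-free additive scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not our implementation: the tanh activation, the absence of bias terms, the element-wise weighting of the output (rather than a weighted sum), and the parameter names `w1` and `w` are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_self_attention(x, w1, w):
    """Score each element with w^T tanh(W1 x_i), then weight x_i by its softmax score.

    x  : list of n element vectors, each of length d
    w1 : d_h x d matrix (hypothetical learnable parameter)
    w  : length-d_h vector (hypothetical learnable parameter)
    """
    scores = []
    for xi in x:
        hidden = [math.tanh(sum(a * b for a, b in zip(row, xi))) for row in w1]
        scores.append(sum(a * h for a, h in zip(w, hidden)))
    probs = softmax(scores)
    weighted = [[p * v for v in xi] for p, xi in zip(probs, x)]
    return probs, weighted


# Toy input: three one-dimensional elements with made-up parameters.
x = [[1.0], [2.0], [3.0]]
probs, weighted = additive_self_attention(x, w1=[[1.0]], w=[1.0])
print(probs)
```

With these toy parameters the score tanh(x_i) grows with x_i, so the largest element receives the largest attention weight, and the weights sum to one.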

2.3.4. Fusion

The fusion output u is obtained by combining the outputs of two dense layer blocks, which can capture the correlation between the two types of spaces. The combination is accomplished by a fusion gate with learnable parameters, as shown in Figure 3.
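One common way to realize such a gate is a sigmoid-weighted blend of the two inputs. The sketch below assumes this form, namely g = sigmoid(W1 h1 + W2 h2 + b) and u = g ⊙ h1 + (1 − g) ⊙ h2; the exact gate used in our model may differ, and the names `h1`, `h2`, `w1`, `w2`, and `b` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def fusion_gate(h1, h2, w1, w2, b):
    """Blend two equal-length block outputs with a learned element-wise gate."""
    a = matvec(w1, h1)
    c = matvec(w2, h2)
    gate = [sigmoid(ai + ci + bi) for ai, ci, bi in zip(a, c, b)]
    return [g * p + (1.0 - g) * q for g, p, q in zip(gate, h1, h2)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
h1, h2 = [1.0, 0.0], [0.0, 1.0]

# With zero weights and zero bias the gate is 0.5, so u is the plain average.
u_even = fusion_gate(h1, h2, zeros, zeros, [0.0, 0.0])

# A large positive bias drives the gate towards 1, so u follows h1.
u_first = fusion_gate(h1, h2, zeros, zeros, [10.0, 10.0])
print(u_even, u_first)
```

The gate lets the network decide, per output dimension, how much each of the two atlas-specific representations should contribute.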

2.3.5. Aggregation

To aggregate the dense layer, self-attention, and fusion blocks into a DANN, the outputs of the self-attention and fusion blocks can be concatenated, multiplied, or averaged. In our implementation, the outputs of the self-attention blocks and the fusion blocks are concatenated, together with the demographic data, into a single combined vector, which is followed by a dense layer and a sigmoid layer for classification.
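The concatenate-then-classify step can be sketched as follows. For brevity the example collapses the final dense layer and sigmoid into a single logistic unit, which is a simplification; the function and argument names are hypothetical, and the toy numbers are made up.

```python
import math


def aggregate_and_classify(attention_outputs, fusion_outputs, pc_features,
                           weights, bias):
    """Concatenate all block outputs with the PC data and apply a sigmoid unit."""
    combined = [v for block in attention_outputs + fusion_outputs for v in block]
    combined += list(pc_features)
    if len(weights) != len(combined):
        raise ValueError("weights must match the length of the combined vector")
    logit = sum(w * v for w, v in zip(weights, combined)) + bias
    return combined, 1.0 / (1.0 + math.exp(-logit))


attention_outputs = [[0.1, 0.2], [0.3, 0.4]]   # toy self-attention block outputs
fusion_outputs = [[0.5, 0.6]]                  # toy fusion block output
pc_features = [1.0]                            # toy demographic feature

combined, prob = aggregate_and_classify(
    attention_outputs, fusion_outputs, pc_features,
    weights=[0.0] * 7, bias=0.0,
)
print(len(combined), prob)
```

With all-zero weights the predicted probability is exactly 0.5, which serves as a sanity check before training.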

3. Experiment Setup

3.1. Model Evaluation

We conducted a comprehensive evaluation by applying the proposed multichannel DANN to the ABIDE dataset to distinguish ASD subjects from TDC subjects. Two evaluation strategies, k-fold cross validation and leave-one-site-out cross validation, were designed in our experiments. For k-fold cross validation, the whole ABIDE dataset was randomly divided into k portions. In each iteration, one portion of the data was used as testing data and the remaining (k − 1) portions as training data. This process was repeated k times until all data had been tested once. For the leave-one-site-out cross validation, we separated the whole ABIDE dataset according to data site. We removed the SBL site from this experiment because of its small number of subjects (N = 4), which left a total of 12 data sites. In each iteration, data from one site were used as testing data and data from the remaining 11 sites as training data. This was repeated 12 times until data from all sites had been evaluated as testing data. Both the k-fold cross validation and leave-one-site-out experiments were repeated 50 times to understand the variability of the results, and the mean and standard deviation (SD) were calculated. Student’s t-test was applied to test differences between continuous values, and the chi-square test was used for discrete values. One-way analysis of variance (ANOVA) was utilized to compare multiple conditions (i.e., multiple k-fold cross validation experiments). A p value < 0.05 was used for inferring statistical significance.
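The two data-splitting schemes can be illustrated with a short sketch. The helper names and the toy site labels are hypothetical; repeating an experiment, as done 50 times in this study, would correspond to calling the k-fold generator with different seeds.

```python
import random
from collections import defaultdict


def k_fold_splits(n_subjects, k, seed=0):
    """Yield (train_idx, test_idx) so that every subject is tested exactly once."""
    idx = list(range(n_subjects))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def leave_one_site_out(sites):
    """Yield (site, train_idx, test_idx), holding out one data site at a time."""
    by_site = defaultdict(list)
    for i, s in enumerate(sites):
        by_site[s].append(i)
    for site, test in by_site.items():
        train = [i for i, s in enumerate(sites) if s != site]
        yield site, train, test


# 10-fold split of 809 subjects.
kfold = list(k_fold_splits(809, 10))

# Toy site labels for six subjects (made-up assignment).
toy_sites = ["NYU", "NYU", "USM", "UM", "UM", "UCLA"]
loso = list(leave_one_site_out(toy_sites))
print(len(kfold), [site for site, _, _ in loso])
```

In the k-fold scheme the test sets partition the cohort at random, whereas in the leave-one-site-out scheme the partition is fixed by acquisition site, so the held-out data always come from a site the model has never seen.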

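We calculated true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts for the classification by comparing the classified labels with the gold-standard labels. Then, we calculated accuracy, sensitivity, precision, and F-score by

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \text{sensitivity} = \frac{TP}{TP + FN}, \quad \text{precision} = \frac{TP}{TP + FP}, \quad \text{F-score} = \frac{2 \times \text{precision} \times \text{sensitivity}}{\text{precision} + \text{sensitivity}}.$$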
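For reference, the confusion-matrix metrics can be computed as in the sketch below, which uses the standard definitions. Specificity, which is reported in the results, is included with its usual definition TN / (TN + FP). The function name and the example counts are made up, and the sketch does not guard against empty classes (zero denominators).

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=30, fp=10, tn=30, fn=10)
print(metrics)
```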

3.2. Peer Machine Learning Models

To compare our multichannel DANN with existing machine learning models, we also implemented random forest (RF), support vector machine (SVM) models, and multichannel DNN. Each model was designed to take multimodality data as inputs.

3.2.1. Random Forest (RF)

RF is a classic ensemble learning method that learns multiple decision trees to improve classification performance and control overfitting. The number of trees in the forest was selected from the empirical values [20, 40, 60, 80, 100]. We set the maximal depth of each tree to 10.

3.2.2. Support Vector Machine (SVM)

An SVM model was developed to perform ASD classification using vectorized FC features. We applied a linear kernel and searched the margin penalty over the empirical values [0.2, 0.4, 0.6, 0.8, 1.0].

3.2.3. Deep Neural Networks (DNNs)

In terms of existing deep learning models, we compared our model with a DNN model developed previously for ASD classification [26]. In brief, the compared DNN is a 5-layer network whose input layer has as many nodes as there are input features, followed by hidden layers of 1024, 512, 128, and 32 nodes and an output layer with two units. A cross entropy loss function was adopted, the learning rate was set to 0.0001, and the model was trained for 10 epochs to ensure convergence.

3.3. Development Environment

The proposed DANN and peer machine learning models were implemented in the Python 3.7 environment. To build the deep learning related models, we applied Keras (2.2.4) package with TensorFlow (1.13.1) backend. For the traditional models, we adopted the models from Sklearn 0.20 [39]. Statistical analyses were performed using Matlab 2019b.

All the experiments were conducted on a workstation with 10 cores of Intel Core i9 CPU and 64 GB RAM. Due to the high computation cost of deep learning algorithm, we configured one GPU (Nvidia TITAN Xp, 12 GB RAM) to accelerate the training speed of the models.

4. Results and Discussion

4.1. Performance Comparison on the Whole ABIDE Dataset

We first compared the ASD classification performance of the proposed multichannel DANN model with that of multiple peer machine learning models, including RF, SVM, and multichannel DNN. The results were calculated based on 50 repeats of 10-fold cross validation experiments using the entire ABIDE dataset. The mean and SD of the performance metrics are listed in Table 2. The proposed multichannel DANN exhibited a significantly higher accuracy than the multichannel DNN, SVM, and RF models. Similarly, the multichannel DANN also had a better F-score than the multichannel DNN, SVM, and RF models. The sensitivity of the multichannel DANN was significantly higher than that of the multichannel DNN, SVM, and RF models. The specificity of the multichannel DANN was significantly higher than that of the SVM and RF models but was not significantly better than that of the multichannel DNN. Since the multichannel DNN had a relatively lower sensitivity (0.673), it achieved the best mean precision in our experiments. No significant difference in precision was found between the multichannel DNN and DANN. The multichannel DANN model still exhibited higher precision than SVM and RF. Overall, the proposed multichannel DANN achieved improved ASD classification accuracy, sensitivity, F-score, and specificity among the compared machine learning models, while the multichannel DNN had the highest precision.

Encouragingly, the proposed multichannel DANN significantly outperformed the multichannel DNN on four of five performance metrics, increasing mean accuracy by 0.025, sensitivity by 0.072, F-score by 0.018, and specificity by 0.017. The precision of the proposed approach was lower than that of the multichannel DNN by 0.01, although this difference was not significant. The attention mechanism in our model, as the name implies, helps the deep learning model choose which features it should pay attention to. Our model allocates attention by adjusting the weights it assigns to individual FC features. This process decides which FC features are more important than others for the ASD classification task. In other words, it optimizes feature selection during the learning of a deep learning model. The improved performance of DANN over DNN demonstrates the validity of the attention mechanism. The results in Table 2 also show that the multichannel DANN achieved significantly improved performance compared with the traditional SVM and RF models. This is consistent with multiple previous ASD classification studies [26, 27]. The improvement was likely due to a combination of the attention mechanism and the superior capability of deep learning models on complex data patterns, such as FC features.

4.2. Leave-One-Site-Out Cross Validation of Multichannel DANN

To test the generalizability of the proposed model on unseen data from different data sites, we performed a leave-one-site-out cross validation. Similar to k-fold cross validation, we reserved data from one data site as testing data and trained our model using all data from the remaining 11 data sites. However, because the training data were the same across all repeats, the performance varied much less than in k-fold cross validation. Table 3 shows the classification performance of our model and the number of subjects for each data site.

In the NYU data site, which contains the largest sample size, our model achieved an accuracy of , sensitivity of 0.720 ± 0.086, precision of 0.758 ± 0.127, F-score of 0.738 ± 0.069, and specificity of 0.689 ± 0.072. When examining data sites with more than 40 subjects, we found that our model achieved the highest accuracy (0.803 ± 0.045) on the USM site and the best F-score (0.745 ± 0.052) on the UCLA site. These two sites contain nearly 100 subjects, so the results are very informative. We also noted that the lowest accuracy was 0.684 ± 0.026, from the UM site, suggesting that the data from this site may vary differently from those of other sites. Overall, our model reached a mean accuracy of 0.713 ± 0.022 and a mean F-score of 0.707 ± 0.043. Both were significantly lower than the accuracy and F-score from the cross validation results in Table 2, indicating a large data variability among different data sites.

4.3. Robustness of Multichannel DANN on Varying Data Split Schemes

Next, the robustness of our DANN was further tested using k-fold cross validation with varying k. A classification model that is not robust may perform very differently with different k. Figure 4 plots the accuracy, sensitivity, precision, F-score, and specificity of the proposed DANN across the k-fold cross validation strategies. Using one-way ANOVA, the proposed DANN exhibited no significantly different performance across the varying k-fold experiments, indicating the robustness of the proposed multichannel DANN model.
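The one-way ANOVA comparison across k-fold settings reduces to an F statistic, the ratio of between-group to within-group variance. The sketch below computes only this statistic on made-up accuracy values; the groups and numbers are hypothetical and are not results from this study. Obtaining the p value additionally requires the F distribution, which a statistics package (for example, `scipy.stats.f_oneway`) provides.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of samples around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Made-up accuracies from three hypothetical k-fold settings, three repeats each.
accuracy_by_k = [
    [0.72, 0.73, 0.74],
    [0.73, 0.74, 0.75],
    [0.74, 0.75, 0.76],
]
f_stat = one_way_anova_f(accuracy_by_k)
print(round(f_stat, 6))
```

A small F statistic relative to the critical value for the given degrees of freedom indicates that performance does not differ significantly across the k settings.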

4.4. Impact of Data Modality on the Classification Performance

Finally, we tested the performance of the multichannel DANN when different data modalities are used for ASD classification. All results were based on 50 repeats of 10-fold cross validation experiments. Table 4 lists the performance of the multichannel DANN on varying combinations of FC data (marked as AAL, HO, and CC200) and PC data (marked as Demo). The upper part of Table 4 contains results based on both FC and PC data, while the lower part of the table focuses on FC data only. The combined FC and PC data (AAL + HO + CC + Demo) yielded better accuracy, sensitivity, and specificity than FC data alone (AAL + HO + CC), while no significant differences were observed in precision and F-score. This demonstrates the predictive power of PC data.

Without PC data, our model achieved the highest performance by combining FC from all three brain atlases. This suggests that brain connectivity data from different atlases may carry complementary information that assists ASD classification. Interestingly, the model using CC200 FC data (marked as CC in the table) performed better than the models using FC data derived from AAL and HO. This is likely because the CC200 atlas is constructed from rs-fMRI data and thus represents a functional parcellation of the brain.

5. Conclusion

In summary, we developed a multichannel DANN model by applying state-of-the-art attention mechanism-based deep learning techniques to the automated diagnosis of ASD. The k-fold cross validation experiments showed that our multichannel DANN achieved an accuracy of 0.732, outperforming multiple peer machine learning models. The results of the leave-one-site-out cross validation experiments showed promise for applying our model to clinical data with unseen variations. The experiments using varying combinations of data modalities demonstrated the discriminative power of individual data modalities such as the brain functional connectome and PC data. This suggests a future direction of combining additional data modalities to move machine learning applications towards clinical use in ASD computer-aided diagnosis tools. One limitation of the current work is that the selected cohort consists of adolescents and young adults, whereas the ASD diagnosis is typically performed much earlier, which limits the generalizability of the model. In future work, we will retrain the model with additional data from a population with a wider age range.

Data Availability

The dataset used to support the findings of this study is available at http://fcon_1000.projects.nitrc.org/indi/abide/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


This work was supported in part by the Beijing Education Commission Research Project of China under grant no. KM201911232004, National Natural Science Foundation of China under grant no. 61672105, and National Key Research and Development Program of China under grant no. 2018YFB1004100.