Computational Intelligence and Neuroscience

Volume 2018, Article ID 7186762, 16 pages

https://doi.org/10.1155/2018/7186762

## A Novel Deep Learning Approach for Recognizing Stereotypical Motor Movements within and across Subjects on the Autism Spectrum Disorder

Faculty of Science and Technology, University Hassan 1st, Settat, Morocco

Correspondence should be addressed to Lamyaa Sadouk; lamyaa.sadouk@gmail.com

Received 28 March 2018; Accepted 10 June 2018; Published 10 July 2018

Academic Editor: Amparo Alonso-Betanzos

Copyright © 2018 Lamyaa Sadouk et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by persistent difficulties, including repetitive patterns of behavior known as stereotypical motor movements (SMM). Several techniques have been implemented to track and identify SMMs. In this context, we propose a deep learning approach for SMM recognition, namely, convolutional neural networks (CNN) in the time and frequency domains. To address intrasubject SMM variability, we propose a robust CNN model for SMM detection within subjects, whose parameters are set according to a proper analysis of SMM signals, thereby outperforming state-of-the-art SMM classification works. To address intersubject variability, we propose a global, fast, and lightweight framework for SMM detection across subjects which combines a knowledge transfer technique with an SVM classifier, thereby resolving the “real-life” medical issue associated with the lack of supervised SMMs per testing subject. We further show that applying transfer learning across domains, instead of within the same domain, also generalizes to the SMM target domain, thus alleviating the problem of the lack of supervised SMMs in general.

#### 1. Introduction

Autism Spectrum Disorder (ASD) refers to a spectrum of disorders with a range of manifestations that can occur to different degrees and in a variety of forms [1]. Children with ASD have impairments in social interaction and communication as well as atypical behaviors that include repetitive behaviors known as stereotypical motor movements (SMM). The most prevalent SMMs include repetitive body rocking, mouthing, and complex hand and finger movements [2], which interfere tremendously with learning and social interactions, thus reducing school and community integration [3, 4]. Inasmuch as they are largely resistant to psychotropic drugs, decreasing or eliminating SMMs is the goal of many behavioral interventions in autism. And the earlier the age at which intervention can be started, the better learning and daily function can be facilitated [5, 6].

To address this challenge, the role of sensing technologies becomes critical in identifying SMMs throughout the screening and therapy of ASD, thus potentially improving the lives of those in the spectrum. Indeed, reliably and efficiently detecting and monitoring SMMs over time could be beneficial not only for recognizing autistic children but also for understanding and intervening upon a core symptom of ASD. Efficient and accurate monitoring of SMM is needed to (i) continuously evaluate which therapies and/or drugs are required over time and (ii) identify mechanisms that are responsible for triggering an SMM, such as physiological, affective, or environmental factors. Such monitoring can help therapists reduce the frequency of SMMs and gradually lower their duration and severity [4].

The main challenge for accurate SMM recognition is to find the most relevant features that characterize stereotypical behaviors from accelerometer signals through an automatic feature extraction technique. Another challenge is personalization, caused by intra- and intersubject variability [7, 8]. Indeed, intrasubject variances can be explained by variations in the intensity, duration, and frequency of SMMs within each atypical subject (subject on the autism spectrum), while intersubject variances are due to differences in SMMs across different atypical subjects. Hence, an adaptive approach is needed to generalize over all SMMs within and across subjects and to adjust to new SMM behaviors. These challenges can be addressed using deep learning techniques for SMM recognition. Our contributions are the following: (i) for SMM detection within subjects, we train two CNN models in the time and frequency domains whose parameters are chosen based on properties of the input SMM signals; (ii) for SMM detection across subjects, we build a global, fast, and lightweight platform combining a knowledge transfer technique with an SVM classifier, which provides promising results by adapting to the SMMs of any new atypical subject. One advantage of this platform is that it uses only a few labeled observations of this atypical subject together with a large amount of data from other atypical subjects, thus resolving the “real-life” medical issue associated with the lack of supervised SMMs per atypical subject; (iii) applying cross-domain transfer learning (i.e., from a source domain different from the SMM target domain) instead of within-domain transfer learning provides satisfying SMM recognition performance, especially in the time-domain, and thus adjusts to the stereotypical behavior patterns of any new atypical subject with only a few of their labeled observations.

In the section below, we review related works of SMM recognition (Section 2). Then, we describe our methodology, namely, our proposed deep learning model for SMM recognition within subjects as well as our scalable platform for SMM recognition across subjects (Section 3). In Section 4, a description of the datasets used is given, the data preprocessing is detailed, and the CNN architecture structure is defined. In Section 5, experiments are explained and carried out, and corresponding results are laid out. Finally, Section 6 concludes our work.

#### 2. Related Work

Several studies have been conducted in order to detect SMM behaviors in individuals with ASD. Traditional methods of SMM assessment relied primarily on paper-and-pencil rating scales, direct behavioral observation, and video-based coding, all of which have limitations. While paper-and-pencil rating scales and direct behavioral observation may yield unreliable and invalid measures, video-based methods provide more accurate and reliable measures but are time consuming because experts have to view videos repeatedly at slow playback speeds.

Automatic approaches that are more time efficient and accurate have been proposed. These approaches rely on wireless accelerometers and pattern recognition algorithms. Existing approaches to automated monitoring of SMM are based on either webcams or accelerometers. In a series of publications [9–11], Gonçalves et al. made use of the Kinect sensor (from Microsoft) to detect only one type of SMM, namely, hand flapping. However, the Kinect sensor is limited to monitoring within a confined space and requires users to be in close proximity to the sensor.

Alternative approaches to the Kinect are based on the use of one or more wearable 3-axis accelerometers (located on the wrist or torso, for instance). Using three 3-axis accelerometers sampling at 100 Hz, Westeyn et al. [12] recognized 69% of hand flapping events using Hidden Markov Models; however, their data were acquired from healthy individuals mimicking the behaviors. A similar work that did not generalize to the ASD population was the study of Plötz et al., 2012 [13]. Meanwhile, Min et al. [14–17] collected a total of 40 hours of 3-axis acceleration data from the wrists and torso of four individuals with autism. Using a variety of features and semisupervised classification approaches (Orthogonal Matching Pursuit, Linear Predictive Coding, all-pole Autoregressive model, Higher Order Statistics, Ordinary Least Squares, K-SVD algorithm), they achieved peak accuracy rates of 86% for hand flapping and 95% for body rocking; however, true positive and false positive rates were not adequately described. Goodwin et al. [7, 18–20] collected 3-axis acceleration data from the wrists and torso of six individuals with autism repeatedly observed in both laboratory and classroom settings. By feeding combined time and frequency-domain features (distances between mean values along accelerometer axes, variance along axis directions, correlation coefficients, entropy, Fast Fourier Transform (FFT) peaks, and Stockwell transform) to C4.5 decision tree, SVM, and DT classifiers, they were able to achieve an overall accuracy of 81.3% on SMM detection. More recently, Großekathöfer et al. [8] applied Goodwin’s benchmark and introduced a new set of features based on recurrence plots and quantification analysis along with Decision Tree and Random Forest classifiers, obtaining an accuracy slightly better than [7].

Nonetheless, the features of all these previous publications have several limitations: (i) they mainly aim at characterizing oscillatory features of SMMs, such as statistical characteristics, joint relations of changes in different axial directions, or frequency components of oscillatory movements. Therefore, these features cannot capture dynamics of SMMs that change over time, namely, when they do not follow a consistent oscillatory pattern or when patterns differ in frequency, duration, speed, and amplitude; (ii) each study uses a sensor type different from the others, with different sensor orientations, resulting in features with different values even though they characterize the same SMM behavior; (iii) these hand-crafted features depend heavily on parameters of random combinations of features and models selected by experts and researchers, which is computationally expensive and time consuming.

To overcome these limitations, Rad et al. [21] proposed features learned by deep convolutional neural networks, which outperformed traditional learning methods in SMM detection within subjects (i.e., within each subject independently). However, because of intersubject variability (i.e., pattern variations across subjects), the models of this deep learning work and all previous works are built for each and every subject based on a large dataset of labeled observations per subject. So, up to now, there is no general system that adapts to the SMMs of all subjects. Accordingly, we propose a deep learning system that not only learns better discriminating features for SMM pattern recognition but also is robust to intersubject variability and adapts rapidly to SMMs across subjects with only a few labeled observations per subject. We hypothesize that feature learning using deep models such as CNNs, together with the knowledge transfer capabilities of these models, provides more accurate SMM detectors, as well as a platform for tracking not only SMMs of different recording sessions within the same atypical subject but also SMMs of different atypical subjects.

#### 3. Methodology

In this section, we introduce the key components for training our deep learning models. First, we define the SMM detection problem and describe how SMM data are extracted and converted into input signals (Section 3.1). Next, we present background on our deep learning model, namely, the CNN (Section 3.2). Section 3.3 explains how the optimal parameters of the CNN models are chosen based on an analysis of the input signals. Finally, Section 3.4 defines the global lightweight platform that adapts to the SMMs of any atypical subject.

##### 3.1. The SMM Detection Problem and Data Extraction

The SMM dataset consists of time-series data composed of multiple channels $D$, i.e., the $x$, $y$, and $z$ coordinate measurements recorded from multiple sensors/devices. The first step is to convert these $D$-channel raw data into multiple fixed-length signals in both the time and frequency domains, which will be denoted as frames. The second step is to normalize the latter before they are fed into our deep learning system.

###### 3.1.1. Extraction of Dataset Signals

In order to extract *time-domain* frames, the raw data, which represent $D$ time-series of length $n$ each (one time-series per channel), are segmented with a fixed-length sliding window of $L$ time-points and with a high overlap rate $r$ between consecutive data segments, which means that the sliding window moves with time steps of $s$, where $s = (1 - r)\,L$. This results in $N$ frames (i.e., samples), where $N = \lfloor (n - L)/s \rfloor + 1$, and each frame is an $L$ time-point sample of size $L \times D$, $D$ referring to the number of channels. Thus the extracted data form an $N \times L \times D$ matrix.
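The segmentation step can be sketched in NumPy as follows; the function name `extract_frames` and the example values (window length 100, 90% overlap) are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def extract_frames(data, L, r):
    """Segment D-channel time-series data (shape n x D) into overlapping
    frames of L time-points, with overlap rate r between consecutive
    frames (the window step is s = (1 - r) * L)."""
    n, D = data.shape
    s = max(1, int(round((1 - r) * L)))            # sliding-window step
    starts = range(0, n - L + 1, s)
    return np.stack([data[t:t + L] for t in starts])  # shape (N, L, D)

# Example: 1000 time-points of 3-channel (x, y, z) accelerometer data,
# 100-point windows with 90% overlap -> step of 10 time-points.
raw = np.random.randn(1000, 3)
frames = extract_frames(raw, L=100, r=0.9)
print(frames.shape)  # (91, 100, 3)
```

With n = 1000, L = 100, and s = 10, the formula above gives N = (1000 - 100)/10 + 1 = 91 frames.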

In order to extract* frequency-domain* frames, time-series data are converted into frequency signals using Stockwell transform (ST) [22, 23] instead of the Fast Fourier Transform (FFT) which is restricted to a predefined fixed window length. Unlike FFT, ST adaptively captures spectral changes over time without windowing of data and therefore results in a better time-frequency resolution for nonstationary signals [22].

Let $x[k]$, $k = 0, 1, \ldots, N-1$, be the samples of the continuous signal $x(t)$, where $T$ is the sampling interval (i.e., the sampling interval of our sensor or measuring device). The Discrete Fourier Transform (DFT) is given by

$$X[m] = \frac{1}{N} \sum_{k=0}^{N-1} x[k]\, e^{-i 2\pi m k / N},$$

where the discrete frequency index $m = 0, 1, \ldots, N-1$.

The Discrete Stockwell Transform (DST) is obtained using the following formula:

$$S[j, n] = \sum_{m=0}^{N-1} X[m+n]\, G(m, n)\, e^{i 2\pi m j / N},$$

where $j$ is the index for time translation and $n$ is the index for frequency shift. The function $G(m, n) = e^{-2\pi^2 m^2 / n^2}$ is the Gaussian window in the frequency-domain.

For a signal of length $N$, the numerical implementation of the DST can be summarized as follows:

1. Apply an $N$-point DFT to calculate the Fourier spectrum $X[m]$ of the signal $x[k]$.
2. Multiply the shifted spectrum $X[m+n]$ with the Gaussian window function $G(m, n)$.
3. For each fixed frequency shift $n = 1, \ldots, M$ (where $M$ is the number of frequency steps desired), apply an $N$-point inverse DFT to the product in order to calculate the DST coefficients $S[j, n]$, where $j = 0, 1, \ldots, N-1$.

Note that there is a DST coefficient calculated for every pair $(j, n)$ in the time-frequency-domain. Therefore, the result of the DST is a complex matrix of size $M \times N$, where the rows represent the frequencies for every frequency shift index $n$ and the columns are the time values for every time translation index $j$; i.e., each column is the “local spectrum” for that point in time.

Knowing that our time-series data are composed of multiple channels $D$, we end up with $D$ DST matrices, which can be represented as a single $M \times N \times D$ matrix.
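A minimal NumPy sketch of the three DST steps above, for a single channel, could look like the following; the function name is an assumption, and for simplicity the Gaussian window is not symmetrized over wrapped (negative) frequencies as some production implementations do:

```python
import numpy as np

def stockwell_transform(x):
    """Discrete Stockwell Transform of a 1-D signal x of length N.
    Returns a complex matrix S of shape (M, N): rows index frequency
    shifts n = 1..M, columns index time translations j."""
    N = len(x)
    X = np.fft.fft(x) / N                 # step 1: N-point DFT (1/N convention)
    M = N // 2                            # number of positive-frequency shifts
    S = np.zeros((M, N), dtype=complex)
    m = np.arange(N)
    for n in range(1, M + 1):
        G = np.exp(-2 * np.pi**2 * m**2 / n**2)  # step 2: Gaussian window
        Xs = np.roll(X, -n)                      # shifted spectrum X[m + n]
        S[n - 1] = np.fft.ifft(Xs * G) * N       # step 3: N-point inverse DFT
    return S

sig = np.sin(2 * np.pi * 5 * np.arange(128) / 128)  # exactly 5 cycles
S = stockwell_transform(sig)
print(S.shape)  # (64, 128)
```

For this pure 5-cycle sinusoid, the row of largest magnitude is the one for frequency shift n = 5, as expected for a time-frequency representation.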

###### 3.1.2. Normalization

Since each of the data instances has $D$ channels, each channel has to be normalized separately. A channel-wise normalization is performed to scale all values to zero mean and unit variance according to the formula $x' = (x - \mu)/\sigma$, where $x$ is the data in a specific channel, $\mu$ the mean of that channel, and $\sigma$ its standard deviation.
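This channel-wise z-score normalization can be sketched as follows; note that the paper does not specify whether the mean and standard deviation are computed per frame or over the whole frame set, so the version below assumes dataset-wide per-channel statistics:

```python
import numpy as np

def normalize_channels(frames, eps=1e-8):
    """Channel-wise z-score normalization of frames of shape (N, L, D):
    each of the D channels is scaled to zero mean and unit variance
    using its own mean and standard deviation."""
    mu = frames.mean(axis=(0, 1), keepdims=True)    # per-channel mean
    sigma = frames.std(axis=(0, 1), keepdims=True)  # per-channel std
    return (frames - mu) / (sigma + eps)            # x' = (x - mu) / sigma

frames = np.random.randn(91, 100, 3) * 5.0 + 2.0    # N x L x D frames
norm = normalize_channels(frames)
print(norm.mean(axis=(0, 1)))  # ~ [0, 0, 0]
```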

##### 3.2. Research Background on Convolutional Neural Networks

Given an input $X$, which is an $L \times D$ matrix representing one acceleration signal at time $t$, our CNN model will predict the probability of $X$ being an SMM, $P(\mathrm{SMM} \mid X)$, corresponding to an SMM sample, and the probability of $X$ being a non-SMM, $P(\mathrm{non\text{-}SMM} \mid X)$.

CNN is regarded as a specialized type of neural networks which updates weights at each layer of the visual hierarchy during training via a back propagation mechanism [24]. CNN benefits from invariant local receptive fields, shared weights, and spatiotemporal subsampling features to provide robustness over shifts and distortions in the input space [25]. A classic CNN has a hierarchical architecture that alternates between convolutional and pooling layers in order to summarize large input spaces with spatiotemporal relations into a lower dimensional feature.

*Convolution*. Each convolutional layer performs a 1D convolution on its input maps followed by an activation function, generally a rectified linear unit (ReLU), to add nonlinearity to the network and also to avoid the vanishing gradient problem. The output feature maps are generated for each convolution layer, and the resultant value of the unit at position $x$ in the $j$th feature map of the $i$th layer, denoted $v_{ij}^{x}$, is given by

$$v_{ij}^{x} = f\left(b_{ij} + \sum_{m} \sum_{k=0}^{K-1} w_{ijm}^{k}\, v_{(i-1)m}^{x+k}\right),$$

where $f$ is the activation function, $b_{ij}$ is the bias term for the $j$th feature map of the $i$th convolutional layer, $m$ indexes over the set of feature maps in the $(i-1)$th layer connected to the current feature map, $k$ is the kernel index, $K$ is the filter size (width of the convolutional kernel), and $w_{ijm}^{k}$ is the weight at position $k$ of the kernel connected to the $m$th feature map.
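A naive NumPy sketch of one such 1D convolution layer with ReLU is shown below; the function name and the example sizes (8 feature maps, kernel width 9) are illustrative, not the architecture settings of the paper:

```python
import numpy as np

def conv1d_relu(x, W, b):
    """Valid 1-D convolution over a multi-channel signal, then ReLU.
    x: (L, D) input frame; W: (J, K, D) kernels (J feature maps of
    width K over D channels); b: (J,) biases.
    Returns feature maps of shape (L - K + 1, J)."""
    L, D = x.shape
    J, K, _ = W.shape
    out = np.zeros((L - K + 1, J))
    for j in range(J):                       # each output feature map
        for p in range(L - K + 1):           # each output position
            out[p, j] = np.sum(W[j] * x[p:p + K]) + b[j]
    return np.maximum(out, 0.0)              # ReLU activation

x = np.random.randn(100, 3)          # one 100-point, 3-channel frame
W = np.random.randn(8, 9, 3) * 0.1   # 8 feature maps, kernel width 9
b = np.zeros(8)
fmap = conv1d_relu(x, W, b)
print(fmap.shape)  # (92, 8)
```

The double loop mirrors the equation directly; a real implementation would rely on a vectorized or library-provided convolution instead.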

*Max-Pooling*. In order to reduce the sensitivity of the output to shifts and distortions, $v_{ij}$ is passed through a pooling layer which performs a subsampling over a pooling window of $R$ elements and calculates the reduced feature map. The output feature map is achieved by computing maximum values over nearby inputs of the feature map:

$$p_{ij}^{x} = \max_{0 \le r < R} v_{ij}^{x \cdot S + r},$$

where $R$ is the pooling size and $S$ is the pooling stride.
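The pooling operation can be sketched as follows (function name and example sizes are illustrative):

```python
import numpy as np

def max_pool1d(fmap, R, S):
    """Max-pooling over windows of R elements with stride S, applied
    independently to each feature map (column) of fmap (shape T x J)."""
    T, J = fmap.shape
    n_out = (T - R) // S + 1
    out = np.empty((n_out, J))
    for p in range(n_out):
        out[p] = fmap[p * S:p * S + R].max(axis=0)  # max over the window
    return out

fmap = np.arange(12, dtype=float).reshape(12, 1)  # one rising feature map
pooled = max_pool1d(fmap, R=3, S=3)
print(pooled.ravel())  # [ 2.  5.  8. 11.]
```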

*Output*. The output generated by stacking several convolutional and pooling layers is flattened into a feature vector $h$, which is then fed into a fully connected layer (F1) and an activation layer to produce the following output:

$$o_{j} = f\left(b_{j} + \sum_{i} w_{ij}\, h_{i}\right),$$

where $f$ is the activation function, $w_{ij}$ is the weight connecting node $i$ on layer $l-1$ and node $j$ on layer $l$, and $b_{j}$ is the bias term.

Then a dropout layer is added to prevent the neural network from overfitting. Figure 1 shows the fully connected neural network with dropout (F1).
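The fully connected layer with dropout can be sketched as below; this uses the common "inverted dropout" formulation, and the function name, layer sizes, and dropout rate p = 0.5 are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_dropout(h, W, b, p=0.5, train=True):
    """Fully connected layer followed by inverted dropout: during
    training, each activation is zeroed with probability p and the
    survivors are scaled by 1/(1-p), so no rescaling is needed at
    test time (train=False)."""
    o = h @ W + b
    if train:
        mask = rng.random(o.shape) >= p   # keep with probability 1 - p
        o = o * mask / (1.0 - p)
    return o

h = np.random.randn(1, 256)              # flattened feature vector
W = np.random.randn(256, 128) * 0.05
b = np.zeros(128)
out = dense_dropout(h, W, b, p=0.5)
print(out.shape)  # (1, 128)
```

At p = 0.5, roughly half of the 128 output units are zeroed on any given training pass.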