Indian Classical Dance Action Identification and Classification with Convolutional Neural Networks
Extracting and recognizing complex human movements from unconstrained online and offline video sequences is a challenging task in computer vision. This paper proposes the classification of Indian classical dance actions using convolutional neural networks (CNNs). Human action recognition is performed on Indian classical dance videos from both offline (controlled recording) and online (live performances, YouTube) sources. The offline data is created with ten different subjects performing 200 familiar dance mudras/poses from different Indian classical dance forms under various background environments. The online dance data is collected from YouTube for ten different subjects. In both cases, each dance pose occupies 60 frames (images) of video. CNN training is performed with 8 different sample sizes, each consisting of multiple sets of subjects. The remaining 2 samples are used for testing the trained CNN. Different CNN architectures were designed and tested with our data to obtain better recognition accuracy. We achieved a recognition rate of 93.33%, which is higher than that of the other classifier models reported on the same dataset.
1. Introduction
Automatic human action recognition is a complicated problem in computer vision that involves mining and categorizing spatial patterns of human poses in videos. Human action is defined as a temporal variation of the human body. The last decade has seen a jump in online video creation and, with it, a need for algorithms that can search within a video sequence for a specific human pose or object of interest. The problem is to extract and identify a human pose and classify it into labels based on trained CNN feature maps. The objective of this work is to extract the feature maps of Indian classical dance poses from both online and offline data.
However, the task is constrained by video resolution, frame rate, background lighting, scene change rate, and blurring, to name a few. Analysis of online content is complicated because most users upload poor-quality videos, in which all of these constraints hinder the automation of video object segmentation and classification. Online dance video sequences pose even more constraints on the smooth extraction of human dance features. Automatic dance motion extraction is complicated by complex poses and actions performed at different speeds in sync with music or vocal sounds. Figure 1 shows a set of online and offline (lab captured) Indian classical dance video frames for training and testing the proposed CNN algorithm.
The offline dataset consists of 200 Indian classical dance mudras/poses performed by 10 native classical dancers (i.e., 10 sets) and recorded at 30 frames per second (fps). Training is initiated with three different batch sizes. Batch I uses only one set, that is, 200 poses performed by 1 dancer for 2 seconds each at 30 fps, for a total of $200 \times 2 \times 30 = 12000$ dance pose images. Batch II uses 5 sets, that is, a total of $5 \times 12000 = 60000$ dance pose images. Batch III uses 8 sets of dance pose images. The trained CNNs are tested with two discrete video sets having different dance performers and varying backgrounds. Robustness testing is performed in two cases: in Case I, the dataset already used for training is used for testing, and in Case II, a different dataset is used. The same training and testing procedure is applied to the online data. Figure 1 shows the sample database created for this work. The performance of the CNN algorithms is measured for both online and offline data based on their recall accuracy and recognition rates.
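The frame counts for each training batch follow directly from the recording setup. As a quick check, the short Python sketch below recomputes them; the names `POSES`, `SECONDS_PER_POSE`, `FPS`, and `frames_in_sets` are illustrative only and are not taken from the authors' code.

```python
# Recording setup as stated in the text.
POSES = 200            # dance poses performed by each dancer (one "set")
SECONDS_PER_POSE = 2   # each pose is held for 2 seconds
FPS = 30               # frames per second


def frames_in_sets(num_sets: int) -> int:
    """Return the number of pose images contributed by `num_sets` dancers."""
    return num_sets * POSES * SECONDS_PER_POSE * FPS


# Number of sets used in each training batch, as described in the text.
sets_per_batch = {"Batch I": 1, "Batch II": 5, "Batch III": 8}
batch_frames = {name: frames_in_sets(n) for name, n in sets_per_batch.items()}

for name, count in batch_frames.items():
    print(f"{name}: {count} frames")
print(f"Full offline dataset (10 sets): {frames_in_sets(10)} frames")
```

The same helper gives the size of the full offline dataset when called with all 10 sets.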
2. Literature Review
Local information about the human in the video is a popular feature for action identification and classification in recent times. This section summarizes current trends in human action recognition and how they have been used in recent works for classifying dance performances. Human action recognition is subdivided into video object extraction, feature representation, and pattern classification [1, 2]. Based on these models, numerous visual representations have been proposed for discriminating human action, based on shape templates in space-time, shape matching, interest points in 2D space-time models, and motion trajectories. Dense trajectory based methods have shown good results for action recognition by tracking sampled points through optical flow fields. Optical flow fields rely on assumptions about brightness and object motion in a video, namely, uniform brightness variations and object motions in consecutive frames. Videos with such uniform brightness are found mostly in movies or lab setups. Hence, these approaches need to be made more robust for estimating human actions. Data driven methods with multiple feature fusion and artificial intelligence models are currently being explored with the increase in computing power.
Indian classical dance forms have been practiced for 5000 years worldwide. However, it is difficult for a dance lover to fully grasp the content of a performance, as it is made up of hand poses, body poses, leg movements, hand positions with respect to the face and torso, and finally facial expressions. All these movements should synchronize precisely with both the vocal song and the accompanying music of the various instruments. Apart from these complications, the dancer wears elaborate dresses and makeup, and at times during a performance the background changes depending on the story, which truly makes this an open-ended problem. Mohanty et al. highlight the difficulties of state-of-the-art algorithms such as skeleton estimation and pose estimation, which fail to track the dancer's moves in both offline and online videos. Samanta et al. used histogram of oriented optical flow (HOOF) features with sparse representations. In our previous work, we approached the problem with an SVM classifier on dance videos and found that only multiclass SVMs should be considered. The average recognition rates obtained with Adaboost, artificial neural networks (ANN), deep ANN, and adaptive graph matching (AGM) on dance data are not satisfactory.
In recent research, deep learning has proven well suited to object recognition. CNNs are powerful in solving most computer vision tasks [18–22], such as object recognition and classification. Classifying a huge dataset at a fast rate is a complicated problem. Using deep hidden layers, a CNN extracts image information without expert knowledge and avoids the process of complex feature extraction. The first deep CNN was introduced by LeCun et al. with two convolutional layers. As image datasets became large in scale, wider and deeper CNNs were required, such as the network trained on ImageNet. By increasing the depth of the network further, Simonyan and Zisserman proposed VGGNet.
Ng et al. have performed fundamental research on CNNs to improve the performance of CNN algorithms and optimize their structure [28–31]. LeCun et al. highlighted that deep CNNs are a breakthrough in image, video, audio, and speech processing. So far, no extensive research has explored deep CNNs for Indian classical dance (ICD) classification. The aim of this paper is to bring out the performance of CNNs in recognizing the mudras/poses of ICD.
Deep learning methods can also be used in evaluating the ICD classification system. However, there is still considerable room for improvement. Deep CNNs are suitable for solving complex problems with a huge quantity of data. For example, classification accuracy has improved on the ImageNet dataset, which has 1.2 million images covering about 1000 categories. In such cases, we need to consider how to take advantage of CNNs.
With convolutional neural networks, we need to consider how to design and train a network that adapts to various objects. The major problem to be solved concerns the quality and sizes of the images. Unbalanced amounts of low and high quality images in the dataset lead to unbalanced classification.
The motivation for implementing the deep CNN model for Indian classical dance identification is that feature learning in CNNs is highly automated from the input images, avoiding the complexity of extracting the various features required by traditional classifiers for dance pose recognition. Through the deep architecture, the learned features are regarded as a higher level abstract representation of the low level dance pose images. Hence, we develop the deep CNN model for Indian classical dance identification in this paper.
In this paper, a novel CNN based Indian classical dance identification method is proposed to achieve higher recognition rates. Different CNN architectures are implemented and tested on our dance data to bring out the best architecture for recognition. Three different pooling techniques, namely, mean pooling, max pooling, and stochastic pooling, are implemented and stochastic pooling was found to be the best for our case. To prove the capability of CNN in recognition, the results are compared with the other traditional state-of-the-art techniques SVM, AGM, Adaboost, ANN, and deep ANN.
The rest of the paper is organized as follows. Section 3 describes the proposed CNN architecture. Section 4 discusses the results obtained in different cases. Finally, Section 5 concludes the paper.
3. System Architecture
We designed our multistage CNN model following [21, 25]. The model is constructed with an input layer, four convolutional layers, five rectified linear units (ReLU), two stochastic pooling layers, one dense layer, and one SoftMax output layer. Figure 2 shows the proposed system architecture.
The proposed CNN architecture uses four convolutional layers with different window sizes, each followed by an activation function and a rectified linear unit for nonlinearity. The convolutional window size differs in each of layers 1 to 4. Three pooling strategies were tested, namely, mean pooling, max pooling, and stochastic pooling, and stochastic pooling was found to be the most suitable for our application. Feature representation is done with two layers of stochastic pooling; only two pooling layers are used to avoid substantial information loss in the feature representation. The classification stage is implemented with dense (fully connected) layers followed by activation functions. SoftMax regression is adopted for classification.
Indian classical dance video frames are taken as input to the system. As a first step, the frames are preprocessed by resizing them to a smaller size. Resizing the input video frames reduces the computational load on the high-performance computing (HPC) machine on which the program is implemented. The HPC machine used for training the CNN is a 6-node combined CPU-GPU processing machine.
Let us assume an input video frame of size $W \times W$. A convolutional kernel of size $F \times F$ is convolved with the frame using a stride of $S$ and padding $P$ to fill the input video frame boundary. The size of the output of the convolution layer is given by
$$O = \frac{W - F + 2P}{S} + 1.$$
The architecture of our CNN model consists of four convolutional layers. While the first two layers extract low level features (like lines, corners, and edges), the last two layers learn high level features. The detailed layer information and the output sizes with parameters are tabulated in Table 1.
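The output size of a convolution layer can be checked numerically. The sketch below assumes the standard relation, output side = floor((W − F + 2P) / S) + 1, for a square input of side W, a square kernel of side F, stride S, and zero padding P, and confirms it against a naive sliding-window convolution. The 32×32 frame and 5×5 kernel are illustrative values only, not the sizes used in the paper, and the function names are hypothetical.

```python
import numpy as np


def conv_output_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Side length of the output map: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1


def naive_conv2d(frame: np.ndarray, kernel: np.ndarray,
                 stride: int = 1, pad: int = 0) -> np.ndarray:
    """Single-channel sliding-window convolution (cross-correlation form)."""
    padded = np.pad(frame, pad)          # zero padding on every border
    f = kernel.shape[0]
    out = conv_output_size(frame.shape[0], f, stride, pad)
    result = np.empty((out, out))
    for y in range(out):
        for x in range(out):
            ys, xs = y * stride, x * stride
            result[y, x] = np.sum(padded[ys:ys + f, xs:xs + f] * kernel)
    return result


# Illustrative sizes (assumptions, not the paper's values).
frame = np.random.default_rng(0).random((32, 32))
kernel = np.ones((5, 5))
feature_map = naive_conv2d(frame, kernel)
print(feature_map.shape, conv_output_size(32, 5))
```

Changing `stride` or `pad` in the call to `naive_conv2d` changes the output shape in step with `conv_output_size`, which is a convenient way to sanity-check layer dimensions before building a network.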
The output of a convolutional layer is generally denoted with the following standard equation:
$$x_j^{\ell} = f\Big(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\Big),$$
where $\ell$ represents the $\ell$th layer, $k_{ij}^{\ell}$ is the convolutional kernel, $b_j^{\ell}$ represents the bias, and the input maps are represented by $M_j$. The CNN uses a tanh activation function with an additive bias, formulated as
$$v_{\ell j}^{xy} = \tanh\Big(b_{\ell j} + \sum_{i \in M_j}\sum_{p=0}^{P_{\ell}-1}\sum_{q=0}^{Q_{\ell}-1} w_{\ell j i}^{pq}\, v_{(\ell-1) i}^{(x+p)(y+q)}\Big),$$
where $v_{\ell j}^{xy}$ is the value at position $(x, y)$ of the $j$th feature map in the $\ell$th layer, $b_{\ell j}$ represents the feature map bias, which is trained in a non-supervised manner, and $P_{\ell}$, $Q_{\ell}$ are the kernel width and height, respectively. $w_{\ell j i}^{pq}$ is the weight of the kernel at position $(p, q)$. A pooling technique obtains the max value of a feature over a region, which reduces the data variance. We implemented our architecture with the stochastic pooling technique, which calculates probability values for each region. For every feature map, the probability is computed from the neuron activation at each point in spatial coordinates and the weighting function of the window. Compared with other pooling techniques, stochastic pooling makes the CNN converge faster and improves its ability to generalize when processing invariant features.
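To make the pooling step concrete, the sketch below implements stochastic pooling in its commonly used form: during training, one activation is sampled from each pooling region with probability proportional to its value, and at test time the probability-weighted average of the region is used instead. This is an assumption about the general technique rather than the authors' exact formulation, which also involves a window weighting function. The function name `stochastic_pool`, the 2×2 region size, and the requirement of non-negative activations (for example, after a ReLU) are illustrative choices.

```python
import numpy as np


def stochastic_pool(fmap, size=2, rng=None, train=True):
    """Stochastic pooling over non-overlapping size x size regions.

    Assumes non-negative activations, so that a / sum(a) is a valid
    probability distribution within each region.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = fmap[i * size:(i + 1) * size,
                          j * size:(j + 1) * size].ravel()
            total = region.sum()
            if total == 0:
                continue                  # all-zero region pools to zero
            probs = region / total        # p_k = a_k / sum of activations
            if train:
                # Sample one activation according to its probability.
                out[i, j] = rng.choice(region, p=probs)
            else:
                # Probability-weighted average used at test time.
                out[i, j] = np.dot(probs, region)
    return out


demo = np.array([[1.0, 3.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0],
                 [2.0, 2.0, 5.0, 0.0],
                 [2.0, 2.0, 0.0, 5.0]])
pooled_train = stochastic_pool(demo, rng=np.random.default_rng(0))
pooled_eval = stochastic_pool(demo, train=False)
print(pooled_train)
print(pooled_eval)
```

Because large activations are sampled more often but smaller ones are still selected occasionally, the operation behaves like a blend of max pooling and mean pooling, which matches the motivation given in the text.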
Indian classical dance identification is a multiclass classification problem. Hence, a SoftMax regression layer with hypothesis function $h_{\theta}(x)$ is used, whose parameters $\theta$ must be trained so that the cost function $J(\theta)$ is minimized. The probability with which the SoftMax regression layer classifies an input $x$ as category $j$ is given as
$$P(y = j \mid x; \theta) = \frac{e^{\theta_j^{T} x}}{\sum_{k=1}^{K} e^{\theta_k^{T} x}},$$
where $K$ is the number of categories.
The network is trained to learn the features of each dance pose by means of supervised learning. The internal feature representation reflects the likeness among training samples. We outline 200 poses from ICD (offline data) performed by 10 classical dancers. The total dataset comprises 2000 dance pose recordings, each normalized to 2 s (60 frames at 30 fps), forming a total of 120k frames. Similarly, the online dance data is downloaded from YouTube and each pose is normalized to 60 frames. To understand the feature representation learned by the CNN system, the neuron with maximum activation is extracted to recognize the dance pose accurately. Finally, the feature maps were visualized by averaging the image patches with stochastic response in higher layers.
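The SoftMax classification probability used in the output layer can be sketched in a few lines of NumPy. The example assumes the usual definition, in which each class score is exponentiated and normalized by the sum over all classes, and pairs it with a cross-entropy cost, which is the cost commonly minimized in SoftMax regression; the text does not state the exact cost function, so that pairing is an assumption. The function names and the three-class scores are hypothetical.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Class probabilities exp(s_j) / sum_k exp(s_k), computed stably."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)                 # shifting avoids overflow
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_entropy(probs: np.ndarray, label: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return float(-np.log(probs[label]))


# Hypothetical scores for three pose classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
predicted_pose = int(np.argmax(probs))
print(probs, predicted_pose, cross_entropy(probs, label=0))
```

The predicted pose label is the class with the highest probability, and the cost shrinks toward zero as the probability of the correct class approaches one.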
4. Results and Discussion
The main goal of this work is to correctly identify the dance pose from an ICD dataset. We previously attempted Indian classical dance identification using an SVM classifier, adaptive graph matching, a traditional ANN, and a deep ANN. Among these classifiers, the deep ANN achieved the highest recognition rates, but at a lower speed. Real-time implementation of ICD identification, however, demands a high-speed classifier. To improve the speed of recognition, an Adaboost classifier was introduced. Even though its recognition is fast, the classification results were found to be somewhat unreliable at times. Hence, this paper uses a CNN to classify ICD poses at a faster speed with the best recognition rates.
The proposed CNN model is applied to both the online and offline Indian classical dance databases for classification. The online database is downloaded from YouTube and the offline database is created in our laboratory as mentioned in Section 1. Each dance pose image in the dataset is preprocessed by reducing its dimensions, which improves the computational speed of the CNN.
4.1. Batch I: CNN Training with Only One Set of Data
Training of our proposed CNN model is done in three batches. In Batch I, only one set of data is used, that is, 200 poses performed by one classical dancer and captured with a DSLR camera for 2 seconds each at 30 fps, forming a dataset with a total of 12000 images. The images are preprocessed and training is initiated using our proposed CNN architecture. Similarly, the online data is preprocessed and training is initiated. The CNN algorithm is implemented in Python 3.6 on a high-performance computing (HPC) machine with 6 combined CPU-GPU nodes.
The CNN is trained using a gradient-descent algorithm in two stages. Stage one is the feedforward pass, which handles the multiclass classification problem with $N$ training samples from $c$ classes. Stage two is the backpropagation pass. The error function is computed as
$$E^{N} = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^{n} - y_k^{n}\right)^{2},$$
where $t_k^{n}$ is the $k$th dimension of the label of the $n$th pattern and $y_k^{n}$ is the value of the corresponding layer unit. The output of the convolutional layer is the tanh activation function of this value. The backpropagation pass proceeds from higher to lower layers: the error in the $\ell$th layer is calculated from the error in the layer above it, and the weights in the $\ell$th layer are updated according to this error.
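The two training stages can be illustrated numerically. The sketch below assumes a squared-error function, fully connected layers with tanh activations, and plain gradient descent; it computes the error, propagates the error signal from a higher layer to a lower one, and applies a weight update. The function names, the learning rate, and the layer shapes are hypothetical and chosen only for illustration, and the authors' implementation may differ.

```python
import numpy as np


def squared_error(targets: np.ndarray, outputs: np.ndarray) -> float:
    """E = 1/2 * sum over patterns and classes of (t - y)^2."""
    return 0.5 * float(np.sum((targets - outputs) ** 2))


def output_delta(targets: np.ndarray, outputs: np.ndarray) -> np.ndarray:
    """Error signal at a tanh output layer: (y - t) * (1 - y^2)."""
    return (outputs - targets) * (1.0 - outputs ** 2)


def backprop_delta(delta_next, w_next, activations):
    """Error signal one layer lower: (delta_next W_next^T) * tanh'(u)."""
    return (delta_next @ w_next.T) * (1.0 - activations ** 2)


def sgd_step(weights, inputs, delta, lr=0.01):
    """Gradient-descent update W <- W - lr * X^T delta."""
    return weights - lr * inputs.T @ delta


# Tiny example: 4 patterns, 3 input features, 2 classes (all hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w = 0.1 * rng.normal(size=(3, 2))
t = np.eye(2)[[0, 1, 0, 1]]               # one-hot labels

y = np.tanh(x @ w)                         # feedforward pass
loss_before = squared_error(t, y)
w = sgd_step(w, x, output_delta(t, y))     # backpropagation pass
loss_after = squared_error(t, np.tanh(x @ w))
print(loss_before, loss_after)
```

Repeating the feedforward and backpropagation passes over many epochs drives the error down; in a full CNN the same error signal is passed further down through the pooling and convolutional layers.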
During the separate training on online and offline data, different feature maps were observed at different layers. Figures 3(a) and 3(b) visualize the feature maps of one offline dance pose frame obtained in convolutional layers 1 and 2 with 16 filters, and Figures 3(c) and 3(d) show the same for an online dance pose frame.
Low level features like lines, edges, and corners are learned in convolutional layers 1 and 2. High level features learned in convolutional layers 3 and 4 are visualized in Figure 4. Stochastic pooling, which combines the advantages of both mean and max pooling, is implemented; it also helps to overcome the problem of overfitting. Increasing the number of pooling layers increases information loss substantially. Hence, stochastic pooling is implemented in only two layers, by calculating the probability values of each region.
In Batch I, one dataset is used for training. Testing was carried out in two cases. In Case I, the same dataset is used for testing (i.e., training and testing were done on the same dataset); in Case II, a different dataset is used. In both cases, good recognition rates were obtained; they are tabulated in Table 2.
4.2. Batch II: CNN Training with Five Sets of Data
In this batch, five sets of data created from five classical dancers are used for training. The dataset consists of 200 Indian classical dance poses performed by five native classical dancers for 2 seconds each at 30 fps. Training is performed on the HPC machine for 100 epochs. Testing is carried out in two cases, as in the previous section. Case I of testing uses the same dataset that is used in training. For Case II of testing, the sixth dataset is used. The recognition rates acquired with both online and offline data are tabulated in Table 2. Increasing the number of training datasets yields higher recognition rates than Batch I training. It is also observed that the accuracy in recalling a pose increases substantially as the number of training datasets increases. However, the training time is 50% longer than in Batch I.
4.3. Batch III: CNN Training with Eight Sets of Data
Further improvement in recognition rates is achieved by increasing the amount of training data given to the CNN. A total of ten datasets were created, of which 8 were used for training and 2 for testing. Training with this batch increased the recognition rates. Figure 5 shows the training accuracy versus validation accuracy plot for the Batch III training set. It shows that the validation accuracy is good, with little overfitting.
Figures 6(a) and 6(b) plot the losses during training of Batch III on offline data and online data, respectively. There is only a small difference between the training and validation losses, and the overall loss is low. An average confusion matrix is generated based on the recognition rates and the number of matches; the matrices for Batch III of training and Case II of testing on offline data and online data are shown in Figures 7(a) and 7(b), respectively. For better visualization, they are shown for only a limited number of ICD poses. Batch III, trained on 8 datasets, shows better recognition of dance poses than the other two batches. However, this recognition comes at the cost of training computation time. Real-time recognition takes 0.4 s per frame, which is quite fast compared with algorithms like SVM and fuzzy classifiers.
(a) Average confusion matrix generated in Batch III of training with Case II of testing for offline data
(b) Average confusion matrix generated in Batch III of training with Case II of testing for online data
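A confusion matrix and the recognition rate derived from it can be computed as in the minimal sketch below. The labels are made-up values for three hypothetical pose classes and are not results from this work; the function names are likewise illustrative.

```python
import numpy as np


def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true pose class, columns the predicted class."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[t, p] += 1
    return matrix


def recognition_rate(matrix: np.ndarray) -> float:
    """Percentage of frames whose predicted class equals the true class."""
    return 100.0 * np.trace(matrix) / matrix.sum()


def per_class_recall(matrix: np.ndarray) -> np.ndarray:
    """Fraction of each true class that was recognized correctly."""
    return matrix.diagonal() / matrix.sum(axis=1)


# Made-up labels for three pose classes (illustration only).
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)
print(cm)
print(recognition_rate(cm), per_class_recall(cm))
```

Averaging such matrices over the test subjects gives an average confusion matrix of the kind shown in the figures, with the diagonal entries carrying the correctly recognized poses.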
All convolutional layers are implemented with filter windows of different sizes. Reducing the filter size improves the recognition rates but increases the computational time due to the increase in the number of filters. We therefore used a different convolutional window size for each of the conv1, conv2, conv3, and conv4 layers. Table 3 compares the performance obtained with different filtering window sizes.
Adopting stochastic pooling attained an average recognition rate of 93.33%, whereas max pooling and mean pooling produced recognition rates of 91.33% and 89.84%, respectively.
To further assess the robustness and efficiency of Indian classical dance identification with CNN, it is compared with other classifiers used in our previous works. For faster recognition, we used SVM and ended up with very low classification rates. We then replaced SVM with an Adaboost classifier and found moderate recognition rates. Later, we used the adaptive graph matching (AGM) model on the data and found good recognition rates. We also tried a traditional artificial neural network (ANN), which failed to produce good recognition rates. Good classification scores were then achieved using deep ANN for Indian classical dance identification. The proposed CNN model is also compared with the well-known CNN architectures LeNet and VGG Net. From Tables 4 and 5, it is observed that the proposed CNN architecture performs well in identifying the correct label of a particular dance pose.
Replacing ANN with deep ANN further improved the recognition accuracy, with a reported increase in recognition rate of 5%. With convolutional neural networks, a further improvement of 4% in recognition accuracy and an increase of 15% in testing speed were observed in this work. Even though the CNN takes more time to train, testing takes comparatively far less computation time. Recognition rates obtained with different classifiers for offline data and online data are compared in Tables 4 and 5, respectively. Hence, CNNs are a suitable tool for accurate Indian classical dance identification. Testing is done on a 64-bit CPU with 4 GB of RAM in Python 3.6 with the OpenCV and Keras deep learning libraries.
5. Conclusion
CNN is a powerful artificial intelligence tool for pattern classification. In this paper, we proposed a CNN architecture for classifying Indian classical dance poses/mudras. The CNN architecture is designed with four convolutional layers, each with a different filtering window size, which improves the speed and accuracy of recognition. A stochastic pooling technique is implemented, which combines the advantages of both max and mean pooling. We created an offline classical dance database of 200 ICD poses performed by 10 classical dancers for 2 s each at 30 fps, generating a total of 120000 pose frames. Training is performed in different batches to assess robustness across the extensive training modes required for CNNs. In Batch III, training is performed with eight sets of data (i.e., 96000 video frames), which maximizes the recognition of ICD poses. The training and validation accuracies of this CNN architecture are better than those of the previously proposed ICD classification models, and low training and validation losses are observed. The average recognition rate of the proposed CNN model is 93.33%, which is higher than that of the other state-of-the-art classifiers.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
E. Rodolà, S. R. Bulò, T. Windheuser, M. Vestner, and D. Cremers, “Dense non-rigid shape correspondence using random forests,” in Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 4177–4184, Columbus, OH, USA, June 2014.
P. V. V. Kishore, M. V. D. Prasad, D. A. Kumar, and A. S. C. S. Sastry, “Optical flow hand tracking and active contour hand shape features for continuous sign language recognition with artificial neural networks,” in Proceedings of the 6th International Advanced Computing Conference, IACC 2016, pp. 346–351, Bhimavaram, India, February 2016.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS '12), pp. 1097–1105, Lake Tahoe, Nev, USA, December 2012.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th International Conference on Machine Learning (ICML '09), pp. 609–616, Montreal, Canada, June 2009.
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.
H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 801–808, December 2006.