Abstract

Automated teller machines (ATMs) are widely used to carry out banking transactions and are becoming one of the necessities of everyday life. ATMs facilitate withdrawal, deposit, and transfer of money from one account to another round the clock. However, this convenience is marred by criminal activities such as money snatching and attacks on customers, which increasingly threaten the security of bank customers. In this paper, we propose a video-based framework that efficiently identifies abnormal activities at ATM installations and generates an alarm during any untoward incident. The proposed approach uses the motion history image (MHI) and Hu moments to extract relevant features from video. Principal component analysis is used to reduce the dimensionality of the features, and classification is carried out with a support vector machine. Analysis has been carried out on different video sequences by varying the window size of the MHI. The proposed framework is able to distinguish normal activities from abnormal ones such as money snatching, harm to the customer in a fight, or an attack on the customer, with an average accuracy of 95.73%.

1. Introduction

An ATM is a computerized telecommunication device that gives the customers of a financial firm swift access to financial transactions in a public space without the need for a clerk or bank teller. The number of ATM installations is increasing dramatically to support billions of transactions. An increase in nefarious activities such as robbery, murder, and other crimes has created an urgent need for an effective system that can protect people as well as ATM installations [1, 2]. ATM installations are generally equipped with CCTV cameras that keep watch on activities. Unfortunately, CCTV alone is not sufficient to provide security because the cameras cannot recognize unusual behavior by themselves [3]; the monitoring authority therefore has to watch the feeds 24 × 7, which is a challenging task. What is needed is an advanced system that can effectively monitor and automatically recognize criminal activities in an ATM room and report them to the nearest monitoring firm before an offender can escape. Another approach could be an alarm system or electric buzzer: each ATM premise can be equipped with a buzzer that the user presses to signal a response group if an abnormal event takes place. Alarm systems may be ineffective, however, because each alarm must be handled by a main alarm response group, which first has to examine the type or nature of the event before any help can be requested. In addition, most alarms require a noticeable effort to operate, so the perpetrator may simply physically stop the victim from triggering the alarm or may take belligerent action against a victim who is seen to initiate an alarm signal. The absence of an automated security mechanism leaves law enforcement agencies with postincident forensic analysis. Often, law enforcement authorities become aware of a crime only several hours after the incident.
This is a major problem in urban as well as rural areas. Recently, a gruesome attack on a woman at an ATM in Bangalore city, India [4], brought the issue of security at such kiosks into focus (Figure 1(a)). The incident sent shock waves across the country and highlighted the need to tackle such brutal acts. In some cases, the ATM guard is also killed while trying to save the victim, because attackers are generally equipped with weapons such as machetes, guns, pistols, and iron rods and usually operate in groups. Figure 1(b) [5] depicts a guard tied down with a rope by two attackers in Bangalore city, India, and Figure 1(c) [6] depicts an attack on a man in Karachi city, Pakistan. Therefore, it is necessary to have an automated system that can proactively identify unusual behavior and generate an alarm.

Video-based human activity analysis has gained a lot of attention among researchers. The goal of human activity recognition is to analyze different activities automatically from an unknown video [7]. Analysis of activities involves recognition of motion patterns and generation of a high-level description of actions. Approaches such as manifold approaches, spatiotemporal interest points, motion history images, accumulated motion images, and the bag-of-words model have recently been used by many researchers for effective human action recognition and representation [8–12]. In this paper, we present a system that can improve on current surveillance systems. Using a CCTV camera, the system automatically recognizes different actions and the number of persons, distinguishing three classes (single normal, multiple normal, and multiple abnormal), and generates a signal accordingly. With our system, the offender is more likely to be caught red-handed because the police are informed about the crime instantly. In addition, the proposed system can be used to generate an automated alarm that alerts the security guard deputed at the ATM location, as well as other people around the premise, so that help arrives immediately. The paper is organized as follows. Section 2 presents related work and the background of this work. In Section 3, we present our proposed method. Section 4 presents the results and analysis of the proposed approach. Finally, conclusions are drawn in Section 5.

2. Literature Review

The intricacy at ATM booths described by COPS [1, 2] is the main motivation behind this research and inspired us to develop an effective security system. In this section, we present related work on video-based security systems that helped us build an efficient surveillance system. Various approaches have been proposed by researchers for human action recognition (HAR). Davis and Bobick [13] presented the use of temporal templates for recognizing human actions. References [7, 14] present detailed surveys on human motion and behavior analysis using the MHI and its variants. Other approaches, such as optical flow and random sample consensus (RANSAC) in [8], address the representation and recognition of human actions. For feature extraction, Hu proposed a theory popularly known as Hu moments, which are invariant to translation, scale, and rotation [15]. Bobick and Davis [16] showed the use of Hu moments for feature extraction from temporal templates. Hu moments are widely used shape descriptors because of their simplicity and low computational cost [17–19]. Various other descriptors, such as Fourier descriptors (FD) and Zernike moments, have also been proposed. Fourier descriptors are at a disadvantage when the image size varies, because the number of points also varies and the method becomes too computationally expensive to work in real time. Zernike moments are a more advanced alternative to Hu moments whose magnitude is invariant to rotation, but their computation time is also too high for real-time use [20]. Despite the availability of various feature extraction methods, we have used the conventional Hu moments for shape description of the MHI/MEI because they are computationally efficient compared with other descriptors. For a machine to learn these features, a classifier has to be used.
A variety of classifiers are available, such as the support vector machine (SVM), neural networks (NN), and the Bayesian classifier. Debard et al. [21] presented the identification of an abnormal event, namely a fall, using SVM. References [22–25] have shown the strong adaptive learning of SVM in video surveillance. Apart from learning from two classes [26], SVM has also been shown to perform multiclass classification, which helped us analyze multiple classes. Sometimes redundancy in the data comes inherently from the video. For instance, the MHI/MEI formed by the presence of two persons in a video may resemble that formed by a single obese person. This kind of data may reduce the learning accuracy of SVM. To address this problem, principal component analysis (PCA) has been used. Reference [27] showed the use of PCA with SVM for action recognition in video, and [28, 29] illustrated the use of PCA for dimension reduction. The main motive of this paper is to build a strong security framework that can work in a real-time environment at ATM booths or other similar premises.

3. The Proposed Methodology

The proposed system (Figure 2) uses computer vision techniques to recognize normal and abnormal behavior of persons. It assumes a scene in which objects move with respect to a fixed background, and each frame of video is processed as follows. First, a foreground extraction technique is used to obtain a clear silhouette of the people. Then a fixed-size window is used to record the MHI. The MHI captures the motion pattern of a person under different situations. To describe this pattern, Hu moments are used. The dimensionality of these features is further reduced by applying principal component analysis (PCA) to remove redundancies and make the system computationally efficient. Finally, a support vector machine predicts the most likely class and the result is displayed.

3.1. Feature Extraction
3.1.1. Background Subtraction and Motion History Image (MHI)

Background subtraction is used to extract the foreground objects from the video sequence. Generally, ATMs are installed in a closed enclosure where the background does not change over time. We use a frame without any moving object as the background frame. Subsequent frames are subtracted from this frame to obtain the moving objects.

Figures 3(a) and 3(b) represent the two images to be subtracted and Figure 3(c) shows the output binary image received after its conversion from grayscale to binary.
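The subtraction and binarization step described above can be illustrated with a minimal pure-Python sketch. It assumes grayscale frames stored as lists of rows; the function name `subtract_background` and the threshold of 30 are illustrative assumptions, not values taken from the paper.

```python
def subtract_background(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    `threshold` is a hypothetical value chosen for illustration only.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Toy 2 x 3 grayscale images: the middle column contains a moving object.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 200, 10], [12, 180, 10]]
diff_image = subtract_background(frame, background)
```

Small intensity changes (such as the value 12 against a background of 10) stay below the threshold and are treated as background, which is the reason for thresholding the absolute difference rather than testing for exact equality.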

After obtaining the foreground objects, we compute the MHI. Motion templates are an efficient way to record general movement or motion and are suitable for human activity recognition [16] and gesture recognition [30]. The MHI is a grayscale image in which pixel intensity is a function of the recency of motion in a video sequence. The pixel intensity ramps linearly with time, so brighter (whiter) values represent more recent motion locations. As an object moves, it leaves behind a motion history of its movements. With the passage of time, the old motion history of the object is eliminated so that old patterns do not get mixed with new ones. The MHI at any given time is given as

$$
\mathrm{MHI}(x, y, t) =
\begin{cases}
\tau_{\max}, & \text{if } D(x, y, t) \neq 0, \\
\mathrm{MHI}(x, y, t-1) - 1, & \text{if } D(x, y, t) = 0 \text{ and } \mathrm{MHI}(x, y, t-1) - 1 > \tau_{\min}, \\
0, & \text{otherwise},
\end{cases}
$$

where $D(x, y, t)$ is the intensity of pixel $(x, y)$ at time $t$ of diffImage, $\tau_{\max}$ is a constant representing the brighter pixel value, and $\tau_{\min}$ is a constant representing the less bright pixel value. The window size is $\tau_{\max} - \tau_{\min}$.

The MHI update procedure is shown in Algorithm 1.

Input: diffImage, τ_max, τ_min
Output: mhiImage
height = mhiImage.height();
width = mhiImage.width();
for y = 0 to height − 1 do
    for x = 0 to width − 1 do
        if diffImage.getPixel(x, y) ≠ 0 then
            mhiImage.setPixel(x, y) = τ_max;
            continue;
        end
        val = mhiImage.getPixel(x, y) − 1;
        if val > τ_min then
            mhiImage.setPixel(x, y) = val;
        else
            mhiImage.setPixel(x, y) = 0;
        end
    end
end

The mhiImage is initialized as a blank image in which all pixels are zero.
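As a hedged illustration of the update rule, the following pure-Python sketch applies it to images stored as lists of rows. The names `update_mhi`, `tau_max`, and `tau_min` are assumptions made for this example, and the constants 255 and 252 are arbitrary values that give a window of three frames; they are not the settings used in the experiments.

```python
def update_mhi(mhi, diff_image, tau_max, tau_min):
    """Update the motion history image in place for one new frame."""
    for y, row in enumerate(diff_image):
        for x, moving in enumerate(row):
            if moving != 0:
                mhi[y][x] = tau_max      # fresh motion gets the brightest value
            else:
                val = mhi[y][x] - 1      # older motion fades by one step
                mhi[y][x] = val if val > tau_min else 0
    return mhi


# One pixel that moves in the first frame and then stays still.
mhi = [[0]]                              # blank image: all zeroes
history = []
for moving in [1, 0, 0, 0, 0]:
    update_mhi(mhi, [[moving]], tau_max=255, tau_min=252)
    history.append(mhi[0][0])
```

In this toy run the pixel stays nonzero for exactly `tau_max - tau_min` frames before it is cleared, which is how the difference between the two constants acts as the window size.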

Feature extraction from the MHI using Hu moments is shown in Algorithm 2.

Input: mhiImage
Output: hu
moments = getMoments(mhiImage);
central_moments = getCentralMoment(moments);
norm_moments = getNormalizedCentralMoment(central_moments);
for i = 1 to 7 do
    hu[i] = getHuMoments(norm_moments, i);
end

3.1.2. Hu Moments

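Once the MHI is obtained, features need to be extracted from it. We have used Hu moments for this purpose. The Hu moments [15] obtained from the templates are known to yield reasonable shape discrimination in a translation- and scale-invariant manner. Hu moments provide seven values as extracted features from a given image. These moments are invariant to translation, scale, and rotation of an image. Of the seven invariants, six are absolute orthogonal invariants and the seventh is a skew orthogonal invariant. Hu moments are computed as follows:

$$
m_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} f(x, y), \qquad
\bar{x} = \frac{m_{10}}{m_{00}}, \qquad
\bar{y} = \frac{m_{01}}{m_{00}},
$$

$$
\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y), \qquad
\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\,1 + (p+q)/2}},
$$

$$
\begin{aligned}
h_1 &= \eta_{20} + \eta_{02}, \\
h_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2, \\
h_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2, \\
h_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2, \\
h_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right], \\
h_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}), \\
h_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
    &\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right],
\end{aligned}
$$

where $m_{pq}$ are the two-dimensional $(p+q)$th order moments of the image function $f(x, y)$, $(\bar{x}, \bar{y})$ is the centroid of the image $f(x, y)$, $\mu_{pq}$ are the central moments of the image $f(x, y)$, and $\eta_{pq}$ are the normalized central moments.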
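To make the moment computation concrete, here is a minimal pure-Python sketch that evaluates the standard seven Hu invariants for a small grayscale image. The function name `hu_moments` and the toy images are assumptions for illustration; the sketch is not the authors' Java/OpenCV implementation and it assumes the image has nonzero total intensity.

```python
def hu_moments(image):
    """Return the seven Hu invariants of a 2-D image given as a list of rows."""
    def raw(p, q):
        return sum((x ** p) * (y ** q) * v
                   for y, row in enumerate(image)
                   for x, v in enumerate(row))

    m00 = raw(0, 0)                      # must be nonzero (non-blank image)
    xc, yc = raw(1, 0) / m00, raw(0, 1) / m00

    def eta(p, q):
        # Central moment about the centroid, then scale normalization.
        mu = sum(((x - xc) ** p) * ((y - yc) ** q) * v
                 for y, row in enumerate(image)
                 for x, v in enumerate(row))
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


# The same L-shaped blob at two positions (shifted 2 right, 1 down).
blob = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 0],
    [0, 9, 0, 0, 0, 0],
    [0, 9, 9, 9, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
shifted_blob = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 9, 0, 0],
    [0, 0, 0, 9, 0, 0],
    [0, 0, 0, 9, 9, 9],
]
hu_a = hu_moments(blob)
hu_b = hu_moments(shifted_blob)
```

Because the central moments are taken about the centroid, the two feature vectors agree up to floating-point rounding, which illustrates the translation invariance mentioned above.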

The MHI can effectively record a motion or an activity that occurred in a small time interval, as shown in Figure 4, and Hu moments can describe an image by a set of seven values. We have therefore used the MHI for recording an activity and Hu moments for describing that activity; Figure 9 and Table 3 support this choice. The MHI algorithm of [31], presented in Section 3.1.1, has been applied to generate MHIs for our training and testing data sets, followed by computation of Hu moments using Algorithm 2 based on (2)–(8). Eight attributes, the seven Hu moments together with an area, are used to describe an image pattern. These eight attributes are fed to principal component analysis to further reduce the attribute set. Principal component analysis is an efficient technique for reducing high-dimensional data: it computes dependencies between the attributes in order to represent the data in a more tractable, lower-dimensional form without losing much information. WEKA [32] is used for applying PCA.
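The idea behind PCA can be sketched in a few lines of pure Python. The example below is not WEKA's implementation: it is a simplified illustration that finds only the first principal component by power iteration on the sample covariance matrix, and the function name `first_principal_component` and the toy data are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered attributes.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Two perfectly correlated attributes: one component carries all the variance.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(samples)
```

Here the second attribute is exactly twice the first, so a single projected value per sample preserves all of the variation; this is the kind of dependency between attributes that PCA exploits to shrink the attribute set.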

3.2. Action Classification Using Support Vector Machine

The usual task of machine learning is to search a usually very large space of candidate models for the one that best fits the data and prior knowledge. SVM is a machine learning tool and a widely used classifier in computer vision, bioinformatics, and so forth, owing to its high accuracy and its ability to deal with high-dimensional data. It is a popular method for classification, regression, and other learning tasks. We use LibSVM [33], a library for support vector machines. A typical use of LibSVM involves two steps: first, training on a data set to obtain a model file and, second, using the model file to predict the labels of a testing data set. The RBF kernel is used for both training and testing.
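For reference, the RBF kernel has the form $K(u, v) = \exp(-\gamma \lVert u - v \rVert^2)$. The short sketch below evaluates it in pure Python; the function name `rbf_kernel` and the value of `gamma` are illustrative assumptions, since the kernel parameter used in the experiments is not stated.

```python
import math


def rbf_kernel(u, v, gamma):
    """Evaluate K(u, v) = exp(-gamma * ||u - v||^2) for two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([0.2, 0.5], [0.2, 0.5], gamma=0.5)   # identical vectors
near = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=0.5)   # squared distance 1
far = rbf_kernel([3.0, 4.0], [0.0, 0.0], gamma=0.5)    # squared distance 25
```

The kernel equals 1 for identical feature vectors and decays toward 0 as they move apart, so `gamma` controls how local the influence of each training sample is.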

4. Experimental Results and Analysis

The system has been trained and tested using Java [34] and OpenCV [30, 35] on a computer with an Intel Core i3 2.13 GHz processor and 2 GB RAM, on video of 320 × 240 resolution, for different numbers of MHI frames. The system was tested against three classes (single normal, multiple normal, and multiple abnormal) over six videos (2 single of 10 seconds each, 2 multiple normal of 27 seconds each, and 2 multiple abnormal of 29 seconds each). We created our own data set for training and testing with seven actors (five boys and two girls) at a frame size of 320 × 240 and a frame rate of 25 fps. The system is trained using these videos for different numbers of MHI frames (Table 1). Testing is done on a different video from the ones used for training. The system was tested for different numbers of MHI frames (5, 10, and 15) (Table 2). Consider

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN},
$$

where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative values, respectively.
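As a small worked example, accuracy, precision, and recall for one class can be computed from the four counts using their standard definitions. The function name `classification_metrics` and the counts below are made up for illustration and are not taken from Table 4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from the four confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # fraction of predicted positives that are right
    recall = tp / (tp + fn)      # fraction of actual positives that are found
    return accuracy, precision, recall


# Hypothetical counts for a single class treated as "positive".
accuracy, precision, recall = classification_metrics(tp=40, tn=50, fp=5, fn=5)
```

For a three-class problem such as this one, each class is treated in turn as the positive class and the remaining two as negative, giving one set of values per class.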

Table 4 shows the values of TP, TN, FP, and FN for the three classes for the different MHI window sizes used on our testing data set. Table 5 gives the accuracy, precision, and recall of the three classes calculated using (12) by LibSVM over our testing data set. The comparison is made among three different numbers of MHI frames: 5, 10, and 15. An advantage of the MHI representation is that a range of time (in terms of frames) is encoded in a single frame. The number of frames used to form the MHI is very important because varying it may provide different information about the event. Hence, for an effective recognition system, it is necessary to identify a suitable window size (number of frames) for the MHI.

The system is tested on a video of 1 minute and 46 seconds that contains all three classes. The total number of sample MHI values (all classes) used for testing with the three MHI window sizes (5, 10, and 15 frames) is 720, 359, and 238, respectively. We have observed that a window size of ten is most appropriate for recognizing abnormal events. Apart from this testing, a 10-fold cross-validation on the data set of [6] has been done to support the correctness of the proposed methodology. The results are presented in Table 7. The color code for prediction results in the colored images is as follows: one person (normal working), green; multiple persons (normal working), blue; multiple persons (abnormal working), red. The MHIs and corresponding prediction results are shown in Figures 5, 6, and 7 for the different window sizes (5, 10, and 15). Table 6 shows the AUC values for the different classes and MHI frames. Figure 8 shows the corresponding ROC curves on the testing data set.

5. Conclusion

In this paper, we have presented a security framework for ATMs that can also be used in similar premises. In particular, the paper presents the recognition of normal and abnormal events at the ATM. The need for such a security system arises from the increasing crime rate at ATM booths and the lack of a comparable video surveillance system in the market. The system accuracy differs for different numbers of MHI frames. In our case, the overall prediction accuracy was 92.31% for 5 MHI frames, 95.73% for 10 MHI frames, and 89.07% for 15 MHI frames on our testing data set using LibSVM. The main reason for the lower accuracy with 5 MHI frames is that very few frames contribute to the formation of the MHI, so only a small part of an activity pattern is recorded in an image, which affects the recognition rate of that activity. With 15 MHI frames, a large number of frames contribute to the formation of the MHI, so when motion is frequent the previous activity pattern is more likely to be obscured by the subsequent one, producing a distorted pattern and a lower recognition rate. These problems were less pronounced with 10 MHI frames; hence the accuracy was better. Our system's overall accuracy would have been higher if we had removed the transition frames between normal and abnormal activities. Since such frames cannot be eliminated in real time, we have included this scenario. There is wide scope for future research. Various other feature extraction methods can be applied to test the accuracy of the system. Also, classifiers other than SVM can be used for the same purpose. Since our system is restricted to video only, our future work will also focus on audio-based recognition.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.