We present MOBISIG, a pseudosignature dataset containing finger-drawn signatures from 83 users captured with a capacitive touchscreen-based mobile device. The database was captured in three sessions resulting in 45 genuine signatures and 20 skilled forgeries for each user. The database was evaluated by two state-of-the-art methods: a function-based system using local features and a feature-based system using global features. Two types of equal error rate computations are performed: one using a global threshold and the other using user-specific thresholds. The lowest equal error rate was 0.01% against random forgeries and 5.81% against skilled forgeries using user-specific thresholds that were computed a posteriori. However, these equal error rates were significantly raised to 1.68% (random forgeries case) and 14.31% (skilled forgeries case) using global thresholds. The same evaluation protocol was performed on the DooDB publicly available dataset. Besides verification performance evaluations conducted on the two finger-drawn datasets, we evaluated the quality of the samples and the users of the two datasets using basic quality measures. The results show that finger-drawn signatures can be used by biometric systems with reasonable accuracy.

1. Introduction

One of the oldest ways of proving your identity is giving your signature. Many official documents require signatures from agreeing parties. Signature recognition can be divided into off-line (static) and online (dynamic) methods. Off-line systems work with images, so only the shape of the signature is available, whereas online systems also use information related to the dynamics of the signature. Due to this additional information, online systems outperform off-line systems [1].

Biometric systems can produce two types of errors: false rejections of genuine signatures (false rejection rate (FRR)) and false acceptance of forged signatures (false acceptance rate (FAR)). The overall system error is usually reported in terms of EER (equal error rate), which is defined as the system error rate when FAR and FRR are equal.

In signature database evaluations, two types of forgeries are considered: skilled and random forgeries. Skilled forgery evaluation uses the forgery samples available in the database (forgery samples are provided by forgers who know both the shape and dynamics of the imitated signature). Random forgery (or zero-effort) evaluation uses random genuine samples from the dataset (corresponding to the case when the forger does not know the signature to be forged and therefore uses his/her own signature). The state of the art in automatic signature verification is presented in a study by Impedovo and Pirlo [1].

Online signature recognition is not a new research area: several online signature corpora have already been collected using digitizer tablets. While PHILIPS [2], SVC2004 [3], and SUSIG [4] databases contain only online signatures, MCYT [5] is a bimodal database containing both fingerprints and online signatures. BIOMET [6] and BioSecurID [7] contain several types of biometric data including online signatures.

Due to the increasing number of touchscreen-based mobile devices and the familiarity of users with signatures, we consider signatures to be plausible candidates for authentication on mobile devices. Several studies have already been conducted on this topic, although the signature databases are not publicly available [8-10], except for the DooDB database [11]. Compared to the DooDB database, which was collected on a device using a resistive touchscreen, our database was collected on a device with a capacitive touchscreen. Specifically, while DooDB contains only the coordinates of the points of the signature and the corresponding time information, our MOBISIG database contains additional information such as pressure, finger area, and data saved from the accelerometer and gyroscope.

In this paper, we analyze whether signatures can be used for authentication in a mobile device context. Therefore, two state-of-the-art methods, a local- or function-based and a global- or feature-based system, were implemented and evaluated on the MOBISIG database. In order to compare our database to the other publicly available online signature database, we performed the same evaluations for the DooDB database using the same parameters and features. In addition, we present a few basic quality evaluations for both databases.

The main contribution of this paper is the presentation and analysis of the MOBISIG signature database containing data from 83 users. The signatures are not the original signatures of the users, but users were assigned a family name and were required to create a signature for that name. Signature collection was performed on a Nexus 9 tablet under supervision; data providers were instructed on how to draw signatures using their finger. The database was collected during three sessions and contains 45 genuine and 20 forged signatures for each user. The database is publicly available at http://www.ms.sapientia.ro/~manyi/mobisig.html.

On the MOBISIG database, the best EER for skilled forgeries was obtained by our function-based DTW system: 5.81% for a posteriori user-specific thresholds and 20.82% for common thresholds. When we added pressure information to the coordinates and their first- and second-order differences, only the a posteriori user-specific threshold result improved. When using a device with a 9-inch screen for data collection, users tend to put the device down on the table while drawing the signatures. Therefore, we did not use the data obtained from the accelerometer and gyroscope sensors in the computations.

Comparisons with the DooDB database indicate higher signature quality in the case of the MOBISIG database, and correspondingly better performances for the verification methods studied in our paper. Our study is limited by the sample size (83 users) and a slightly unbalanced age distribution (77% of the users aged under 25); in addition, our data providers were not experts in forging signatures.

The rest of the paper is organized as follows. A literature review on signature recognition on mobile devices is presented in Section 2, completed with a review of a few papers on signature quality evaluation. Section 3 presents our MOBISIG dataset, followed by a detailed description of the methods used for signature verification in Section 4. Experiments and benchmark results are presented in Section 5, whereas Section 6 compares the DooDB and MOBISIG datasets in terms of verification system performance and quality measures. Section 7 concludes the paper.

2.1. Signature Recognition on Mobile Devices

Little research has been carried out in the field of online signature recognition on mobile devices. We have only found six studies [8-13] reporting results obtained on signature databases captured in a mobile context. The properties of the databases used in these studies are presented in Table 1.

In most of the studies concerning signature recognition, results are reported using signature databases captured on a pen tablet. However, touchscreens present some drawbacks compared to pen tablets, the most important being the quality of the captured signal. While pen tablets sample the signal uniformly with relatively high frequency, hand-held device sampling is usually event-driven with lower sampling frequency than pen tablets. Moreover, while both touchscreen devices and pen tablets are able to capture trajectory and pressure, only the latter can also track pen orientation. The only advantage of a touchscreen device is that it allows capturing the signature by fingertip.

One of the objectives of BioSecure Signature Evaluation Campaign (BSEC’2009) was to study the influence of acquisition conditions (digitizing tablet or PDA) on authentication systems’ performance [14]. Results are reported using signatures from 382 writers, acquired on a digitizing tablet and on a PDA, respectively. The authors reported a significant quality degradation of signatures acquired in mobile conditions.

The semester thesis of Bissig [12] is the first study reporting results using a signature database captured on a resistive touchscreen with fingertip. Four signals were acquired: the x and y coordinates, pressure, and finger area. Both local (function-based: DTW) and global systems (feature-based: one-class SVM and Mahalanobis distance) and the combination of these were evaluated. Unfortunately, neither the number of subjects nor the number of forgeries in the captured database is reported. However, this is the only study reporting the influence of pressure on the performance of a signature verification system.

Houmani et al. [8] report results on a new dataset collected from 64 subjects on a PDA. Unfortunately, neither the number of sessions nor the acquired signals are reported. However, they propose an entropy-based quality metric for selecting reference signatures in the enrollment phase.

Krish et al. [9] collected a new signature database from 25 users (20 genuine signatures per user) using a Samsung Galaxy Note device. Their verification algorithm combines two state-of-the-art algorithms (function-based DTW and feature-based Mahalanobis distance). Because the database contains no forgery samples, only the results obtained by random forgery evaluations are reported.

Sae-Bae and Memon [10] collected an online signature dataset from 180 users using HTML5. Users were allowed to enter their signatures using their own iOS devices. The dataset is not publicly available and contains only genuine signatures; therefore, only the random forgery-type evaluation was feasible. They proposed a new histogram-based feature set and reported performance evaluations both on their and MCYT datasets.

The first publicly available database collected on a hand-held device (HTC Touch HD mobile phone) is the DooDB. This database contains data from 100 users, and it also contains doodles besides pseudosignatures. Martinez-Diaz et al. [11] report the database analysis and benchmark results, using a function-based DTW verification system with several local features. Although the EERs obtained by the random forgery evaluations are low (around 3%), those obtained for skilled forgery evaluations are much higher (around 27%). In a later study [13], the skilled forgery result was improved (20.9%) by using the Gaussian mixture method.

2.2. Signature Quality

Quality evaluation of biometric datasets is a difficult problem. A biometric dataset consists of biometric samples from a number of users, usually containing a fixed number of samples from each user collected in a fixed number of sessions. Moreover, signature datasets contain skilled forgery samples for each user.

There are two ways to achieve the quality evaluation of a biometric dataset: (i) evaluating each sample of the dataset and (ii) evaluating each user of the dataset. In each case, we obtain a set of scores, and then an average score can be computed from these scores. Both the samples and the users can be evaluated by using only the genuine signatures or by using both the genuine and forgery signatures. Müller and Henniger [15] proposed two quality metrics for signature dataset evaluation. One of the quality metrics evaluates the samples, while the other one evaluates the users of the dataset. Both metrics use the DTW distance between samples.

Houmani et al. [16] proposed a personal entropy measure for online signatures and showed the existence of a clear relationship between the proposed measure and the verification performance of the user revealed by the signature verification system. This measure allowed them to categorize the users of several signature datasets. In a later study [17], they adapted the measure to the skilled forgery samples of signature datasets. Similar to their previous study, they proved the effectiveness of the quality measure by evaluating several online signature databases by using state-of-the-art signature verification methods.

One of the objectives of the BSEC’2009 competition was the evaluation of online signature algorithms with respect to the quality of the signatures [14]. The personal entropy measure introduced by Houmani et al. was used to group the signatures into different categories. The results of the competition showed that the performance of classifiers varied significantly with respect to the good and bad quality signatures. Houmani and Garcia-Salicetti [18] extended the Biometric Menagerie to online signatures and categorized the users of MCYT database using the Personal Entropy quality measure.

Kahindo et al. [19] proposed a novel signature complexity measure to select reference signatures for online signature verification systems. Guest and Henniger [20] used commercial engines for the assessment of the quality of handwritten signatures. They concluded that predicting the utility of a signature sample using a multifeature vector was possible. More recently, another novel method was proposed for the quality evaluation of off-line signatures [21].

3. The MOBISIG Database

For security reasons (people are reluctant to give their own signatures), participants were asked to create a signature for a given family name. Family names were selected from the 100 most frequent Hungarian family names. Participants were also asked to practice the created signatures by drawing and deleting several attempts. The first five attempts were deleted.

The database contains signatures from 83 subjects: 49 men and 34 women, with the following age distribution: 64 subjects under 25, 12 between 25 and 40, and 7 over 40.

3.1. Data Acquisition Protocol

Data collection was performed using a Nexus 9 tablet. The device measures 228.2 × 153.7 × 8 mm (8.98 × 6.05 × 0.31 in.) and has a capacitive touchscreen. Signatures were sampled at about 60 Hz (event-driven sampling) when the users pressed the screen. The resolution of the screen is 1536 × 2048 pixels (approx. 281 ppi pixel density). Each signature was stored as a sequence of discrete values $(x_i, y_i, t_i, p_i, fa_i, v_{x,i}, v_{y,i}, a_{x,i}, a_{y,i}, a_{z,i}, g_{x,i}, g_{y,i}, g_{z,i})$, $i = 1, \ldots, n$, where $(x_i, y_i)$ are the coordinate values, $t_i$ is the time stamp, $p_i$ and $fa_i$ are the pressure and finger area (these values are normalized to [0, 1] and can be obtained through the standard Android API), $(v_{x,i}, v_{y,i})$ are the directional velocities, $(a_{x,i}, a_{y,i}, a_{z,i})$ are the directional accelerations of the device, and $(g_{x,i}, g_{y,i}, g_{z,i})$ are the values obtained from the gyroscope sensor. The accelerations and the values obtained from the gyroscope characterize the holding position of the device.

The screen of the device was divided into two sections (Figure 1): the upper section was the replay section, where users were shown the animated signature, and the lower section was designed to draw the signature. The animation functionality was available in both types of signature collections: genuine and forgery. The animation allows participants to recall the shape and the dynamics of genuine and forged signatures. The animation could be replayed any number of times. Before data collection, users were asked to become familiar with the device usage as well as their pseudosignatures. Additionally, signatures were saved only after the user accepted them. Any signature could be deleted by the provider if he/she was not satisfied with the result.

The data collection process was divided into three sessions with one week between consecutive sessions. In the first session, each user had to provide 15 genuine pseudosignatures for the assigned name. In the second and third sessions, participants had to provide 15 genuine pseudosignatures and 10 forgeries for two assigned users (two times 5 forgeries). At the end of the data collection process, we had 45 genuine signatures and 20 forgeries for each participant. A few of these signatures are shown in Figure 2.

3.2. Signature Files

Each user has a dedicated folder which contains the 45 genuine signatures of the user and the 20 forgeries made by other users. The file name encodes four fields: a type tag T, which distinguishes forgeries from genuine signatures; the identifier of the user whose signatures are in the folder; the identifier of the user who gave the signature; and a sequence number. The two user identifiers are equal in the case of genuine signatures, while they are different in the case of forgeries. The sequence number at the end of the filename is a value from 1 to 45 for genuine signatures and 1 to 20 for forgeries. The first 15 genuine signatures were collected in the first session, the second 15 signatures in the second session, and the third 15 signatures in the third session. The first 10 forgeries were collected in the second session and the second 10 forgeries in the third session. Each folder name identifies its user.

Each signature is represented as a sequence of points and is stored in a file. Each line of the file represents one point of the signature and consists of the following features: x-coordinate, y-coordinate, time stamp, pressure, finger area, x-velocity, y-velocity, x-acceleration, y-acceleration, z-acceleration, x-gyroscope, y-gyroscope, and z-gyroscope.
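The per-point file layout described above can be read with a few lines of code. The following is a minimal sketch, not the authors' tooling: it assumes comma-separated fields and treats any non-numeric row as a header, neither of which is stated in the text, and the short column names are hypothetical labels for the thirteen listed features.

```python
import csv
import io
from typing import Dict, List

# Column order as listed in the text; the short names are hypothetical labels.
COLUMNS = ["x", "y", "timestamp", "pressure", "finger_area",
           "vx", "vy", "ax", "ay", "az", "gx", "gy", "gz"]


def parse_signature(text: str, delimiter: str = ",") -> List[Dict[str, float]]:
    """Parse one signature: one point per line, 13 numeric fields per point."""
    points = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        try:
            values = [float(v) for v in row]
        except ValueError:
            continue  # assumption: a non-numeric row is a header line
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
        points.append(dict(zip(COLUMNS, values)))
    return points
```

Reading from a string rather than a path keeps the sketch independent of the actual file names, which are not reproduced here.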

4. Methods

In order to assess the authentication performance based on pseudosignatures, both a function-based and a feature-based verification system were implemented. Features used by signature verification systems can be either local or global. Local features correspond to sample points along the signature’s trajectory (e.g., point-wise pressure). Global features are computed from the signature as a whole (e.g., duration). Function-based systems use local features and feature-based systems use global features.

4.1. Function-Based Verification

The first system is based on DTW (dynamic time warping), and it compares the captured time sequences $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_m)$ (Algorithm 1). Each signature is represented as a time sequence $S = (s_1, \ldots, s_n)$ with $s_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, where $d$ is the number of local features. In this work, the following local features were employed: the x, y coordinates, the first-order differences, the second-order differences, and the pressure and its first-order difference. Before computations, the features of the time sequences were standardized ($\hat{s}_{i,j} = (s_{i,j} - \mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the $j$th local feature computed over all sampling points of the signature). The Euclidean distance function was used to compute the distance between two elements of the time sequences, $\lVert s_i - t_j \rVert_2$. Finally, the obtained DTW distance was divided by the sum of the time sequence lengths, $n + m$.

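Algorithm 1: DTW distance between two time sequences.
procedure DTWDist(S[1..n], T[1..m])
  D[0, 0] ← 0
  for i ← 1 to n do
    D[i, 0] ← ∞
  end for
  for j ← 1 to m do
    D[0, j] ← ∞
  end for
  for i ← 1 to n do
    for j ← 1 to m do
      cost ← ‖S[i] − T[j]‖
      D[i, j] ← cost + min(D[i−1, j], D[i, j−1], D[i−1, j−1])
    end for
  end for
  return D[n, m]
end procedure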

The verification process works as follows: in the enrollment stage, a set of N reference signatures is selected. In the verification stage, the DTW distances between the test signature and all the reference signatures are computed, and the final score results as the average of these distances. Finally, this distance-based score is transformed into a similarity score using equation (2).

The architecture of our function-based system is shown in Figure 3.
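The function-based pipeline described in this section (per-signature standardization, DTW with a Euclidean point distance, normalization by the sum of the sequence lengths, and averaging over the reference signatures) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the population standard deviation is assumed, constant features are left unscaled by assumption, and the final distance-to-similarity transformation is omitted.

```python
import math
from typing import List, Sequence

Signature = Sequence[Sequence[float]]  # n sampling points, d local features each


def standardize(seq: Signature) -> List[List[float]]:
    """Z-score each local feature over all sampling points of one signature."""
    n, d = len(seq), len(seq[0])
    means = [sum(p[j] for p in seq) / n for j in range(d)]
    # Population standard deviation; a zero deviation is replaced by 1 (assumption).
    stds = [math.sqrt(sum((p[j] - means[j]) ** 2 for p in seq) / n) or 1.0
            for j in range(d)]
    return [[(p[j] - means[j]) / stds[j] for j in range(d)] for p in seq]


def dtw_distance(s: Signature, t: Signature) -> float:
    """DTW distance with Euclidean point cost, divided by len(s) + len(t)."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(s[i - 1], t[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)


def verification_distance(test: Signature, references: Sequence[Signature]) -> float:
    """Average DTW distance between the test signature and all references."""
    return sum(dtw_distance(test, r) for r in references) / len(references)
```

The quadratic table makes the cost of one comparison proportional to the product of the two sequence lengths, which is acceptable for signatures of a few hundred points.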

4.2. Feature-Based Verification

The second type of verification system is a feature-based or a global system, which computes a fixed size feature vector from each signature. Each feature is a global feature related to the signature as a whole. The components of our feature-based system are the following: feature extractor, user template creation, and matcher. No preprocessing was applied. However, before computations, features were scaled (separately for each user). The architecture of our feature-based system is shown in Figure 4.

Our feature extractor computed position- and time-based features [12, 22], such as duration and different types of velocity, as well as different types of sign change, a few sensor-related features (touchscreen pressure and finger area), and 2 types of histogram-based features. In the computation of histogram-based features, we followed the work of Sae-Bae and Memon [10] except that we used fewer features.

Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be the coordinates of a signature and $p = (p_1, \ldots, p_n)$ the pressure attribute. We compute the first- and second-order differences of these sequences as follows: $x'_i = x_{i+1} - x_i$, $y'_i = y_{i+1} - y_i$, and $p'_i = p_{i+1} - p_i$, where $i = 1, \ldots, n-1$, and $x''_i = x'_{i+1} - x'_i$, $y''_i = y'_{i+1} - y'_i$, and $p''_i = p'_{i+1} - p'_i$, $i = 1, \ldots, n-2$.

Angles $\theta_i = \operatorname{atan2}(y'_i, x'_i)$, $i = 1, \ldots, n-1$, were computed in order to derive a uniform-width histogram from this sequence. The trigonometric function atan2 is a common variation of the standard arctan function, which produces results in the range $(-\pi, \pi]$. Angles characterize the shape of the signature. This interval of angles was divided into 8 equal bins in order to compute the histogram.

The $r_i = \sqrt{{x'_i}^2 + {y'_i}^2}$, $i = 1, \ldots, n-1$, sequence captures the speed distribution, and it is considered very useful in combating skilled forgeries [10]. In the histogram computation, we considered only the values from a bounded interval defined in terms of μ and σ, where μ represents the mean and σ the standard deviation of the r sequence obtained from a single signature. This interval was divided into 16 equal bins for the histogram computation. The full list of the features is presented in Table 2.
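The two histogram-based feature groups (an 8-bin angle histogram and a 16-bin speed histogram) can be illustrated with a short sketch. Several details here are assumptions rather than facts from the text: forward differences are used, bins hold relative frequencies, and the speed interval is taken as [max(0, μ − kσ), μ + kσ] with a hypothetical parameter k = 3. The function names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Relative-frequency histogram with equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    kept = 0
    for v in values:
        if width > 0 and lo <= v <= hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
            kept += 1
    return [c / kept for c in counts] if kept else [0.0] * bins


def histogram_features(x: Sequence[float], y: Sequence[float], k: float = 3.0) -> List[float]:
    """Return 8 angle bins followed by 16 speed bins for one signature."""
    if len(x) < 2 or len(x) != len(y):
        raise ValueError("need two coordinate sequences of equal length >= 2")
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    theta = [math.atan2(v, u) for u, v in zip(dx, dy)]  # direction of each step
    r = [math.hypot(u, v) for u, v in zip(dx, dy)]      # length of each step
    mu, sigma = mean(r), pstdev(r)
    angle_hist = histogram(theta, -math.pi, math.pi, 8)
    # Assumed interval: values outside it are discarded before binning.
    speed_hist = histogram(r, max(0.0, mu - k * sigma), mu + k * sigma, 16)
    return angle_hist + speed_hist
```

Discarding values outside the interval keeps a few extreme steps from stretching the bins, which is one plausible reason for bounding the range by μ and σ.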

In feature-based methods, each signature is represented by a D-dimensional feature vector. After selecting the signatures used for template creation, features were scaled. We applied min-max normalization to each feature before each template creation. In order to be able to apply the same scaling to the test signature before matching, the scaling parameters (max and min values for each feature) were stored as part of the user’s template. User-based scaling of the data reflects real biometric verification systems, because the system does not take into account the data of other users. By contrast, the performance of the system is increased significantly by normalizing all users’ data in a single step; this type of normalization was already evaluated for the DooDB database [23].
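User-based scaling as described above amounts to fitting the per-feature minimum and maximum on one user's template signatures and reusing them for every test signature of that user. A minimal sketch, with hypothetical function names and the assumption that a constant feature maps to 0:

```python
from typing import List, Sequence, Tuple


def fit_minmax(train: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-feature min and max over one user's template signatures."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_minmax(vec: Sequence[float], mins: Sequence[float],
                 maxs: Sequence[float]) -> List[float]:
    """Scale a feature vector with stored template parameters."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0  # assumption for constant features
            for v, lo, hi in zip(vec, mins, maxs)]
```

Because the parameters come only from the template signatures, a test signature can fall outside [0, 1] after scaling; the sketch does not clip such values.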

In data mining, anomaly detection (also outlier detection) is the identification of observations which do not conform to an expected pattern in a dataset [24] and therefore can be used for impostor detection. Killourhy and Maxion [25] evaluated 14 anomaly detectors for keystroke dynamics biometrics.

Four anomaly detectors were implemented for template creation and matching. Each anomaly detector uses a specific template and score computation (see below). Dissimilarity scores were transformed into similarity scores by using formula (2) as for the DTW algorithm.

In the training phase, user models (templates) are constructed using a fixed number of training samples. Testing or verification refers to an anomaly score computation for the test sample, which in our case is a distance from the user model.

The classic Euclidean anomaly-detection algorithm in the training phase calculates the mean vector of the training samples, while in the test phase it calculates the Euclidean distance between the mean vector and the test sample.

The Manhattan detector is similar to the Euclidean detector except that the Manhattan distance is used in the testing phase. Another variant of the Manhattan detector is the Manhattan scaled detector, described in Araujo et al. [26]. In the training phase, besides the mean vector of the training samples $\mu = (\mu_1, \ldots, \mu_D)$, the mean absolute deviation of each feature is also calculated, $a = (a_1, \ldots, a_D)$. The anomaly score is calculated using the formula $\sum_{i=1}^{D} |x_i - \mu_i| / a_i$, where x is a test sample $x = (x_1, \ldots, x_D)$, and the model comprises the mean vector of the features and the mean absolute deviation of the features: $(\mu, a)$.

The k-nearest neighbour (kNN) detector works as follows: in the training phase, the detector saves the training samples (reference-based system). In the test phase, the detector calculates the Euclidean distance between each of the training samples and the test sample. The anomaly score is calculated as the mean of the distances to the k-nearest training samples.
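The four distance-based detectors described above differ only in the template they keep and in the distance they apply. The following sketch illustrates them under stated assumptions: the function names are hypothetical, the value of k is not given in the text (k = 1 is a placeholder default), and a small eps guards against a zero mean absolute deviation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(train: Sequence[Vector]) -> List[float]:
    """Template of the Euclidean and Manhattan detectors."""
    return [sum(col) / len(train) for col in zip(*train)]


def euclidean_score(train: Sequence[Vector], x: Vector) -> float:
    return math.dist(mean_vector(train), x)


def manhattan_score(train: Sequence[Vector], x: Vector) -> float:
    return sum(abs(v - m) for v, m in zip(x, mean_vector(train)))


def manhattan_scaled_score(train: Sequence[Vector], x: Vector, eps: float = 1e-12) -> float:
    """Manhattan distance with each term divided by the feature's mean absolute deviation."""
    mu = mean_vector(train)
    mad = [sum(abs(v - m) for v in col) / len(train)
           for col, m in zip(zip(*train), mu)]
    return sum(abs(v - m) / max(a, eps) for v, m, a in zip(x, mu, mad))


def knn_score(train: Sequence[Vector], x: Vector, k: int = 1) -> float:
    """Mean Euclidean distance to the k nearest training samples."""
    nearest = sorted(math.dist(t, x) for t in train)[:k]
    return sum(nearest) / len(nearest)
```

All four return a dissimilarity, so a larger value means the test sample is less likely to belong to the enrolled user.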

One-class SVM is an outlier detector based on the support vector method. We used the LibSVM implementation [27] provided by e1071 package in R, with the following parameters: type = one-classification (for novelty detection), kernel = “radial,” and gamma = 0.04. The SVM parameter nu was set to 0.4, and parameters were not tuned for individual users.

5. Benchmark Results

5.1. Evaluation Protocol

The same evaluation protocol, consisting of three measurements, was used for both verification systems. The systems were trained with the first 5, 10, and 15 samples from the first session, respectively. The 15 genuine signatures from the second session were used for positive score computation. All of the 20 available forgeries per user were used for skilled forgery score computations. Random forgery scores were computed for each user by using the first genuine signature from all the other users (Table 3).

Two types of EERs were computed. The global EER was computed based on a global threshold, which was computed using common genuine and forgery score lists for all users of the database. The second type of EER, the a posteriori user-specific EER, is reported in some recent papers [13, 28]. This type of EER is computed by averaging the user-specific EERs. User-specific EERs are computed independently for each subject of the dataset and therefore are based on user-specific decision thresholds. Martinez-Diaz et al. [13] used the notation aEER for this type of EER. As we compare our results with their results, it is important to use the same notation. However, it is important to note that this type of EER is based on a posteriori user-specific thresholds [29]. Hence, the corresponding EER (aEER) represents the best global EER that would be obtained if optimal score normalization techniques were known a priori [13].
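The difference between the two error measures can be made concrete with a small sketch. It assumes similarity scores (higher means more genuine), sweeps every observed score as a candidate threshold, and reports the point where FAR and FRR are closest; the function names and the toy scores are hypothetical, and other EER interpolation schemes exist.

```python
from typing import Dict, List, Sequence, Tuple

Scores = Tuple[Sequence[float], Sequence[float]]  # (genuine, impostor) similarity scores


def eer(genuine: Sequence[float], impostor: Sequence[float]) -> float:
    """Error rate at the threshold where FAR and FRR are closest."""
    thresholds = sorted(set(genuine) | set(impostor))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap, best_rate = None, None
    for thr in thresholds:
        far = sum(s >= thr for s in impostor) / len(impostor)  # accepted forgeries
        frr = sum(s < thr for s in genuine) / len(genuine)     # rejected genuine
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


def global_eer(users: Dict[str, Scores]) -> float:
    """Pool the scores of all users and apply one common threshold."""
    genuine: List[float] = [s for g, _ in users.values() for s in g]
    impostor: List[float] = [s for _, i in users.values() for s in i]
    return eer(genuine, impostor)


def average_user_eer(users: Dict[str, Scores]) -> float:
    """aEER: mean of the EERs computed separately for each user."""
    return sum(eer(g, i) for g, i in users.values()) / len(users)
```

With two users whose score ranges are each separable but shifted relative to one another, every user-specific EER is zero while the pooled EER is not, which mirrors the gap between aEER and global EER reported in the results.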

5.2. Results
5.2.1. Training Set Size

The effect of the number of training samples during enrollment was investigated. Three cases were evaluated for each type of local and global features: using 5, 10, and 15 samples during enrollment.

Table 4 shows the function-based system results for the MOBISIG database. The five types of local feature sets were as follows: (i) $(x, y)$: the coordinate sequence; (ii) $(x', y')$: the first-order differences of the coordinate sequence; (iii) $(x'', y'')$: the second-order differences of the coordinate sequence; (iv) $(x, y, x', y', x'', y'')$: the coordinate sequence with first- and second-order differences; (v) $(x, y, x', y', x'', y'', p, p')$: the coordinate sequence with first- and second-order differences and pressure with first-order differences. Both types of evaluations, skilled and random forgeries, are reported.
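The five local feature sets, types (i) to (v), can be assembled from the raw coordinate and pressure sequences as sketched below. The text does not say how sequences of different lengths are aligned once differences are taken, so truncating all columns to the shortest one is an assumption, as are the forward differences and the function names.

```python
from typing import List, Sequence


def diff(seq: Sequence[float]) -> List[float]:
    """Forward first-order differences; one element shorter than the input."""
    return [b - a for a, b in zip(seq, seq[1:])]


def local_features(x: Sequence[float], y: Sequence[float],
                   p: Sequence[float], kind: str) -> List[List[float]]:
    """Build the time sequence for feature set type 'i', 'ii', 'iii', 'iv', or 'v'."""
    dx, dy, dp = diff(x), diff(y), diff(p)
    ddx, ddy = diff(dx), diff(dy)
    columns = {
        "i": [x, y],
        "ii": [dx, dy],
        "iii": [ddx, ddy],
        "iv": [x, y, dx, dy, ddx, ddy],
        "v": [x, y, dx, dy, ddx, ddy, p, dp],
    }[kind]
    n = min(len(c) for c in columns)  # assumption: truncate to the shortest column
    return [[c[i] for c in columns] for i in range(n)]
```

Each returned row is one element of the time sequence compared by DTW, so the row length equals the number of local features of the chosen type.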

In the case of skilled forgeries, the best global EER was obtained by using the time sequences only (20.82%). However, the best aEER was 5.81% using type (v) local features.

In the case of random forgeries evaluation, the best result was obtained by using the first differences of x and y time sequences (EER: 1.41% and aEER: 0.01%). The very low EERs obtained show that this type of function-based verification system is highly reliable against random forgeries.

As expected, the more samples used in the enrollment phase, the better the verification system performance was. This is true for each type of local feature and for both types of evaluations: skilled and random forgeries. However, the improvements are small, especially between using 10 and 15 samples for enrollment.

The results obtained for the feature-based system are reported in the following. In order to show the contribution of different categories of global features, we formed three feature sets from the features shown in Table 2: feat62, which contains all features; a second set obtained by removing the pressure and finger area features; and a third set obtained by removing the histogram-based features.

Table 5 shows the feature-based system results for the MOBISIG database using all features (feat62). As for the function-based system, we report results for using 5, 10, and 15 samples for enrollment. Similar to the function-based system, using more samples for enrollment improves the performance of the feature-based verification system. The performance gaps are higher for cases between 5 and 10 samples than for cases between 10 and 15 samples.

Comparing Table 5 with Table 4, it can be observed that the DTW system achieved far better performance in the case of random forgeries than the anomaly detectors. In the case of skilled forgeries, both the best global EER (14.31%) and the best aEER (9.35%) were obtained by the Manhattan detector. Interestingly, the differences between global EER and aEER are not as large as in the case of the DTW system. The ROC curves for the global EERs are shown in Figure 5.

5.2.2. Global Feature Sets

The results obtained for the three global feature sets are shown in Table 6. The best (lowest) error rates were obtained by using the Manhattan detector both for skilled and random forgery evaluations. In the case of this detector, using fewer features resulted in higher error rates. However, not all detectors were affected negatively by using fewer features. For example, the SVM detector performance improved when the dimension of the feature set was reduced, especially when the pressure-related features were excluded.

5.2.3. Intersession Variability

We evaluated the intersession variability for the MOBISIG dataset. Two cases were evaluated: (1) using session 1 for training and session 2 for testing and (2) using session 2 for training and session 3 for testing. The results for using 15 samples for enrollment are shown in Table 7. The performance differences between the two cases are between 0.16% and 3.30% and are highly dependent on the method used. For example, the system-wide EER improves for the DTW method (type (ii) features) in the case of session pair 2-3 (compared to session pair 1-2), but the aEER of the Manhattan detector does not follow the same tendency.

According to the results from Table 7, no general tendency of improvement or deterioration of verification results could be stated. The results might suggest similar signature quality in session pairs 1-2 and 2-3 considering verification performance, but further investigations are necessary.

6. DooDB and MOBISIG Comparison

In this section, comparative results are presented for our MOBISIG database and the other publicly available database, DooDB. The results are presented in terms of (i) their statistical properties, (ii) the verification system performances, (iii) score distributions, and (iv) some signature quality metrics.

6.1. Statistical Properties

Table 8 presents the most important characteristics regarding the data collection process of the two publicly available mobile device context signature databases.

6.2. Verification Performance

In the case of the function-based DTW system, only the coordinates and their first- and second-order differences were used (no pressure information is available in the DooDB database). The pressure-based features were omitted for the feature-based system; consequently, measurements were performed on the corresponding reduced feature sets. The DooDB database contains special coordinate values in the case of sampling errors. To correct these errors, the coordinate values from the previous sample were assigned to the affected sample.

The same evaluation protocol was followed as presented in Section 5.1. We present the results obtained by the two verification methods on the DooDB database (Tables 9 and 10), followed by the comparison of the results obtained on the two databases (Table 11).

Comparing the skilled forgery results by the DTW method (Tables 4 and 9), we observe that using more samples for reference signatures decreases both global and user average EERs. Among the coordinate, first-order difference, and second-order difference features, the first-order differences provide the lowest aEER for both databases (7.34% for MOBISIG and 16.67% for DooDB). However, combining all three feature types, the aEERs become slightly better (6.27% for MOBISIG and 15.78% for DooDB). Kholmatov and Yanikoglu [30] also reported that the first-order differences had given the lowest error rates. The worst results were obtained by using the second-order differences alone, which means that these features contain little user-specific information.

The effect of using a reduced feature set is shown in Table 12. Omitting the histogram-based features resulted in a very small performance degradation (about 1%).

The ROC curves for some of the verification methods are shown in Figure 6.

6.3. Score Distributions

The score distributions of genuine and impostor samples (skilled forgeries case, 15 training samples) for both verification methods are shown in Figure 7.

6.4. Signature Quality

Alonso-Fernandez et al. [31] reviewed the state of the art in the biometric quality problem. They consider that a biometric sample is of good quality if it is suitable for personal recognition. According to the ISO/IEC 29794-1 standard, biometric quality has three dimensions: fidelity, utility, and character. Fidelity is the degree of similarity between the sample and its source. Utility is related to the impact of the sample on the overall performance of a biometric system, while character indicates the inherent discriminative capability of the source.

The influence of the character of a biometric sample on its utility was investigated by Müller and Henniger [15]. In the case of signatures, recognition systems give different results depending on the reference samples (the sample set used for template creation). The selection of samples used in the template can be controlled by sample quality assessment algorithms. Evaluating the quality of a single sample based only on its own features is difficult, but methods exist that evaluate it relative to other samples. Some a posteriori methods use only genuine samples; others also use skilled forgery samples. A method that relies only on genuine samples restricts the selection of reference samples to the genuine user and therefore remains usable in real applications. Consequently, we used two metrics proposed by Müller and Henniger [15] that examine the character of the sample.

The first metric consists of computing the EER for each genuine sample. This is obtained by computing the distances from that sample to all the other genuine samples and to the forgery samples of the same user. We formed positive and negative score lists from the obtained distances (scores) and computed the EER. We computed DTW distances between signatures using type (i) local features.
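A minimal sketch of this per-sample EER follows. The names `per_sample_eer` and `dist` are hypothetical; `dist` stands in for the DTW distance on type (i) local features, and the EER approximation (threshold sweep over observed distances) is an assumption rather than the exact procedure used for Table 13.

```python
import numpy as np

def eer(positive, negative):
    """Approximate EER for distances (lower distance = closer to the reference)."""
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for thr in np.unique(np.concatenate([positive, negative])):
        frr = np.mean(positive > thr)    # genuine distances above the threshold
        far = np.mean(negative <= thr)   # forgery distances at or below it
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return float(best_eer)

def per_sample_eer(genuine, forgeries, dist):
    """Return one EER per genuine sample of a single user.

    Each genuine sample in turn acts as the reference: distances to the other
    genuine samples form the positive list, distances to the forgeries form
    the negative list.
    """
    values = []
    for k, ref in enumerate(genuine):
        positive = [dist(ref, g) for j, g in enumerate(genuine) if j != k]
        negative = [dist(ref, f) for f in forgeries]
        values.append(eer(positive, negative))
    return values

# Toy usage with scalar "signatures" and absolute difference as the distance.
toy = per_sample_eer([0.0, 1.0, 9.0], [5.0], lambda a, b: abs(a - b))
```

In the toy call, the third genuine sample lies farther from its peers than the forgery does, so its EER is 1.0, flagging it as a poor reference candidate.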

The second metric was computed as the average of the pairwise distances (DTW distance, type (i) local features) of the genuine samples of each user (with N genuine samples, this requires N(N − 1)/2 distances). The lower the value, the more similar the samples of the user are, and hence the more stable the user’s signature is. For both databases, we used the first genuine samples for computations.
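The second metric can be sketched as below. The DTW shown is the textbook dynamic-programming formulation with a Euclidean local cost and no length normalization; the exact DTW variant used in our experiments is not restated in this section, so this choice, like the function names, should be read as an assumption.

```python
from itertools import combinations

import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two sequences of shape (T, d) or (T,)."""
    a, b = (np.asarray(s, dtype=float) for s in (a, b))
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean local cost
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def mean_pairwise_distance(samples, dist=dtw_distance):
    """Average distance over all N(N-1)/2 unordered pairs of genuine samples."""
    pairs = list(combinations(samples, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Three toy one-dimensional "signatures" of one user.
stability = mean_pairwise_distance([[0, 0], [1, 1], [0, 0]])
```

Since the function needs only genuine samples, it could be evaluated incrementally while a user is still enrolling.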

After applying the two metrics for each genuine sample or each user, we obtained two sequences of values. In order to characterize the dataset, both the mean and the standard deviation of each sequence were computed. The obtained results are shown in Table 13.

Of the two quality metrics proposed by Müller and Henniger, we favor the second one because it does not rely on forgery samples; therefore, it could be used during data collection. Samples that are not closely similar to those already collected should be discarded.

7. Conclusion

In this paper, the MOBISIG finger-drawn online signature dataset has been presented. The dataset comprises pseudosignatures from 83 users, including both genuine and forgery samples. Benchmark verification experiments have been performed using both function-based and feature-based methods. Two types of EERs are reported: one is based on a global threshold, while the other is based on a posteriori user-specific thresholds. The global threshold-based EER results are not outstanding, although further improvement may be obtained by score normalization techniques. Good results were obtained using a posteriori user-specific thresholds (aEER). The lowest aEER was 5.81% for the skilled forgery case and 0.01% for the random forgery case. Both results were obtained by the function-based DTW method. Although feature-based methods offer poor results in the case of global threshold, they are significantly better than using function-based methods (skilled forgery case). For the feature-based methods, the lowest aEERs were 9.35% (skilled forgery) and 2.10% (random forgery), both obtained by the Manhattan outlier detector.

The second objective of this paper was to compare our new dataset with similar publicly available ones. The only publicly available dataset collected on mobile devices and containing finger-drawn signatures was the DooDB. Therefore, we have presented a comparison of our MOBISIG and the DooDB dataset along (i) their statistical properties, (ii) the verification system performances using exactly the same methods and features, and (iii) some signature quality metrics. Signature quality measures indicate slightly better values in the case of the MOBISIG dataset. This difference is also reflected in the better results obtained for the MOBISIG dataset than for the DooDB in all evaluations in this study. These results might be explained by the larger touchscreen used for data acquisition in the case of the MOBISIG dataset and by the average duration of individual signatures in the MOBISIG dataset, which is twice as long as in the DooDB dataset.

Future work may include the introduction of new signature quality metrics for online signatures, as well as their evaluation on several datasets.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

The research has been supported by Sapientia Foundation–Institute for Scientific Research.