Abstract

Contactless authentication is crucial for maintaining social distance and preventing the transmission of pathogens. However, existing authentication approaches, such as fingerprint and face recognition, rely on sensors that verify static biometric features. They either increase the probability of indirect infection or inconvenience users who wear masks. To tackle these problems, we propose a contactless behavioral biometric authentication mechanism that makes use of heterogeneous sensors. We conduct a preliminary study to demonstrate the feasibility of finger snapping as a natural biometric behavior. A prototype system, SnapUnlock, was designed and implemented for further real-world evaluation in various environments. SnapUnlock adopts the principle of contrastive representation learning to effectively encode the features of heterogeneous sensor readings. With the learned representations, a classifier trained on enrolled samples can achieve superior performance. We extensively evaluate SnapUnlock with 50 participants in different experimental settings. The results show that SnapUnlock achieves a 4.2% average false reject rate and a 0.73% average false accept rate. Even in a noisy environment, our system achieves similar results.

1. Introduction

Passwords, fingerprints, and face recognition have proved their commercial success in the field of user authentication. However, even though these techniques are widely deployed in commercial terminals, they still have limitations in daily use. For example, passwords and PINs involve a fundamental trade-off: short passwords are insecure (e.g., vulnerable to smudge attacks [1] and shoulder surfing [2]), while long passwords are cumbersome and hard to remember. Biometric authentication methods, such as fingerprint and facial recognition, are more secure and user-friendly choices. Nevertheless, fingerprint authentication devices, such as digital door locks, are usually shared by more than one person, and having many people verify their fingerprints on the same panel raises a health and safety issue. Furthermore, the skin condition of the fingers also makes authentication challenging: for example, moist fingers after hand washing lead to a high false reject rate. As for face recognition, wearing a mask is an effective measure to reduce the risk of coronavirus infection during the COVID-19 epidemic, but it makes face recognition difficult.

Recently, behavioral biometric authentication has become a popular research topic due to its inimitability. For example, gait-based approaches [3, 4] use wireless signals or cameras to capture motion posture for authentication. Brain waves [5–8] provide an even more trusted mechanism since they are controlled by each human’s unique brain. Beyond these, we raise the question: can we devise a new contactless behavioral biometric scheme that balances security and convenience?

To address the above question, this paper proposes SnapUnlock, a novel biometric user authentication scheme based on finger snapping. SnapUnlock leverages a smartwatch to capture the sound and hand motion while the user snaps her/his fingers. The smartwatch then acts as a transmitter: it transmits the detected finger-snapping event to a cloud server for authentication. Based on the authentication result, the target device decides whether to unlock (see Figure 1). In principle, SnapUnlock extracts unique features from the sound and motion generated by a user’s finger-snapping gesture. The key idea is based on the observation that the sound and vibration patterns produced by finger snapping can serve as a unique signature for a person. To evaluate the feasibility of such a biometric feature, we first conduct a preliminary study to verify its uniqueness across individuals. After establishing feasibility, we leverage heterogeneous sensors (microphone and IMU) to capture experimental data. Their wide deployment on smartwatches makes our system easy to implement and ubiquitous. In addition, heterogeneous sensors compensate for each other’s weaknesses and enhance the noise tolerance of the system. To process the unaligned readings from the sensors, we design a fusion and normalization approach in which the raw readings are properly preprocessed before subsequent operations. Afterward, we design a contrastive pretraining workflow for the signature extractor and a supervised learning phase for authentication.

SnapUnlock brings the following advantages. (1) Ubiquity: it relies on a microphone and accelerometer, which are readily integrated into smartwatches. (2) Usability: finger snapping has been shown to be a socially acceptable and user-friendly behavior in previous research [9]. (3) Security: compared with relying on a single voiceprint feature, the combination of motion and sound is more resilient to replay attacks. The smartwatch serves only as a collection and transmission device; it does not store the user’s authentication information. Therefore, even if the device is stolen, the attacker cannot pass the authentication.

We implement the SnapUnlock prototype system on the Android platform and collected finger-snapping samples from volunteers over one and a half months. The evaluation shows that SnapUnlock achieves a 4.2% average false reject rate and a 0.73% average false accept rate. In summary, this paper makes the following contributions:

(i) We propose a contactless biometric authentication system that leverages the sound and motion produced by finger snapping. We demonstrate the diversity of finger-snapping signals captured by heterogeneous sensors across individuals and their consistency for the same individual. The measurements serve as an empirical feasibility study of finger snapping as a new biometric modality for authentication and access control.

(ii) We design and implement an authentication system that fuses readings from heterogeneous sensors, extracts acoustic and motion features, and accurately classifies them. We also implement a model adaptation scheme to stabilize performance over time.

(iii) We build a prototype of SnapUnlock on the Android platform. Over one and a half months, we collected finger-snapping samples from the recruited users. Extensive experiments were conducted to evaluate the performance in various situations. The results show that SnapUnlock is robust across different environments.

The rest of this paper is organized as follows. In Preliminary Study, we conduct a preliminary study to demonstrate that it is feasible to use finger snapping as an authentication feature. In System Design, we introduce the detailed design of our SnapUnlock system. In Evaluation, we describe the experiment of data collection and evaluate the proposed system from different perspectives. Related Work and Discussion provide the related work and discuss the limitations and future work of our system. Finally, we summarize this paper in Conclusion.

2. Preliminary Study

2.1. Physiological Mechanism of Finger Snap

A finger snap is the act of creating a clicking sound with one’s fingers. While snapping, we slide one finger against another with force; the middle finger then gains momentum and strikes the palm surface. The collision and friction produce a sound and an arm vibration. Owing to the diversity of hand geometry biometrics (e.g., area/size of the palm, length and width of the fingers), this effect varies across individuals. Prior works have employed such static features for authentication [10, 11]. Moreover, the sound raised by sliding finger surfaces against one another is called the fingerprint-induced sonic effect [12]. Rathore et al. [13] proved that this effect is reliable for authentication.

2.2. Data Collection

For data collection, we developed an application for the Android Wear OS 2.0 operating system. As shown in Figure 2, one or more sensors can be selected for simultaneous recording on the “select sensor” page (Figure 2(b)) of the application. The user then returns to the main screen (Figure 2(a)) and clicks “record” to start recording. In this data collection experiment, we select the built-in microphone and accelerometer to measure the sound and vibration. The device we used is a Huawei Watch 2, which contains a quad-core CPU and on-board RAM. The microphone is sampled at a rate in the kHz range, and the IMU at 200 Hz.

We recruited 25 volunteers for data collection in an indoor environment. Among them, 8 were female and 17 male, with ages ranging from 19 to 35. The whole data collection took about one month and a half. Before the experiment, we explained the purpose and usage of SnapUnlock to the volunteers. Each volunteer was then asked to practice the finger snap until he/she became skilled. In the data collection phase, each volunteer wore a smartwatch while snapping their fingers. We did not limit the volunteers’ posture and only asked them to behave as naturally as they do in daily life. Each volunteer was asked to provide 25 samples under a fixed ambient noise level in each session, and the experiment was conducted twice each day. Moreover, to analyze whether time duration affects the result, the intervals between data collection days were varied. We label the experiment on the first day as Session 0 (S0). The volunteers participated again after 1 day (S1, S2), 3 days (S3, S4), 7 days (one week) (S5, S6), 30 days (one month) (S7, S8), and 45 days (a month and a half) (S9, S10), respectively. Note that each pair of sessions was conducted in the morning and afternoon of the same day (e.g., S1 and S2 in the morning and afternoon of day 1; S3 and S4 in the morning and afternoon of day 3). In summary, we collected 25 samples × 11 sessions × 25 volunteers = 6,875 samples in total.

2.3. Data Analysis

In this section, we present a data analysis to validate the following assumptions: (i) for the same person, the finger-snapping signature does not vary much over the long term; (ii) for different people, finger-snapping signatures show different signal patterns.

Based on the above assumptions, we conduct intrauser and interuser analyses.

2.3.1. Intrauser Analysis

In this section, we analyze the intrauser time variation of the finger snap. Two volunteers (denoted A and B) were randomly selected from the collected dataset as analysis targets. We use the data in S0 as the target and calculate the average PSD (Power Spectral Density) across the other 10 sessions. In each session, 25 instances were randomly selected and compared with the instances in S0. The results for volunteer A are shown in Figures 3(a)–3(d), and the results for volunteer B in Figures 3(e)–3(h).

Figures 3(a) and 3(e) show the average PSD of the sound signals. We observe that people retain consistent finger-snapping sound patterns even after a month has passed. Figures 3(b)–3(d) and 3(f)–3(h) show the average PSD of the accelerometer readings on the three axes. The patterns on two of the axes remain highly consistent, while the third axis shows a decrease.

To further quantify the similarity of the PSD curves, we calculate the correlation coefficients for volunteer A in each session; the results are shown in Figure 4. The average correlation coefficients of the sound signal are high across all sessions, and although the accelerometer’s coefficients are relatively lower, they remain consistently well correlated. In summary, this result validates our assumption that, for the same person, the finger-snapping signature does not vary significantly over a long term.

2.3.2. Interuser Analysis

After the intrauser analysis, we calculate the correlation coefficients of the PSD curves between volunteers A and B across sessions to examine their similarity. As shown in Figure 5, the PSD correlation coefficients of sound between volunteers A and B are below 0.5, lower than those in Figure 4(a). In addition, the accelerometer data on the three axes (Figures 5(a)–5(d)) show similar results. In summary, this noticeable difference validates our assumption that the finger snap shows different signal patterns for different people.

3. System Design

The system architecture of SnapUnlock is shown in Figure 6. There are two major parts: user enrollment and user authentication. During the user enrollment phase, the user provides a number of finger-snapping samples. A pretrained LIMU-BERT model is applied to extract joint features from the acoustic and accelerometer data. Finally, the joint features are fed into a fully connected layer for prediction. We follow a non-end-to-end training strategy: in the pretraining phase, SnapUnlock uses the samples of a designated set of participants in the dataset as the pretraining set. During user authentication, we keep the signature extractor parameters fixed and only train the classifier with the target user’s reference samples.

3.1. Finger-Snapping Event Detection

SnapUnlock targets real-time detection of the finger-snapping event even in noisy environments. At the same time, event detection should be lightweight enough to run on wearable devices, which have limited computation capability and battery capacity. However, it is not straightforward to detect events correctly with a conventional fixed-threshold method in complex environments. In this paper, we use the Constant False Alarm Rate (CFAR) [14] method to detect the start of each event. Essentially, CFAR is an energy-based adaptive thresholding method that adapts the threshold value according to the level of external interference. We apply the detection algorithm to the IMU readings rather than the microphone, because the sample rate of the IMU (200 Hz) is much lower than that of the microphone, and noise barely affects the IMU. Assuming that the environmental noise follows a Gaussian distribution, we first use a sliding window to estimate the average energy of the noise. Let μ_t and σ_t denote the average energy and its standard deviation at time t, respectively:

μ_t = E_t / t,  σ_t = √(S_t / t),

where E_t is the accumulated energy and S_t is the accumulated squared deviation of the signal power within a slide. Initially, E_0 = 0 and S_0 = 0. E_t and S_t are updated as

E_t = E_{t−1} + x_t²,  S_t = S_{t−1} + (x_t² − μ_t)²,

where x_t is the raw IMU reading. Based on the above equations, a potential start point is declared when x_t meets the following condition:

x_t² > μ_t + λσ_t,

where λ is a constant independent of the noise level. From observation, we empirically set the window width and the constant λ to fixed values. A more permissive threshold may produce more queried samples for authentication, but it also means that a finger-snapping event has a higher probability of being captured as a query sample. Because the classifier can distinguish invalid samples, a permissive threshold does not harm authentication performance and improves the user experience. After detecting a start point in the IMU readings, we project it to the corresponding position in the sound readings according to the sample-rate ratio.
Because the acoustic and motion signals are collected simultaneously, their starting points should be well synchronized. Figure 7 shows some acoustic samples detected under different noise levels. We observe that the duration of each event is around 0.42 seconds. Therefore, we cut the readings with a fixed window of size L × D after the start point, where L is the number of readings in each window and D is the dimension of the sensor. In our work, D is 1 for the microphone and D is 3 for the accelerometer, and L for each sensor is chosen according to its sample rate so that both windows cover the same 0.42-second event, ensuring the alignment of each sample. By applying this algorithm, SnapUnlock can extract the target event at different noise levels.
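As a rough illustration of the energy-based adaptive thresholding described above, the following sketch flags a sample as a candidate start point when its power exceeds the mean noise power in a preceding sliding window by λ standard deviations. The window width and λ here are illustrative assumptions, not the paper’s tuned values:

```python
import numpy as np

def detect_event_starts(samples, window=100, lam=3.0):
    """CFAR-style adaptive thresholding: a sample is a candidate start
    point when its instantaneous power exceeds the mean noise power of
    the preceding window by lam standard deviations."""
    power = np.asarray(samples, dtype=float) ** 2
    starts = []
    for t in range(window, len(power)):
        noise = power[t - window:t]          # sliding noise-estimation window
        mu, sigma = noise.mean(), noise.std()
        if power[t] > mu + lam * sigma:
            starts.append(t)
    return starts
```

After detection on the low-rate IMU stream, the detected index would be projected onto the audio stream by multiplying it by the sample-rate ratio of the two sensors.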

3.2. Background Noise Removal and Normalization

Background noise and the distribution differences among heterogeneous sensors are critical factors that impact model performance. To address this, we first use a Butterworth band-pass filter to remove background noise, since we find that most of the snapping energy is concentrated in a limited frequency band (Figures 3(a) and 3(e)). We keep the IMU readings unchanged, since background noise does not affect the IMU. Second, SnapUnlock must handle multiple sensor modalities, but the readings of the IMU and the microphone have very different amplitude ranges. Such differences would affect model performance in the training phase: features with a large value range dominate the descending gradient, while features with a small range may be neglected. This leads to an irregular contour of the loss function and thus slows convergence during training. Therefore, the sensor readings need to be properly normalized before training. We adopt min–max scaling for normalization:

x̂ = (x − min(x)) / (max(x) − min(x)).

By applying background noise removal and normalization, the neural network can learn its parameters better.
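The min–max scaling step can be sketched per sensor channel as follows; in a full pipeline, the microphone stream would first pass through the Butterworth band-pass filter described above (e.g., via a standard signal-processing library), which is omitted here:

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D sensor stream to [0, 1]:
    x' = (x - min(x)) / (max(x) - min(x))."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                 # constant signal: avoid division by zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)
```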

3.3. Signature Extractor

We expect our system to fuse the data streams from heterogeneous sensors and model the signature dependencies with a general feature representation. This representation should capture the signature of unseen users (none of whose data is included in the extractor training phase). In other words, the features modeled by the signature extractor should relate only to the signature used for authentication. To this end, we design a contrastive learning flow, which is an effective way to model raw inputs as representations [15]. Figure 8 shows the contrastive learning phase.

Specifically, in this phase, a pretraining model learns representations by pulling positive pairs together and pushing negative pairs apart, where the anchor instance x is drawn from the raw data in the pretraining set, the positive instance x⁺ is randomly sampled from the same participant’s reference samples, and the negative instance x⁻ is randomly sampled from other participants in the same training batch. In other words, the goal is to bring semantically related instances (positive pairs) closer and to separate unrelated ones (negative pairs). Figure 9 shows the detail of the contrastive learning phase. Let h, h⁺, and h⁻ denote the representations of x, x⁺, and x⁻, respectively; they are outputs of the same signature extractor. We use LIMU-BERT [16] as the signature extractor and a multilayer perceptron (MLP) as the final layer for prediction. A key operation is that we place independently sampled dropout masks on the attention probabilities (with the default dropout rate) and the MLP when encoding the positive and negative pairs. The MLP predicts true or false according to the pair type. We use InfoNCE [17] as the loss function:

L_i = −log( exp(sim(h_i, h_i⁺) / τ) / Σ_{j=1}^{N} exp(sim(h_i, h_j) / τ) ),

where N is the minibatch size, τ is a temperature hyperparameter, and sim(u, v) is the cosine similarity u·v / (‖u‖‖v‖).
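A minimal numpy sketch of an in-batch InfoNCE objective of this kind is shown below, assuming each anchor’s positive sits at the same batch index and the other batch entries serve as its negatives (the temperature value is an illustrative assumption):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: u.v / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def info_nce_loss(anchors, positives, tau=0.1):
    """In-batch InfoNCE: for each anchor, its own positive is the target
    and the other positives in the batch act as negatives."""
    n = len(anchors)
    losses = []
    for i in range(n):
        logits = np.array([cosine_sim(anchors[i], positives[j]) / tau
                           for j in range(n)])
        # log-probability of the correct (aligned) pair
        log_prob = logits[i] - np.log(np.exp(logits).sum())
        losses.append(-log_prob)
    return float(np.mean(losses))
```

When anchors and positives are aligned representations of the same participant, the loss is small; misaligned pairs drive it up, which is what pulls positives together and pushes negatives apart.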

Before pretraining, we apply MFCC (Mel-Frequency Cepstral Coefficients) to the sound readings. As shown in Figures 3(a) and 3(e), most of the snapping sound energy is concentrated in the audible range. We chose MFCC because it simulates the human auditory system by the spacing of its frequency bands, and it makes the signal less time-sensitive and easier to extract valuable information from. We compute the MFCC of each segment with a Hamming window of fixed size, with a fixed number of overlapping samples between adjacent segments. We concatenate the preprocessed sound and IMU readings into one vector for pretraining. The output dimension of the extractor is empirically set to 32 according to the experiment in Impact of Representation Dimensionality. After the contrastive pretraining phase, the trained extractor is able to extract authentication-related representations. We use the trained extractor as the initial parameters in the supervised learning phase, connected to a new classifier for each new user.
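The segmentation step that precedes the MFCC computation can be sketched as below: the sound reading is split into overlapping frames and each frame is multiplied by a Hamming window. The frame length and hop size here are arbitrary placeholders, and the MFCCs themselves would typically be computed by an audio library (e.g., librosa) on each frame:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames and apply a Hamming
    window to each, as done before computing per-frame MFCCs.
    Overlap between adjacent frames is frame_len - hop samples."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * win
                     for i in range(n_frames)])
```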

3.4. Classifier

As shown in Figure 8, we adopt another MLP as the final classifier in the supervised learning and testing phases for each new user. In the supervised learning phase, we train the MLP with the reference data, while the signature extractor is taken from the contrastive pretraining phase. The function of the MLP is quite simple: it learns the boundary of the reference representations. A questioned sample is projected into a high-dimensional space by the extractor, and the MLP decides whether it is genuine according to whether its representation falls within that boundary.

3.5. Model Adaptation

As the human body changes with age, the shape of the palm changes gradually. The evaluation results in Effect of Model Adaptation also support this assumption. Therefore, we design a model adaptation algorithm to update the model periodically. Considering the uncertainty of hand shape changes for each individual at different moments of growth, we design a confidence ranking algorithm that selects legitimate samples with high confidence at the end of each authentication. The algorithm is listed in Algorithm 1. Selected high-confidence legitimate samples are added to the training pool, and the model is retrained at regular intervals. This eliminates the need for users to re-enroll and automatically adds up-to-date legitimate samples to the model.

Input: recently accepted query samples Q; confidence threshold θ
Output: high-confidence sample set HCS
1 HCS ← ∅;
  // collect the classifier confidence of each accepted sample
2 for i ← 1 to |Q| do
3   c_i ← confidence(Q_i);
4 end
5 sort Q by c_i in descending order;
  // keep only samples whose confidence exceeds the threshold
6 for i ← 1 to |Q| do
7   if c_i > θ then
8     HCS ← HCS ∪ {Q_i};
9   end
10 end
11 return HCS
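The confidence-ranking step of the adaptation scheme can be sketched as a runnable function; the threshold θ and the cap on kept samples below are illustrative assumptions rather than the paper’s tuned values:

```python
def select_high_confidence(samples, confidences, theta=0.95, k=10):
    """Rank accepted samples by classifier confidence and keep at most k
    samples whose confidence exceeds theta, for later retraining."""
    ranked = sorted(zip(confidences, samples),
                    key=lambda p: p[0], reverse=True)
    hcs = [s for c, s in ranked if c > theta]
    return hcs[:k]
```

The returned samples would be appended to the training pool, and the per-user classifier retrained at regular intervals.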

4. Evaluation

This section discusses the performance results of SnapUnlock from different perspectives.

4.1. Experiment Setup

We implement the SnapUnlock prototype on the Android platform (Huawei Watch 2) and a desktop. The overall authentication model is trained off-line on a desktop with a four-core Intel® Xeon® E3-1231 CPU, 16 GB RAM, and an RTX 2070 Super GPU, running Windows 10 with Matlab R2020a. We conduct experiments on the real-world dataset described in Data Collection. A biometric security system faces two failure modes: accepting an unauthorized user and incorrectly rejecting an authorized one. Therefore, we use FAR and FRR as metrics to quantify the performance of SnapUnlock.

(i) FAR is short for false accept rate. It is the likelihood that the authentication system incorrectly accepts an unauthorized user; an authentication system with a lower FAR is more secure.

(ii) FRR, short for false reject rate, is the ratio of incorrectly rejected attempts by legitimate users to all attempts by legitimate users. It depicts how convenient an authentication system is: a higher FRR means legitimate users need more trials to gain access.
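Given per-attempt match scores and a decision threshold, the two metrics can be computed as in this sketch (score convention assumed: an attempt is accepted when its score reaches the threshold):

```python
def far_frr(scores_genuine, scores_impostor, threshold):
    """FAR: fraction of impostor attempts accepted (score >= threshold).
       FRR: fraction of genuine attempts rejected (score < threshold)."""
    far = sum(s >= threshold for s in scores_impostor) / len(scores_impostor)
    frr = sum(s < threshold for s in scores_genuine) / len(scores_genuine)
    return far, frr
```

Sweeping the threshold trades FAR against FRR: raising it lowers FAR (more secure) while raising FRR (less convenient).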

4.2. Overall Performance

Since SnapUnlock leverages heterogeneous sensors, it is necessary to evaluate the overall performance across different sensor configurations. Figure 10 illustrates the performance of SnapUnlock trained and tested with single sensors and with the sensor combination. 30 participants are involved in the pretraining phase, and 20 participants in the training and testing phases. For evaluation, we randomly divide the samples of each participant into training and test sets. Each positive sample is paired with a randomly sampled negative from other participants in pretraining and training. We ensure that all participants in the training and testing phases are unseen during pretraining. According to the results, SnapUnlock with heterogeneous sensors outperforms the single-sensor configurations, with the largest margin of 5.4% between the heterogeneous and motion-only cases, demonstrating the effectiveness of heterogeneous sensors. The microphone-only and accelerometer-only configurations each yield higher FAR and FRR than the combined configuration. We suspect that the heterogeneous readings contain more information related to the snapping behavior than a single modality.

In summary, the improvement from heterogeneous sensors is significant: the authentication system gains not only from the acoustic sensor but also from the motion sensor.

4.3. Impact of Reference Number

A reference is a sample registered as a template in the authentication system before a new user first uses it. Growth in the number of reference samples improves accuracy but degrades the enrollment experience. We keep the pretraining setting the same and vary the reference number in training from 3 to 30. Figure 11 depicts the resulting FAR and FRR. As clearly shown, SnapUnlock achieves higher accuracy as the reference number increases: the FRR decreases rapidly, while the FAR changes only slightly. We also observe that the slopes of FAR and FRR start to flatten when the reference number increases from 23 to 30. Therefore, we suggest setting the reference number higher than 23 in practice.

4.4. Impact of Participant Number

The number of participants influences the diversity of the pretraining set. As diversity grows, the pretrained signature extractor is able to extract more appropriate representations for authentication. To investigate the impact of the participant number on representation learning, we keep the other pretraining settings the same as in the previous section and vary the number of pretraining participants from 1 to 40. As the participant number increases, we decrease the number of training pairs per participant so that the total size of the pretraining set stays constant. Figure 12 depicts the resulting FAR and FRR. With more participants, both the FRR and the FAR decrease rapidly, stabilizing once the participant number reaches 30. Overall, the results show that signature learning benefits significantly from participant diversity, and SnapUnlock needs about 30 participants for pretraining.

4.5. Impact of Training Pair Number

We have demonstrated that increasing the diversity of the pretraining set improves performance. In this section, we examine how the number of training pairs influences our model when diversity is held constant in the pretraining phase. We fix the participant number at 30 and vary the training pair ratio of each participant. Figure 13 depicts the resulting FAR and FRR. The results show that increasing the number of training pairs lowers the FRR, while the FAR decreases only slightly. Combined with the results in Impact of Participant Number, it is preferable to increase data diversity rather than enlarge the pretraining set of each participant.

4.6. Effect of Model Adaptation

We evaluate the effect of model adaptation over time. As mentioned in Data Collection, we collected data across varying day intervals. In this experiment, we train the classifier with the data collected on the first day (S0). The test data are split as S1, S2 (after 1 day), S3, S4 (after 3 days), S5, S6 (after 1 week), S7, S8 (after 30 days), and S9, S10 (after a month and a half). The results are shown in Figure 14. Without model adaptation, the FRR increases as the days pass, while the FAR stays within a narrow range. We assume that the human skeleton grows or shrinks slightly over time, so SnapUnlock tends to distrust long-term samples. With model adaptation, SnapUnlock gains the ability to learn the variance of the snapping behavior, which greatly stabilizes its performance.

4.7. Impact of Noise

This section analyzes how SnapUnlock performs under different noise levels. Similar to Overall Performance, we keep the number of participants in each phase the same, pretrain and train the model with data collected at the baseline noise level, and test at three noise levels. The results are illustrated in Figure 15. Compared to the model trained with microphone data only, the model trained with heterogeneous data performs better as the noise level increases: even at the highest noise level, its performance degrades only slightly, whereas the microphone-only model degrades by a much larger margin in both FAR and FRR. Overall, SnapUnlock reinforces its resistance to environmental noise with the assistance of the accelerometer.

4.8. Impact of Hand Surface State

In this section, we evaluate the impact of the hand surface state. Moisture or dryness of the skin changes the friction when the fingers snap, leading to changes in the acoustic signal. To evaluate this, we compare the performance for wet and dry hand surface states. Each participant’s authentication model is trained with the standard setting, but the testing dataset is collected under different hand surface states (wet, dry, and normal). In the wet condition, we collect data right after the participants washed their hands. Then, we let the participants dry their hands with a hand dryer, ensuring no other substances remained on their hands; the data collected after this phase are marked as dry. The normal data are randomly selected from the dataset described in Data Collection. Figure 16 shows the results. There is no significant difference between the normal and dry states. However, when the hand surface is wet, the performance decreases. One possible reason is that the acoustic feature’s weight in the joint feature is higher than the motion feature’s. Therefore, we recommend that users do not use the SnapUnlock system with wet hands.

4.9. Impact of Representation Dimensionality

In this section, we investigate how the representation dimensionality affects model performance. Figure 17 illustrates the FAR and FRR of the feature extractor as the representation dimensionality varies while the other optimal hyperparameters are unchanged. The best performance appears at dimension 32, and the performance decreases when the dimension grows beyond 32. The result demonstrates that a large representation dimensionality does not benefit the model. Since the best result is achieved with dimension 32, we use 32 as the output dimension.

4.10. User Experience Study

In addition to validating the effectiveness of SnapUnlock, we also assess the user experience. We conducted a user experience survey with 50 participants, all of whom had previously used passwords and fingerprints as authentication schemes. Out of the 50 participants, 25 were involved in the previous experiments; the remaining 25 used SnapUnlock for the first time. Guided by the tutorial, all 25 new users mastered the finger-snapping technique within an hour. We first informed all participants of the aim of the study and showed them how to use SnapUnlock. They were asked to install the software on a smartwatch and use SnapUnlock for authentication when unlocking a personal computer. After the trial, we distributed a questionnaire to all 50 participants and collected feedback from each. The questionnaire consists of the following questions:

(i) Jointly considering accuracy, robustness, and usability, please rate an overall score (from 0 to 5) for the four unlocking schemes (SnapUnlock, password, fingerprint, and faceID) (0 means worst; 5 means best).

(ii) Are you willing to use the four authentication schemes (SnapUnlock, password, fingerprint, and faceID) daily? Please rate a score from 0 to 5 (0 means I never want to use it daily; 5 means I would certainly use it in public).

(iii) How difficult do you think learning the finger snap is? Please rate a score from 0 to 5 (0 means easy; 5 means difficult).

Figure 18 shows the results, including the average overall scores, willingness ratings, and difficulty ratings for SnapUnlock, password, fingerprint, and faceID. The overall scores and willingness ratings show that SnapUnlock is more acceptable than the other methods. As for difficulty, even though our method requires a period of practice for new users, it still rates slightly better than passwords; most participants attributed this to the need to remember complex password combinations for security.

5. Related Work

We categorize research related to SnapUnlock into three subsections: physiological biometric authentication, behavioral biometric authentication, and nonspeech body sound sensing.

5.1. Physiological Biometric Authentication

Physiological biometric features can be easily quantified into digital data by different types of sensors embedded in mobile or wearable devices. Prior works [18–21] explore the possibility of applying such features to authentication by proving their uniqueness. Fingerprints [18] can be easily captured by fingerprint sensors, so they are widely used in mobile devices. However, due to the shortcomings of fingerprint sensors, such as occupying screen space and failing when fingers are wet, fingerprint authentication has been abandoned by some mobile devices (e.g., iPhone [22]). Face recognition [19–21] is also one of the most popular authentication approaches. However, the camera and other related sensors cut the screen into a notch shape, bringing a negative user experience to customers [23]. On some special occasions (e.g., when wearing a protective mask), face recognition may fail to verify the legitimate user. Compared with physiological biometric approaches, SnapUnlock requires no additional sensor cost and achieves satisfactory performance.

5.2. Behavior Biometric Authentication

Behavioral biometric authentication has attracted increasing attention in recent years. Gesture-based [24–26], keystroke-based [27, 28], and gait-based [29–31] authentication are popular topics. These efforts exploit sensors including, but not limited to, accelerometers, keyboards, and touchscreens to extract unique features such as movement speed, rhythm, and other properties from particular behaviors. Besides, some studies [32–36] focus on exploring new behavioral biometric features. BreathPrint [32] records the sound of breath from the user's nose to verify legitimate users. BreathLive [33] extracts features from both the sound and the motion caused by deep breathing to realize a reliable authentication system. Wang et al. [34] press the phone against the user's chest and measure the heartbeat signal using the inertial accelerometer in smartphones to perform authentication. Bilock [35] innovatively leverages the sound of dental occlusion (i.e., tooth clicks) to authenticate the user. Brain waves [36] also show their potential in the authentication field. SnapUnlock introduces a new behavioral biometric mechanism, finger snapping, which is usable, resilient to replay attacks, and easily captured by the accelerometer and microphone in a commodity smartwatch, enabling a stable biometric authentication system.

5.3. Nonspeech Body Sound Sensing

Depending on their type, nonspeech body sounds can be utilized in fields such as health status monitoring and activity recognition. Prior works [37–41] try to extract useful information from these nonspeech body sounds for different purposes. Bodyscope [37] develops an acoustic-based wearable system that records sound by placing a custom Bluetooth headset near the user's throat. This system can classify different types of nonspeech body sounds, such as eating, drinking, breathing, speaking, laughing, and coughing. Bodybeat [38] is a mobile sensing system that can capture a diverse range of nonspeech body sounds to recognize physiological reactions. Similar to Bodyscope [37], it places a custom piezoelectric microphone near the user's throat, together with a body sound classification algorithm that distinguishes different sounds of human behavior. SymDetector [39] utilizes the built-in microphone of a smartphone to continuously monitor and detect four types of respiratory sounds (i.e., sneeze, cough, sniffle, and throat clearing). SleepHunter [41] and iSleep [40] both monitor sleep quality using the microphone of an off-the-shelf smartphone. All of these works aim to use nonspeech sounds in the healthcare field, whereas SnapUnlock uses a nonspeech sound (finger snapping) for user authentication on smart devices due to its unique properties.

6. Discussion

In this section, we mainly discuss the limitations of SnapUnlock. An inherent limitation of our system is that users need to learn to snap their fingers. After observing those who did not produce sound due to incorrect posture, we found that they could still generate wrist vibration. Therefore, an alternative solution is to collect more data from those who are unfamiliar with finger snapping. In this way, the motion features gain a higher weight during the training phase, so they contribute more to the joint feature at decision time.
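The reweighting idea above can be sketched as a weighted fusion of the two modality embeddings before they form the joint feature. The function below is a minimal illustration under assumed names: `motion_emb`, `audio_emb`, the 128-dimensional embedding size, and the `motion_weight` knob are all hypothetical, not the paper's actual fusion scheme.

```python
import numpy as np

def fuse_features(motion_emb, audio_emb, motion_weight=0.7):
    """Weighted joint feature: scale each modality before concatenation.

    motion_weight is a hypothetical tuning knob; raising it lets wrist
    motion dominate the joint feature for users whose snaps are quiet.
    """
    w_m, w_a = motion_weight, 1.0 - motion_weight
    # L2-normalize each embedding so the weights control the relative
    # influence of the modalities on downstream distance comparisons.
    m = motion_emb / (np.linalg.norm(motion_emb) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return np.concatenate([w_m * m, w_a * a])

joint = fuse_features(np.random.randn(128), np.random.randn(128))
print(joint.shape)  # (256,)
```

With `motion_weight=1.0` the audio half of the joint feature vanishes entirely, which corresponds to the extreme case of authenticating a user who produces wrist vibration but no audible snap.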

7. Conclusion

In this paper, we present SnapUnlock, a touchless authentication approach that can unlock devices by leveraging finger-snapping gestures. We utilize the inherent correlation between the wrist motion and the sound caused by the finger-snapping action, and seamlessly integrate contrastive learning techniques into signature extractor learning to realize a reliable authentication system.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China Grant (No. 62172286) and Natural Science Foundation of Guangdong Province Grant (No. 2022A1515011509).