Abstract

Remote photoplethysmography (rPPG) can be used for noncontact and continuous measurement of the heart rate (HR). Currently, the main factors affecting the accuracy and robustness of rPPG-based HR measurement methods are the subject’s skin tone, body movement, exercise recovery, and variable or inadequate illumination. In response to these challenges, this study is aimed at investigating an rPPG-based HR measurement method that is effective under a wide range of conditions using only a webcam. We propose a new approach, which combines joint blind source separation (JBSS) with a projection process based on a skin reflection model, so as to eliminate the interference of background illumination and enhance the extraction of pulse rate information. Three datasets, derived from subjects with different skin tones and covering six environmental scenarios, are used to validate the proposed method against three other state-of-the-art methods. The results show that the proposed method provides more accurate and robust HR measurement on all three datasets and is therefore applicable to a wider range of scenarios.

1. Introduction

The heart rate (HR) is a widely used indicator of health status [1, 2]. At present, electrocardiography (ECG) and photoplethysmography (PPG) are well established as reference methods for HR measurement [3, 4], but they require the application of electrodes or transducers to the skin during the measurement, which may cause discomfort in long-term monitoring, and they do not lend themselves well to personal use.

In recent years, remote photoplethysmography (rPPG) has offered an alternative, less obtrusive, low-cost, and contactless approach for measuring HR with a camera. In 2008, Verkruysse et al. first introduced ambient-light photoplethysmography to implement noncontact HR measurement [5, 6]. In contrast to contact PPG, which uses a built-in light source in both reflection and transmission modes, rPPG measures the variation over time of ambient light reflected from the skin, making it necessary to distinguish between the specular and diffuse reflections of light from the skin surface. The diffuse component is the reflected light that remains after absorption and scattering in the skin, subdermal tissue, and blood; consequently, it varies with changes in blood volume and the movement of the blood vessel wall [7–9]. The reflected light can be recorded with a variety of optical sensors, including cameras, and the plethysmographic signal can be retrieved to infer the physiological information of interest. Thus, the rPPG approach to noncontact HR monitoring and telemedicine is receiving more and more attention, especially in recent months for its potential application in isolation units to reduce the risk of cross-infection.

However, many factors influence the accuracy of rPPG-based HR measurement. These include the subject’s skin tone [10], head movement [10], rapid changes during recovery from exercise [11], variation in ambient illumination [12], and, when imaging the head, facial expression changes associated with talking, for example, during human-machine interaction [13]. To address these issues, many approaches have been proposed; the principal categories are signal decomposition methods, model-based methods, and data-driven methods.

1.1. Signal Decomposition Methods

In early studies, blind source separation (BSS) based on independent component analysis (ICA) [8] or principal component analysis (PCA) [14] was widely used to extract pulse information from the rPPG signal. BSS methods usually assume that the pulse signal is independent of the other components in the mixed signal and can be extracted by analyzing the correlation among its different components. Recently, some new signal decomposition methods have been reported. Cheng et al. proposed a joint blind source separation and ensemble empirical mode decomposition (JBSS_EEMD) method, which was found to be a feasible way to suppress the effects of varying illumination by extracting the underlying common light sources [15]. Wei et al. used a second-order BSS method based on two regions of interest (ROIs) and obtained plausible estimates of both the heart and respiratory rates, although the ROIs were not detected automatically [16]. Macwan et al. presented a semiblind source method using a multiobjective optimization approach with autocorrelation as a periodicity measure, which focused on the subject’s skin tone and a human-machine interaction scenario in which subjects were engaged in a mathematical computer game [17]. Song et al. also proposed a new semiblind source approach using fast kernel density independent component analysis (KDICA), a nonparametric BSS method, with specific attention to different distances between the participant and the camera [18]. However, these signal decomposition methods only achieve high accuracy under a limited range of conditions, whereas the model-based methods outlined below are more robust across a more extensive variety of scenarios.

1.2. Model-Based Methods

At present, model-based methods are built on the physical principles that govern rPPG signal generation. There are two typical models: the skin reflection model [19] and the local invariance of HR model [20]. Methods based on the skin reflection model use the color vector information of the different components of the rPPG signal to facilitate the separation of specular and diffuse reflections. De Haan and Jeanne presented a chrominance-based signal processing method, which used the chrominance subspace to extract the pulse signal, although they mainly focused on the exercise recovery scenario [11]. Wang et al. proposed a plane-orthogonal-to-skin (POS) method based on a skin reflection model and considered subjects’ skin tone, recovery after exercise, head movement, and illumination by fluorescent lamps of different colors [19]. In a previous study from our laboratory, Qi et al. combined a skin reflection model with ICA to develop the Project_ICA method and considered the effects of skin tone, exercise recovery, head movement, and human-computer interaction [21]. The local invariance of HR model, developed by Pilz et al., converts rPPG signals to certain specific spaces and uses the local group invariance of HR to estimate its value. They have successively proposed the diffusion process method [13], the local group invariance method [20], and the spherical method [22] and investigated the effects of head movement and person-to-person interaction.

1.3. Data-Driven Methods

More recently, to address the limitations of conventional hand-crafted features or engineered models, some researchers have used data-driven methods to infer HR from collected signal datasets. For example, Monkaresi et al. proposed a machine learning-based k-nearest neighbor algorithm [23], using the rPPG signal features for regression analysis and considering both stationary and exercise scenarios. In addition, Chen et al. proposed a deep learning-based convolutional attention network, DeepPhys [24], and focused on the subject’s skin tone, head movement, and person-to-person communication scenarios. However, data-driven methods require a large amount of training data to feed the network, so they have some limitations in real applications.

To verify the robustness of rPPG-based HR measurement, most of the previous studies have focused on various common scenarios, e.g., the subject’s skin tone, exercise recovery, and head movement, but less attention has been paid to poor illumination and illumination variation, both of which can cause interference in facial rPPG signal acquisition. Therefore, we have addressed these two illumination problems by means of a novel rPPG-based HR measurement method using a webcam. The system also functions well under a variety of other conditions, including exercise recovery, head movement, and human-machine interaction, and is therefore applicable to more scenarios than existing methods.

The main contributions of this paper are twofold: (1) We combine a signal decomposition method with a projection process based on a skin reflection model. This approach can eliminate the common illumination component from the facial and background ROI signals and extract the signal that carries the pulse rate information. (2) After ICA processing, to determine which output signal carries the pulse information, we use the green channel signal in the facial ROI as a reference to identify the pulse signal. To validate the proposed method under a range of real-life conditions, we introduce three test datasets covering six different scenarios. The first dataset is from our previous study [21], the second is the Public Benchmark Dataset [25], and the third is a new dataset assembled for this study. We then compare the proposed method with three recently developed methods, using three metrics: Pearson correlation coefficient (r), Mean Absolute Deviation (MAD) (bpm), and Root Mean Square Error (RMSE) (bpm).

The remainder of the paper is structured as follows. Section 2 describes the proposed method and the datasets. The experimental findings are presented in Section 3. The results are analyzed and discussed in Section 4, and conclusions are drawn in Section 5.

2. Methods and Materials

The proposed method includes the following steps: (1) The facial video is recorded with a standard RGB camera, and the facial and background ROI signals are extracted from each video frame. (2) The JBSS method is used to remove the common illumination component from the facial and background ROI signals, yielding a new facial ROI signal. (3) The pulse signal is extracted from the new facial ROI signal using the projection method. (4) The green channel signal from the facial ROI is used to identify the pulse signal, and the HR is calculated using the fast Fourier transform (FFT). This method not only removes the illumination variation to enhance the extraction of the pulse information but also resolves the lack of ordering among the output components of ICA. The whole process is shown in Figure 1, and the specific details of the major steps are described in Sections 2.1 to 2.4.

2.1. Acquisition of Facial ROI and Background ROI Signals

In this step, the Viola-Jones face detector is used to detect a rectangular facial region in the first video frame, and the facial feature points in this region are then detected using the minimum eigenvalue algorithm [26]. Next, the Kanade-Lucas-Tomasi (KLT) tracking method [27] is utilized to produce the facial feature matching point pairs between the previous and current frames, and a rectangular facial region is generated in each frame, as Figure 2(a) shows. Then, the YCbCr [28] color space is used to detect the skin region, as shown in Figure 2(b). The skin region is treated as the facial ROI, and the nonskin background region is treated as the background ROI. Finally, we calculate the spatial RGB averages of all the pixels in each facial ROI and background ROI at time t to obtain the facial ROI signal and background ROI signal as

x_c(t) = (1 / |Ω|) Σ_{p ∈ Ω} I_c(p, t),  c ∈ {R, G, B},   (1)

where x_c(t) is the facial ROI signal or the background ROI signal from the RGB channels, Ω is the area of the facial ROI or the background ROI, I_c(p, t) is the value of pixel p in Ω at time t, and |Ω| is the total number of pixels in Ω.
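As a concrete illustration of equation (1), the sketch below (Python/NumPy, our own illustration rather than code from the study) computes the two spatial averages for a single frame, assuming a Boolean skin mask has already been obtained from the YCbCr skin detector; stacking the per-frame vectors over time gives the 3 × T facial and background ROI signals.

```python
import numpy as np

def roi_means(frame, skin_mask):
    """Spatial RGB averages for one frame, following equation (1).

    frame     : H x W x 3 RGB image
    skin_mask : H x W Boolean array, True for skin pixels (facial ROI);
                non-skin pixels are treated as the background ROI.
    Returns (facial_rgb, background_rgb), each a length-3 vector.
    """
    pixels = frame.reshape(-1, 3).astype(float)
    mask = pixels_mask = skin_mask.ravel()
    facial_rgb = pixels[mask].mean(axis=0)       # average over the |Ω_face| skin pixels
    background_rgb = pixels[~mask].mean(axis=0)  # average over the |Ω_bg| background pixels
    return facial_rgb, background_rgb
```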

2.2. Joint Blind Source Separation

rPPG-based HR measurement is often affected by the environmental illumination, which can be regarded as a factor common to the facial and background ROI signals. We employed JBSS to remove the interference caused by the illumination: it separates the underlying source components shared across multiple datasets while maintaining a consistent ordering of the extracted sources across those datasets. Independent vector analysis with a Gaussian model (IVA-G) [29] was selected to implement the JBSS framework, and the process of IVA-G is shown in Figure 3.

For a given set of K matrices {X^[1], ..., X^[K]} used as the datasets, each matrix X^[k] has N rows and T columns. The k-th matrix can be expressed as X^[k] = [x_1^[k], ..., x_N^[k]]^T, k = 1, ..., K. Assuming that each matrix X^[k] is composed of N underlying independent sources, we have

X^[k] = A^[k] S^[k],   (2)

where A^[k] is the mixing matrix and S^[k] can be expressed as S^[k] = [s_1^[k], ..., s_N^[k]]^T, with s_n^[k] the underlying independent sources, as shown in Figure 3(a).

Furthermore, the source component vectors (SCVs), as defined in [30], can be obtained by grouping the corresponding underlying independent sources extracted from each matrix X^[k], and the n-th SCV can be expressed as s_n = [s_n^[1], ..., s_n^[K]]^T. Each SCV is independent of all the others, while the components within each SCV are correlated. In order to identify the target SCV, which is regarded as the common component across the given matrices, we compute for each SCV a correlation matrix with distinct correlation values (the correlation values are determined by the linear correlation between the components within the SCV). These correlation values indicate how close the components within each SCV are to one another. We assume that all the correlation values for the target SCV are sufficiently high, and we then determine the target SCV [31] by (1) counting the number of correlation values greater than an empirically determined threshold, (2) calculating the ratio of this number to the total number of correlation values for each SCV, and (3) choosing the SCV with the maximum ratio. The procedure is shown diagrammatically in Figure 3(b).
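The target-SCV selection step can be summarized with the short sketch below. It is a hedged illustration, not the authors' implementation: it assumes the SCVs have already been produced by some IVA-G implementation (not shown), and the 0.8 threshold is only a placeholder for the empirically determined value mentioned above.

```python
import numpy as np

def select_target_scv(scvs, threshold=0.8):
    """Pick the target SCV (common component) by the ratio criterion above.

    scvs      : array of shape (N, K, T) -- N SCVs, each grouping the n-th
                source from the K datasets over T samples.
    threshold : illustrative correlation threshold (placeholder value).
    Returns the index of the SCV whose components are most strongly correlated.
    """
    ratios = []
    for scv in scvs:                              # scv has shape (K, T)
        corr = np.corrcoef(scv)                   # K x K correlation matrix
        iu = np.triu_indices_from(corr, k=1)      # distinct off-diagonal values
        vals = np.abs(corr[iu])
        ratios.append(np.mean(vals > threshold))  # ratio of "high" correlations
    return int(np.argmax(ratios))
```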

In this study, the facial and background ROI signals are regarded as two matrices X^[1] and X^[2] (K = 2), where N = 3 for the RGB channels and T is the number of video frames (see Figure 4). After IVA-G, the target SCV can be extracted and is assumed to be the common illumination interference. To remove this interference, each component in the target SCV is set to zero, and the resulting source matrix is then substituted into equation (2) to obtain a new X^[1], which is the new facial ROI signal [15]. In the signal preprocessing, JBSS is applied to the whole facial ROI signal and the whole background ROI signal extracted from each video clip.
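A minimal sketch of this removal-and-reconstruction step follows, again assuming the mixing matrix and source matrix for the facial dataset are available from an IVA-G implementation; the function name and interface are our own.

```python
import numpy as np

def remove_common_illumination(A_face, S_face, target_idx):
    """Reconstruct the new facial ROI signal with the target SCV removed.

    A_face     : 3 x 3 mixing matrix estimated by IVA-G for the facial dataset
    S_face     : 3 x T source matrix for the facial dataset
    target_idx : index of the target SCV identified as common illumination
    """
    S_clean = S_face.copy()
    S_clean[target_idx, :] = 0.0       # zero the common illumination source
    return A_face @ S_clean            # substitute back into X = A S (equation (2))
```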

2.3. Projection Process Based on a Skin Reflection Model

The new facial ROI signal contains the specular reflection component, due to the direct reflection of illumination from the skin surface, and the diffuse reflection component, which remains after the reabsorption of the incident light by the skin and underlying tissue, including blood vessels. In order to extract the diffuse reflection component, which carries the pulse rate information, we use the projection method based on the simplified skin reflection model [19]:

C_n(t) = N · C(t) = N · I(t) · (u_c · c_0 + v_s(t) + v_d(t)),   (3)

where C(t) is the skin pixel intensity from the RGB channels at time t, I(t) represents the illumination variation component and u_c is the unit vector, v_s(t) denotes the specular component, v_d(t) denotes the diffuse component, u_c · c_0 denotes the stationary illumination component, and N is the diagonal matrix that is used to normalize the steady component.

In order to reduce the signal dimensionality and eliminate the illumination variation component I(t), the time-series signal C_n(t) is projected onto the specific orthogonal plane P. The optimal plane was selected empirically by studying how the projection axes on the projected plane affected the quality of the projected signal, as performed in our previous study [21]:

P = [p_1, p_2]^T,   (4)

S(t) = P · C_n(t) = [S_1(t), S_2(t)]^T,   (5)

where p_1 and p_2 are orthogonal to each other and S(t) is the projection signal with two components S_1(t) and S_2(t) (see the left-hand section of Figure 5). After the projection process, the three-dimensional signal C_n(t) has been reduced to a two-dimensional signal S(t), and I(t) is eliminated. S_1(t) and S_2(t) in S(t) are assumed to be independent of each other; thus, they can be decomposed using ICA to retrieve Y_1(t) and Y_2(t), as shown in equation (6):

Y(t) = W · S(t) = [Y_1(t), Y_2(t)]^T,   (6)

where W is the unmixing matrix (see the right-hand (ICA decomposition) section of Figure 5).
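The projection and ICA steps might look as follows in a hedged Python sketch. Because the empirically selected plane from [21] is not reproduced in the text, the POS plane from [19] is used here purely as a placeholder; scikit-learn's FastICA stands in for the ICA step, and the temporal-mean normalization is likewise an assumption about how N is constructed.

```python
import numpy as np
from sklearn.decomposition import FastICA

def project_and_decompose(face_rgb):
    """Project the normalized facial RGB signal onto a 2-D plane and unmix with ICA.

    face_rgb : 3 x T cleaned facial ROI signal.
    Note: the projection matrix P below is the POS plane [19], used only as a
    placeholder for the empirically selected plane of [21], whose coefficients
    are not given in the text.
    """
    mean_rgb = face_rgb.mean(axis=1, keepdims=True)
    C_n = face_rgb / mean_rgb                      # normalize the steady component
    P = np.array([[0.0, 1.0, -1.0],
                  [-2.0, 1.0, 1.0]])               # 2 x 3 plane, rows orthogonal
    S = P @ C_n                                    # S(t) = P * C_n(t), cf. equation (5)
    ica = FastICA(n_components=2, random_state=0)
    Y = ica.fit_transform(S.T).T                   # Y(t) = W * S(t), cf. equation (6)
    return Y[0], Y[1]
```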

2.4. Calculation of Heart Rate

However, there is no inherent ordering of Y_1(t) and Y_2(t) after ICA processing, so we have to determine which one is the pulse signal. In our previous study [21], we selected the POS signal [19] for linear correlation analysis with Y_1(t) and Y_2(t) and chose the one more closely correlated with the POS signal as the pulse signal. In contrast, some previous studies have pointed out that the green channel signal in the facial ROI contains more pulse rate information [5, 32]. Lin and Lin also determined that the green channel signal achieved the highest signal-to-noise ratio for an rPPG signal under a variety of different lighting conditions [33]. Consequently, we performed a linear correlation analysis between the green channel signal and the output signals after ICA and then chose as the pulse signal Y_p(t) the one with the highest correlation with the green channel signal. After Y_p(t) was selected, the components unrelated to HR were removed by applying a band-pass filter with cut-off frequencies of 0.7 Hz and 4 Hz, covering the normal range of human HR from 42 beats per minute (bpm) to 240 bpm [21]. Finally, an FFT was applied to the filtered pulse signal, and the heart rate was taken as the frequency at which the spectral power was maximal. The process is shown in Figure 6.
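A compact sketch of this selection and HR-estimation step (our own illustration; the filter order and the use of a Butterworth design are assumptions, as the text only specifies the 0.7–4 Hz pass band):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def hr_from_candidates(y1, y2, green, fps=30.0, band=(0.7, 4.0)):
    """Select the pulse component by its correlation with the green channel,
    band-pass it to 0.7-4 Hz, and read the HR (bpm) from the FFT peak."""
    # Pick the ICA output most correlated (in absolute value) with the green trace.
    corr1 = abs(np.corrcoef(y1, green)[0, 1])
    corr2 = abs(np.corrcoef(y2, green)[0, 1])
    pulse = y1 if corr1 >= corr2 else y2

    # Band-pass filter covering 42-240 bpm.
    b, a = butter(3, [band[0] / (fps / 2), band[1] / (fps / 2)], btype="bandpass")
    pulse = filtfilt(b, a, pulse)

    # HR = frequency of maximal spectral power within the band, in bpm.
    spectrum = np.abs(np.fft.rfft(pulse)) ** 2
    freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[in_band][np.argmax(spectrum[in_band])]
```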

2.5. Experiments and Datasets

To evaluate the algorithm’s performance, we selected three datasets for testing: data from a previous study in our laboratory [21], the Public Benchmark Dataset [25], and a dataset collected specifically for this study. All the computations were carried out on a laptop (Intel® Core™ i5-7300HQ CPU with 2.5 GHz processor speed and 8 GB RAM) running MATLAB® 2017a (The MathWorks, Inc.).

2.5.1. Data from a Previous Study in Our Laboratory

This dataset includes 112 videos from 28 subjects with various skin colors (18 pale-skinned and 10 dark-skinned subjects) [21]. The subjects were tested in four different scenarios in a natural light environment: stationary, exercise recovery, swinging head, and interacting with a computer (playing video games). A standard RGB camera (C2070i, Logitech Inc., CA, USA) was used to record the subject’s face at 30 frames per second (fps) with a pixel resolution of , and each video record was one minute long. A transmissive finger pulse oximeter (Yuwell™ YX303, Yuyue Medical Equipment and Supply Co., China) was used to acquire the reference HR value, averaged over successive 10-second periods. Details of the four scenarios are shown in Table 1. More details of the experimental setup are described in reference [21].

2.5.2. Public Benchmark Dataset

We chose the Public Benchmark Dataset [25] because, unlike many others, this contains recordings under a variety of lighting conditions during both exercise recovery and head movement scenarios. It contains 21 videos recorded under different conditions from three subjects (one light-skinned and two dark-skinned). All videos were recorded with an RGB HD video camera (GZ-VX815BE HD, JVC Inc., Japan) at a frame rate of 30 fps and a pixel resolution of . The ground-truth HR was measured using a Mobi ECG device which is CE certified (class 2A, type CF). Since the length of each video clip ranged from 3 minutes to 3 minutes and 15 seconds, we processed the first three minutes of each video and treated this as 3 separate one-minute measurements for each subject. Each video has a filename coded by the specific subject and lighting condition, for example, the filename “P2LC3” means the video of subject #2 under lighting condition LC3. The labels and corresponding lighting conditions are listed in Table 2. More details of the experiment can be found on the dataset website https://osf.io/rwsx6/. Some stills from recordings of the head movement scenario are shown in Figure 8. For the other conditions, the subjects were required to look at the camera without moving.

2.5.3. Dataset Recorded Specifically for This Study

The Public Benchmark Dataset considers different illumination intensities and color temperatures but does not consider variation in the background illumination. Such variation can affect the accuracy of rPPG-based noncontact HR measurements [15]. For example, when a person is looking at a computer screen, the flickering light of the screen is reflected on their face; similarly, the illumination on a driver’s face in a moving car varies continuously with oncoming traffic and other changes in external illumination. To further investigate the effect of illumination variation on rPPG-based noncontact HR measurements, we established a varying-illumination dataset which, in contrast to the Public Benchmark Dataset, used variations in color temperature rather than intensity. It contains 12 videos from 12 subjects (8 pale-skinned and 4 dark-skinned), all of whom gave their informed consent. The experimental protocol was approved by the ethical review board of our institute. A standard RGB camera (C2070i, Logitech Inc., CA, USA) was used to record signals for one minute at 30 fps with -pixel resolution.

All subjects remained stationary, exposed to a natural light background upon which was superimposed the additional illumination, switched between two color temperatures. The additional illumination was provided by an LED lamp (LED308W, Godox Inc., China), placed in front of the subjects at a distance of 1 meter, giving a light intensity on the face of 430 lux (manufacturer’s figures). The color temperature, set by a manual control unit, was switched between 3300 kelvin and 5600 kelvin every 10 seconds for a period of one minute, according to the schematic in Figure 9(a). A multiphysiological parameter monitor (NSC-M12P, Neusoft Inc., China) was used to acquire the reference HR signal using a pulse oximeter finger probe, and the reference HR value was averaged over successive 10-second intervals. The whole experimental setup is shown in Figure 9(b).

2.6. Comparison with Existing Methods

In order to evaluate the performance of the proposed method, we compared it with three other recently developed methods: JBSS_EEMD [15], POS [19], and Project_ICA [21]. All four methods were applied to the three datasets. To ensure a fair comparison in the preprocessing stage, we used the method described in Section 2.1 to extract the original facial and background ROI signals from each video. The facial ROI signal was then processed by POS and Project_ICA, while both the facial and background ROI signals were processed by JBSS_EEMD and the proposed method to estimate the HR. A 30 s time window was used to process each video, and adjacent time windows were overlapped by 29 s, so that each one-minute video yielded 31 results. The procedure is illustrated in Figure 10. It is worth noting that, in the preprocessing stage, the JBSS method was applied to the whole facial ROI signal (1 min long) and background ROI signal (1 min long), and the projection process was then applied to each 30 s time window.
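The windowing scheme can be expressed in a few lines of Python (an illustration of the protocol above, not the authors' code):

```python
def sliding_windows(n_frames, fps=30.0, win_s=30.0, step_s=1.0):
    """Frame index ranges for 30 s analysis windows with a 1 s step (29 s overlap).
    For a 60 s recording at 30 fps this yields 31 windows, matching the protocol."""
    win = int(win_s * fps)
    step = int(step_s * fps)
    return [(start, start + win) for start in range(0, n_frames - win + 1, step)]

# Example: len(sliding_windows(1800)) == 31 for a one-minute video at 30 fps.
```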

3. Results

To assess and compare the performance of the four methods on the three datasets, three evaluation metrics were adopted: Pearson correlation coefficient (r), Mean Absolute Deviation (MAD) (bpm), and Root Mean Square Error (RMSE) (bpm):

r = Σ_{i=1}^{M} (HR_est(i) − mean(HR_est)) (HR_gt(i) − mean(HR_gt)) / sqrt( Σ_{i=1}^{M} (HR_est(i) − mean(HR_est))^2 · Σ_{i=1}^{M} (HR_gt(i) − mean(HR_gt))^2 ),   (7)

MAD = (1/M) Σ_{i=1}^{M} |HR_est(i) − HR_gt(i)|,   (8)

RMSE = sqrt( (1/M) Σ_{i=1}^{M} (HR_est(i) − HR_gt(i))^2 ),   (9)

where HR_est(i) and HR_gt(i) represent the subject’s estimated and ground-truth HR values, respectively, at the i-th time window, and M is the total number of time windows. In order to remove outliers in the estimated HR values, the Boxplot method was employed [34], by choosing 0.8 and 1.2 times the median value as the lower and upper limits, respectively, and removing values outside this range.
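For reference, a brief sketch of the metric computation and the boxplot-style outlier rejection described above (our own illustration; whether rejection is applied before or after pairing with the ground truth is an assumption):

```python
import numpy as np

def remove_outliers(hr_est):
    """Keep estimates within 0.8-1.2 times the median of the estimated HR values."""
    hr_est = np.asarray(hr_est, dtype=float)
    med = np.median(hr_est)
    return (hr_est >= 0.8 * med) & (hr_est <= 1.2 * med)

def evaluate(hr_est, hr_gt):
    """Pearson r, MAD (bpm), and RMSE (bpm), as in equations (7)-(9)."""
    hr_est, hr_gt = np.asarray(hr_est, float), np.asarray(hr_gt, float)
    keep = remove_outliers(hr_est)
    hr_est, hr_gt = hr_est[keep], hr_gt[keep]
    err = hr_est - hr_gt
    r = np.corrcoef(hr_est, hr_gt)[0, 1]
    mad = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    return r, mad, rmse
```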

3.1. Results from the Dataset Used in a Previous Study from Our Laboratory

This dataset tests the performance of the algorithm under the conditions of no movement, exercise recovery, swinging head, and human-computer interaction. All the measurements were performed on pale- and dark-skinned subjects, and the results are summarized in Tables 3 and 4, in which the best and second-best agreements are marked.

Tables 3 and 4 indicate that, for most of the scenarios, the results of the proposed method and those of POS and Project_ICA are in good agreement, and these three methods outperform JBSS_EEMD, as shown by its higher values of MAD and RMSE and lower values of r. The proposed method had the best performance in two scenarios (exercise recovery and swinging head for dark-skinned subjects) and the second-best in one scenario (swinging head for pale-skinned subjects). Specifically, for the pale-skinned subjects, the POS and Project_ICA methods performed slightly better than the approach proposed here under the stationary and human-computer interaction scenarios, while the results of the proposed method are comparable with those of POS and Project_ICA under the exercise recovery and swinging head scenarios. For the dark-skinned subjects, the proposed method is more accurate than POS and Project_ICA under the swinging head scenario, whereas the results of POS and Project_ICA are slightly better than those of the proposed method under the stationary and human-computer interaction scenarios. Moreover, the proposed method performs as well as the POS and Project_ICA methods under the exercise recovery conditions.

To evaluate the consistency between the measured HR value and the ground truth, we constructed Bland-Altman plots comparing HR_est obtained from the proposed method with HR_gt for each of the four scenarios (see Figure 11), considering dark- and pale-skinned subjects separately.

Figure 11 demonstrates that, for the proposed method, the agreement between HR_est and HR_gt under the stationary and human-computer interaction scenarios is better than that under the exercise recovery and swinging head scenarios. The proposed method performs best for the pale-skinned subjects under the stationary scenario, with narrow 95% limits of agreement (-6.03 to 4.73 bpm) and a mean bias close to zero (-0.83 bpm). For the pale-skinned subjects under the human-computer interaction scenario, the 95% limits of agreement are also narrow, ranging from -6.20 to 6.09 bpm, with a mean bias close to zero (-0.05 bpm). For all scenarios, the agreement between the proposed method and the ground truth was closer for the pale-skinned subjects than for the dark-skinned subjects.

3.2. Results from the Public Benchmark Dataset

This dataset has been used mainly to test the performance of the algorithm under various inadequate lighting conditions. The single light-skinned subject was additionally tested under exercise recovery conditions and with the head moving at different frequencies, whereas the two dark-skinned subjects were only tested under conditions LC1 to LC5, none of which involved head movement. The results for the light-skinned subject are shown in Table 5, and those for the dark-skinned subjects are shown in Table 6, in which the best and second-best results are marked.

Tables 5 and 6 indicate that, for the scenarios involving the different illumination conditions (P1LC1-LC7, P2LC1-LC5, and P3LC1-LC5), the proposed method and JBSS_EEMD were more accurate than POS and Project_ICA, and that the proposed method had the best performance in six scenarios and the second-best in five. It is worth noting that under some low-illumination scenarios, such as LC1, LC2, LC3, and LC7, the errors of POS and Project_ICA are much larger than those of JBSS_EEMD and the proposed method. The results of the proposed method are comparable with those of POS and Project_ICA under the exercise recovery (P1H1) and head movement (P1M1) scenarios, and these three methods are superior to JBSS_EEMD. It should also be noted that the results of all four methods were less accurate when the subject nodded their head 90 times per minute (P1M3) than when they nodded 60 times per minute (P1M2).

For the proposed method, we constructed Bland-Altman plots to show the agreement between HR_est and HR_gt (see Figure 12). Since the two skin-tone groups were both recorded under lighting conditions LC1 to LC5, the results of the light-skinned and dark-skinned subjects under these five conditions are plotted alongside each other for ease of comparison.

Figure 12 compares the agreement between HR_est and HR_gt for the subjects with different skin tones and shows that the method performs better for the light-skinned subject than for the dark-skinned ones under all conditions.

3.3. Dataset Specific to This Study

This dataset allows evaluation of the performance of the four methods when the subjects are stationary and the background illumination is varied by step changes in color temperature. Experiments were performed on 8 pale-skinned and 4 dark-skinned subjects, and the results are listed in Table 7, in which, as before, the best and second-best results are marked.

Table 7 shows that when the background color temperature is repeatedly switched between two levels, the proposed method yields the best results for both the pale- and dark-skinned subjects. Figure 13 further demonstrates the consistency of the proposed method: for the pale-skinned subjects, the 95% limits of agreement range from -7.74 to 10.06 bpm with a mean bias of 1.16 bpm, marginally better than for the dark-skinned subjects, for whom the corresponding values are -12.82 to 9.45 bpm and -1.68 bpm.

4. Discussion

The results from the data obtained in our earlier study [21] show that, with sufficient illumination, POS, Project_ICA, and the proposed method, all of which are based on the skin reflection model, give better outcomes than the JBSS_EEMD approach. This implies that the skin reflection model, which relies on the projection process to weaken the contribution of the specular component of the light reflected from the skin, allows the processed signal to better retain the pulse rate information [19, 21]. We also note that, under adequate illumination, the performance of the POS and Project_ICA methods is slightly better than that of the proposed method.

The results from the Public Benchmark Dataset show that POS and Project_ICA performed badly when the illumination was poor. In contrast, JBSS_EEMD and the new method proposed here had relatively better outcomes, with the new method superior to JBSS_EEMD. Not unexpectedly, the proposed method was more accurate under adequate illumination than under low illumination. For POS and Project_ICA, the projection process is applied directly to eliminate the component of the signal due to variations in background illumination, giving rise to more efficient separation of the diffuse and specular reflection components [19]. However, when the background illumination is inadequate, the camera cannot capture enough light and the projection process does not work well. The JBSS preprocessing, on the other hand, can effectively reduce some of the undesirable effects of variation in the environmental illumination and therefore enhance the performance of the subsequent projection process. In the exercise recovery and head swinging scenarios, we also found the same pattern as in the dataset from our earlier study [21], namely, good agreement among the methods for each metric.

In the results from the dataset compiled for this study, when the background illumination was repeatedly switched between two color temperatures, it was found that the proposed method had the optimum results. The main reason is that it first uses JBSS to weaken the adverse effects of changes in environmental illumination and then uses the skin reflection model to more consistently obtain the signal component which contains the pulse rate information.

Under the head movement and exercise recovery scenarios, the results from the Public Benchmark Dataset were similar to those from the dataset used in our previous report [21], and the methods based on the skin reflection model were generally better than JBSS_EEMD. We also observed that the HR errors under the head movement and exercise recovery scenarios were greater than those for stationary subjects, and that the higher the frequency of head movement, the lower the accuracy of all four methods. The reason for the poor performance in the head movement scenario is that when the subject's head is moving rapidly, the detected facial ROI becomes small or is even lost entirely, which leads to unsatisfactory acquisition of the rPPG signal. The reason for the comparatively poor performance in the exercise recovery scenario is that the subject's HR is expected to change markedly during the measurement period, perturbing the HR estimation over the sampling period used in this and our previous study [21]. This variation therefore cannot be accurately captured, and the average over the period represents neither the exercise HR nor the resting value to which it returns. We also observed that POS and Project_ICA performed slightly better than the proposed method under the stationary scenario because, under conditions of sufficient and invariant illumination, JBSS contributes little in the preprocessing stage.

From the results on the three datasets, it is clear that the dark-skinned subjects yield less accurate HR values than those with lighter skin. The reason is that the absorption spectra of melanin and hemoglobin overlap to some extent, so the component of the reflected light that has interacted with the hemoglobin is also affected by the melanin in the skin [35]. The more melanin present, the greater its impact on the HR signal.

Overall, the proposed method combines the advantages of the signal decomposition and skin reflection model approaches: the JBSS is used to remove the common illumination component from the facial and background ROI signals, and the projection process is then used to obtain the signal that retains the pulse rate information. This technique can not only lessen the effect of poor illumination but also provide an accurate HR measurement under conditions of inadequate illumination and/or repeated changes of illumination. However, the facial ROI processing of the proposed method may still be inadequate, resulting in poorer accuracy for the rapid head movement cases as well as for subjects with more melanin in the skin. Although we have exploited three datasets that contain a variety of subjects and scenarios, there are still some limitations. The number of subjects in the Public Benchmark Dataset and the number of scenarios in the dataset compiled for this study are too small to robustly validate the accuracy of noncontact video HR measurements under real-world conditions. In addition, the dataset compiled for this study involves a simple, repeated change in color temperature and is thus only a basic simulation of the more diverse changes in ambient illumination expected in the field. In future studies, we will investigate more realistic changes in illumination and pay further attention to the application of novel sensor technologies, such as fiber optic sensing, perhaps combining them with devices designed to detect force or biomolecules of interest, such as blood glucose or ascorbic acid [36–39], although achieving this in a noncontact device would be a challenging problem.

5. Conclusions

In this study, we have proposed a new method for measuring HR with a webcam, which combines JBSS with a projection method based on a skin reflection model. The JBSS step can effectively reduce the effect of low-intensity ambient illumination and of illumination variations, while the projection method based on the skin reflection model requires adequate illumination intensity to effectively extract the pulse rate information. We validated the proposed method on three datasets and compared the results with those of three other recently developed methods. Although the proposed method is not optimal under the stationary and human-computer interaction scenarios, differing from the best by only a small margin, it provides better estimates under most conditions, especially under inadequate and varying illumination. Thus, the approach proposed here has the potential to become a method of choice for the noncontact measurement of HR, because it is reasonably accurate and functions well under a variety of environmental conditions and for subjects with a range of skin tones.

Data Availability

The Public Benchmark Dataset is available at https://osf.io/rwsx6/. The datasets used in a previous study from our laboratory and the dataset specific to this study are not available at this time due to participant privacy.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2020YFC2004400), the National Natural Science Foundation of China (No. 61773110, No. 61374015), the Fundamental Research Funds for the Central Universities (Nos. N181906001, N172008008), and the Open Grant by the National Health Commission Key Laboratory of Assisted Circulation (Sun Yat-sen University) (No. cvclab201901).