Abstract

In the mHealth field, accurate breathing rate monitoring benefits a broad array of healthcare-related applications. Many approaches try to accomplish with a smartphone or wearable device and a fine-grained monitoring algorithm what previously could only be done with professional medical equipment. However, such schemes usually perform poorly in comparison to professional medical equipment. In this paper, we propose DeepFilter, a deep learning-based fine-grained breathing rate monitoring algorithm that works on smartphones and achieves professional-level accuracy. DeepFilter is a bidirectional recurrent neural network (RNN) stacked with convolutional layers and sped up by batch normalization. Moreover, we collect 248 hours (16.17 GB) of breathing sound recordings from 109 volunteers to train our model and from another 10 volunteers to test it. The results show a reasonably good accuracy of breathing rate monitoring.

1. Introduction

The emergence of mHealth draws much attention in both industry and academia [1]. Google, Microsoft, and Apple have conducted a series of mHealth efforts spanning hardware and software. Google was the first to get involved in mHealth. In April 2012, Google released Google Glass [2] and applied it to healthcare in July 2013 [3], when Pristine announced that it would develop medical applications for Google Glass. After that, Google completed the acquisition of Lift Labs, a biotech company that invented an electronic spoon to help Parkinson's patients eat. In 2015, Google X announced that it was working on wearable suits that can examine users' cancer cells. In addition, Microsoft Band, Apple Watch, Fitbit, Jawbone, and many more smart wearable devices are blooming everywhere.

There exists a broad array of healthcare-related applications on sleep monitoring by smart wearable devices [4]. They often aim at fine-grained breathing rate monitoring as a kind of nonobtrusive sleep monitoring for understanding users' sleep quality. Since inadequate and irregular sleep can lead to serious health problems such as fatigue, depression, cardiovascular disease, and anxiety [5], breathing rate monitoring is critical to detect early signs of several diseases such as diabetes and heart disease [6]. Breathing rate monitoring can also be applied to sleep apnea diagnosis and treatment, asthma treatment [7], and sleep stage detection [8]. Thus, fine-grained breathing rate monitoring is important to facilitate these healthcare-related applications.

Traditionally, one's breathing rate can be captured by professional medical equipment such as the monitoring machines in hospitals. In most cases, such machines are too expensive, too complex, and too heavy for ordinary people's daily use. A possible solution, increasingly popular in current healthcare-related applications, is to achieve accurate sleep monitoring via a smartphone or other device with a recognition algorithm [9]. For example, Ren et al. [10] exploit the readily available smartphone earphone placed close to the user to reliably capture the human breathing sound; this cannot work if the earphone is away from the user. Liu et al. [11] track the vital signs of both the breathing rate and the heart rate during sleep by using off-the-shelf WiFi without any wearable or dedicated devices. However, such devices cannot achieve performance approximating that of professional medical equipment: the signals they capture have a much lower signal-to-noise ratio (SNR), whereas professional equipment can achieve a much higher accuracy in breathing rate monitoring.

In this paper, we aim at developing a fine-grained breathing rate monitoring algorithm that works on smartphones and achieves professional-level accuracy. We propose a deep learning model, DeepFilter, which can filter breathing events from low SNR data. We empirically exploit the framework of deep learning and apply it to fine-grained breathing rate monitoring on smartphones. The model combines several convolutional layers with a bidirectional recurrent layer and is trained in an end-to-end manner using the cross-entropy loss function. In addition, batch normalization is adopted to speed up the training. Moreover, we collect 248 hours (16.17 GB) of breathing sound recordings from 109 volunteers to train our model and from another 10 volunteers to test it. The results show a reasonably good accuracy of breathing rate monitoring.

The main contributions of this paper are highlighted as follows:
(i) To the best of our knowledge, we are the first to apply deep learning to fine-grained breathing rate monitoring, that is, to low SNR data recognition.
(ii) We run real experiments on smartphones and verify the availability and performance of our model, which directly promotes sleep monitoring applications in our daily lives.

2. Related Work

Since our scheme involves accurate sleep monitoring and deep learning, we mainly discuss the previous work on these two aspects.

Medical-based sleep-monitoring systems are often developed for clinical usage. In particular, polysomnography [12] is used in medical facilities to perform accurate sleep monitoring by attaching multiple sensors on patients, which requires professional installation and maintenance. It can measure many body functions during sleep, including breathing functions, eye movements, heart rhythm, and muscle activity. Such systems incur high cost and are usually limited to clinical usage. DoppleSleep [13] is a contactless sleep sensing system that continuously and unobtrusively tracks sleep quality, by using commercial off-the-shelf radar modules.

Some smartphone apps, such as Sleep as Android, Sleep Cycle Alarm Clock, and iSleep [14], can perform low-cost sleep monitoring by using the smartphone's built-in microphone and motion sensors. These apps, however, only support coarse-grained monitoring, such as the detection of body movements, coughing, and snoring [15], and utilize phone usage features such as the duration of phone lock to measure sleep duration. The Respiratory app [16] derives a person's respiratory rate by analyzing the movements of the user's abdomen when the phone is placed between the user's rib cage and stomach. ApneaApp [17] is a contactless sleep apnea event detection system that works on smartphones by transforming the phone into an active sonar system that emits frequency-modulated sound signals and listens to their reflections to capture breathing movements. Ren et al. [10] exploit the readily available smartphone earphone placed close to the user to reliably capture the human breathing sound. Liu et al. [11] propose to track the vital signs of both the breathing rate and the heart rate during sleep by using off-the-shelf WiFi without any wearable or dedicated devices. There is still a gap between the performance of professional equipment and that of the approaches above.

Recently, deep neural networks were first used for better phone recognition [18], in which traditional Gaussian mixture models are replaced by deep neural networks that contain many layers of features and a very large number of parameters. Convolutional networks have also been found beneficial for acoustic models. Recurrent neural networks are beginning to be deployed in state-of-the-art recognizers [19] and work well with convolutional layers for feature extraction [20]. Inspired by the good performance of this previous work on speech recognition, we introduce deep learning into the problem of fine-grained breathing rate monitoring [21].

3. DeepFilter

In this section, we introduce the whole framework of DeepFilter and investigate the training of the model in detail.

3.1. The Framework

Figure 1 shows the framework of DeepFilter. Our model is an RNN that begins with several convolutional input layers, followed by a fully connected layer and multiple recurrent (uni- or bidirectional) layers, and ends with an output layer. The network is trained end to end with the cross-entropy loss function, with batch normalization added.

In speech recognition, "end-to-end" often means training without aligned training labels, which does not involve a frame-level cross-entropy loss function. In breathing rate monitoring, we instead first create frame-aligned training labels and cast the task as a classification problem: deciding whether each frame belongs to inhaling/exhaling or not. Since one inhaling/exhaling event may span several frames, we exploit recurrent layers to process the input sequence, as they can capture sequence information to improve performance. Thus, the input/output sequences of our model are similar to those of speech recognition. The only difference is that the input and output sequences of speech recognition have different lengths, while those of our model have the same length.

For one sample $x$ with label $y$, sequence frames are sampled from the training set $X$, which consists of $N$ voice recordings. We assume that $X_i$ is a sound recording of 136 seconds (the sampling rate is 44100 Hz) and that the duration of each frame is 40 ms (an empirical value commonly used as the window size in speech recognition). Each recording can be divided into samples, where each sample combines 50 frames and one frame is a 1764-dimension vector (44100 Hz $\times$ 40 ms). Thus, a voice recording of 136 seconds can be divided into 68 samples (136/2 = 68, since 40 ms $\times$ 50 = 2 s). Each sample has a corresponding label sequence $y = (y_1, y_2, \ldots, y_{50})$, $y_j \in \{0, 1\}$, in which a frame without breathing is labelled 0 and a frame with breathing is labelled 1. Generally, the goal of our processing is to convert an input sequence into a 0-1 sequence.
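As a concrete illustration, the following Python sketch frames a recording in the way just described; the array `recording` and the function name are illustrative assumptions, not part of the original implementation.

```python
# A minimal framing sketch, assuming a NumPy array `recording` holding a
# 136 s mono signal sampled at 44100 Hz.
import numpy as np

SAMPLE_RATE = 44100
FRAME_MS = 40                                 # empirical window size
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 1764 points per 40 ms frame
FRAMES_PER_SAMPLE = 50                        # 50 frames = one 2 s sample

def frame_recording(recording):
    """Split a recording into samples of shape (50, 1764)."""
    n_frames = len(recording) // FRAME_LEN
    frames = recording[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    n_samples = n_frames // FRAMES_PER_SAMPLE
    return frames[: n_samples * FRAMES_PER_SAMPLE].reshape(
        n_samples, FRAMES_PER_SAMPLE, FRAME_LEN)

# A 136 s recording yields 68 samples: 136 s / 2 s per sample.
recording = np.random.randn(136 * SAMPLE_RATE)
print(frame_recording(recording).shape)       # (68, 50, 1764)
```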

The data described above are suitable as input for an RNN. However, our model is an RNN with several convolutional input layers, which requires the input to have a two-dimensional structure. Thus, we split one 40 ms frame into four 10 ms frames. Each 10 ms frame is translated from the time domain to the frequency domain through the FFT, which produces a 220-dimension vector. In this way, we translate a one-dimensional 40 ms frame into a two-dimensional spectrogram of size 4 × 220.
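A minimal sketch of this frame-to-spectrogram conversion, assuming each 1764-point frame is split into four 441-point subframes and the FFT magnitude keeps 220 frequency bins:

```python
import numpy as np

def frame_to_spectrogram(frame):                       # frame: (1764,)
    subframes = frame.reshape(4, 441)                  # four 10 ms subframes
    spectrum = np.abs(np.fft.rfft(subframes, axis=1))  # (4, 221) magnitudes
    return spectrum[:, :220]                           # keep 220 bins

print(frame_to_spectrogram(np.random.randn(1764)).shape)  # (4, 220)
```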

The main idea of our scheme is to distinguish breathing events within low SNR recordings. The deep learning model needs high-frequency signal content to learn fine-grained features. Thus, we use a sampling rate of 44100 Hz, the highest sampling rate of most smartphones on the market.

3.2. Batch Normalization for Deep Bidirectional RNNs

To efficiently absorb data, we increase the depth of the network by adding more convolutional and fully connected layers. However, it becomes more challenging to train the network using gradient descent as the size and the depth increase; even the Adagrad algorithm achieves only limited improvement. We therefore add batch normalization [22] to train the deeper network faster. Recent research has shown that batch normalization can speed up convergence, though it does not always improve generalization error. In contrast, we find that when applied to very deep RNNs, it not only accelerates training but also substantially reduces final generalization error.

In a typical feed-forward layer containing an affine transformation $Wx + b$ followed by a nonlinearity $f(\cdot)$, we insert a batch normalization transformation by applying $f(\mathrm{BN}(Wx))$, where
$$\mathrm{BN}(x) = \gamma \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \varepsilon}} + \beta,$$
in which E and Var are the empirical mean and variance over a minibatch, respectively. The learnable parameters $\gamma$ and $\beta$ allow the layer to scale and shift each hidden unit as desired. The constant $\varepsilon$ is small and positive and is included only for numerical stability. In our convolutional layers, the mean and variance are estimated over all the temporal output units for a given convolutional filter on a minibatch. The batch normalization transformation reduces internal covariate shift by insulating a given layer from potentially uninteresting changes in the mean and variance of the layer's input.
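For illustration, a minimal NumPy sketch of the batch normalization transform defined above, using training-mode statistics only; `gamma`, `beta`, and `eps` play the roles of $\gamma$, $\beta$, and $\varepsilon$ in the formula:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the minibatch axis (axis 0), then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```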

A bidirectional recurrent layer is implemented as
$$\overrightarrow{h}_t = f(W x_t + \overrightarrow{U} \overrightarrow{h}_{t-1} + b), \qquad \overleftarrow{h}_t = f(W x_t + \overleftarrow{U} \overleftarrow{h}_{t+1} + b),$$
where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are computed sequentially from $t = 1$ to $t = T$ and from $t = T$ to $t = 1$, respectively. The following (nonrecurrent) layer takes both the forward and backward units as inputs $h_t = \overrightarrow{h}_t + \overleftarrow{h}_t$, and the activation function is the clipped ReLU, $f(x) = \min\{\max\{x, 0\}, 20\}$.
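The following NumPy sketch illustrates this bidirectional recurrence; the use of plain RNN cells and the weight names are assumptions for illustration:

```python
import numpy as np

def clipped_relu(x, clip=20.0):
    return np.minimum(np.maximum(x, 0.0), clip)

def bidirectional_layer(X, W, U_f, U_b, b):   # X: (T, d_in)
    T, H = X.shape[0], U_f.shape[0]
    h_f, h_b = np.zeros((T, H)), np.zeros((T, H))
    for t in range(T):                        # forward pass, t = 1..T
        prev = h_f[t - 1] if t > 0 else np.zeros(H)
        h_f[t] = clipped_relu(X[t] @ W + prev @ U_f + b)
    for t in reversed(range(T)):              # backward pass, t = T..1
        nxt = h_b[t + 1] if t < T - 1 else np.zeros(H)
        h_b[t] = clipped_relu(X[t] @ W + nxt @ U_b + b)
    return h_f + h_b                          # combined units h_t
```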

There are two ways of applying batch normalization to the recurrent operation:
$$h_t = f(\mathrm{BN}(W x_t + U h_{t-1})) \qquad \text{or} \qquad h_t = f(\mathrm{BN}(W x_t) + U h_{t-1}),$$
where in the first one the mean and variance statistics are accumulated over a single time step of the minibatch, which is ineffective in our study. We find that the second one (sequence-wise normalization) works well, and Cooijmans et al. [23] have explained the reason.
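A sketch of the second (sequence-wise) variant for a single time step; for simplicity the statistics here are taken over the batch axis of $W x_t$, whereas a full implementation accumulates them over all time steps of the minibatch:

```python
import numpy as np

def recurrent_step_seqwise_bn(x_t, h_prev, W, U, b, gamma, beta, eps=1e-5):
    z = x_t @ W                                   # (batch, H), one time step
    # Normalize only the input-to-hidden term W x_t; the recurrent term
    # U h_{t-1} is left untouched, as in the second equation above.
    z_norm = gamma * (z - z.mean(0)) / np.sqrt(z.var(0) + eps) + beta
    return np.minimum(np.maximum(z_norm + h_prev @ U + b, 0.0), 20.0)
```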

3.3. Convolutions

Temporal convolution is commonly used in speech recognition to efficiently model temporal translation invariance for variable-length utterances. Convolution in frequency attempts to model spectral variance due to speaker variability more concisely than is possible with large fully connected networks. We experiment with one to three convolutional layers, both in the time-and-frequency domain (two-dimensional) and in the time-only domain (one-dimensional).

Some previous works [24] have demonstrated that multiple layers of two-dimensional convolution improve results substantially over one-dimensional convolution, and that convolution performs well on noisy data. A key requirement of low SNR data recognition is denoising [24]. Thus, the convolutional component of the model is the key to its working well on low SNR data.

3.4. Bidirectional

Recurrent models with only forward recurrences routinely perform worse than similar bidirectional ones, implying that some amount of future context is vital for good performance. However, bidirectional recurrent models are challenging to deploy in an online, low-latency setting, because they cannot stream the transcription process as the utterance arrives from the user. Baidu's DeepSpeech2 [24] provides a special layer called lookahead convolution, which allows a unidirectional model to match the accuracy of a bidirectional one. In our study, we use the bidirectional recurrent model because we do not have the constraint of working in real time.

4. Fine-Grained Breathing Rate Monitoring

4.1. Training Data

Large-scale deep learning systems require an abundance of labelled data. For our system, we need a large amount of labelled low SNR data, but low SNR data are difficult to label. To train our large model, we therefore first collect high SNR data (data collected in quiet surroundings in which breathing can be heard clearly), which are much easier to label. After that, we combine the labelled data with pink noise to lower the SNR of the data through data synthesis, a technique that has been successfully applied to data augmentation in speech recognition.
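A minimal sketch of this data synthesis step in Python, assuming a simple 1/f spectral shaping to generate pink noise and a target SNR chosen by the experimenter:

```python
import numpy as np

def pink_noise(n):
    """Approximate pink noise by shaping white noise with a 1/f spectrum."""
    spectrum = np.fft.rfft(np.random.randn(n))
    f = np.arange(1, len(spectrum) + 1)
    return np.fft.irfft(spectrum / np.sqrt(f), n)

def mix_at_snr(clean, snr_db):
    """Add pink noise scaled so the mixture reaches the target SNR (dB)."""
    noise = pink_noise(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise      # labels of `clean` are kept unchanged
```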

Figure 2 shows one high SNR sound recording of 45 seconds. The 11 red boxes mark 11 expirations, and the black, green, and blue boxes mark 2 inspirations, 8 silences, and 3 noises, respectively. When we enlarge the frames of expiration, inspiration, silence, and noise, it is obvious that the differences between adjacent points in a silence frame are, on average, much lower than those in the other three types of frames, and its amplitudes are also the smallest. We therefore give each frame a frame index as follows:
$$\mathrm{index}(F_i) = \sum_{j=1}^{n-1} |F_i(j+1) - F_i(j)|,$$
where $F_i$ is the $i$th frame in the data and $n$ is the number of points in a frame. The subfigure at the bottom of Figure 2 shows the frame indexes of one sound recording (the x-axis is the frame and the y-axis is the corresponding frame index). To facilitate observation, we add green lines on the frames whose index is larger than the threshold. The frame indexes distinguish silence frames from the other three types of frames well. For labelling the data, we erased the inspirations and noises manually (since the recordings have high SNR and the volunteers were required to keep quiet during recording), and the threshold was also chosen manually. All the above actions were conducted in the sound-processing software Audacity [25], and the thresholds and labels were determined in Matlab.
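A small Python sketch of the frame-index computation and thresholding, assuming the index of a frame is the sum of absolute differences between adjacent points as defined above; the threshold is supplied manually, as in our labelling procedure:

```python
import numpy as np

def frame_indexes(frames):            # frames: (n_frames, frame_len)
    """Sum of absolute differences between adjacent points per frame."""
    return np.abs(np.diff(frames, axis=1)).sum(axis=1)

def label_frames(frames, threshold):
    """1 = breathing frame, 0 = silence frame (index below the threshold)."""
    return (frame_indexes(frames) > threshold).astype(int)
```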

It took us 3 weeks to label about 7458 sound recordings (one recording is about 2 minutes, and the total recordings from 109 people amount to about 248 hours). Finally, we used Audacity to complete the data synthesis discussed above.

During the labelling, we found that the number of breathing frames is larger than that of nonbreathing frames, which may reduce classification performance. Thus, we add a weight to the cross-entropy loss function as follows:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ w\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$$
where $w < 1$ down-weights the majority breathing class, $y_i$ is the label of the $i$th frame, and $\hat{y}_i$ is the model's output.
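A sketch of this weighted cross-entropy in TensorFlow; the weight value `w = 0.6` is an illustrative assumption, since in practice the weight is tuned on the data:

```python
import tensorflow as tf

def weighted_cross_entropy(y_true, y_pred, w=0.6, eps=1e-7):
    """Binary cross-entropy with a down-weighted positive (breathing) class."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    loss = -(w * y_true * tf.math.log(y_pred)
             + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)
```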

4.2. Postprocessing

Figure 3 is a snapshot of 1000 continuous frames randomly chosen from the training data, in which the top subfigure is the ground truth and the bottom subfigure shows the recognition results of our model (the x-axis is the frame and the y-axis is the label value). As shown in the figure, breathing is continuous and periodic, while incorrectly labelled data and misrecognized frames are abrupt and discrete. Thus, we define postprocessing as follows.

First, we regard a breathing event as a run of continuous frames with the breathing label. We then delete the breathing events whose frame count is less than a threshold, which is the key point of postprocessing. In this study, we choose a value of 50 because a breathing time of less than 0.5 s is abnormal according to our tests. As shown in Figure 3, the postprocessing removes the events in the green dotted-line circle. The effect of postprocessing is shown in Table 2, and the values of TTR (Test ground-Truth Rate) demonstrate its efficiency (TTR is a metric described in Section 5.3).
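A minimal Python sketch of this postprocessing rule, dropping any run of breathing-labelled frames shorter than the threshold of 50 frames:

```python
import numpy as np

def postprocess(labels, min_len=50):
    """Zero out runs of 1s (breathing events) shorter than min_len frames."""
    labels = labels.copy()
    start = None
    for i, v in enumerate(np.append(labels, 0)):   # sentinel 0 at the end
        if v == 1 and start is None:
            start = i                              # run begins
        elif v == 0 and start is not None:
            if i - start < min_len:                # breathing event too short
                labels[start:i] = 0
            start = None                           # run ends
    return labels
```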

5. Experiment

5.1. Training

We train our deep RNN on two workstations with the following configuration: Intel Xeon E5-2620 processor, 32 GB memory, standard AnyFabric 4 GB Ethernet adapter, and an Nvidia GTX Titan GPU with 12 GB memory. We use a PC as the parameter server and the two workstations as workers. Each individual machine sends gradient updates to the centralized parameter repository, which coordinates these updates and sends updated parameters back to the individual machines running the model training. We use the public deep learning library TensorFlow to implement our system.

Four deep learning models are trained in our study. The baseline model is a unidirectional RNN with 4 hidden layers, the last of which is a recurrent layer; its framework is 882-2000-1000-500-100-1, and it is trained without momentum. The training data are described in the previous section, and we take 40 ms as a frame and 50 frames as a group (i.e., 2 s). The other three models are also RNNs but begin with convolutional layers. The detailed model parameters are listed in Table 2; the third line in the table gives the number of hidden layers. The one-dimensional and two-dimensional convolutions have different inputs. For the one-dimensional convolution, the input is a 40 ms frame translated to the frequency domain with 882 dimensions. The input of the two-dimensional convolution is also a 40 ms frame, translated into a spectrogram (4 × 220). One or two convolutional layers are then followed by a mean pooling layer (with a one-dimensional pooling size for the one-dimensional convolution). All models have three fully connected layers, each with 512 units. They end with two unidirectional recurrent layers, except DeepFilter 3, which ends with one bidirectional recurrent layer. All the models are trained through Adagrad with the same initial learning rate.
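For concreteness, the following Keras sketch is in the spirit of DeepFilter 3 (two-dimensional convolution, batch normalization, three fully connected layers, and one bidirectional recurrent layer); the filter counts, kernel sizes, and recurrent width are illustrative assumptions, not the exact values of Table 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_deepfilter(frames_per_sample=50):
    # Each sample is 50 frames, each frame a 4 x 220 spectrogram.
    inp = layers.Input(shape=(frames_per_sample, 4, 220, 1))
    x = layers.TimeDistributed(layers.Conv2D(32, (2, 11), activation="relu"))(inp)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    x = layers.TimeDistributed(layers.AveragePooling2D((1, 2)))(x)  # mean pooling
    x = layers.TimeDistributed(layers.Flatten())(x)
    for _ in range(3):                         # three fully connected layers
        x = layers.TimeDistributed(layers.Dense(512, activation="relu"))(x)
        x = layers.BatchNormalization()(x)
    x = layers.Bidirectional(layers.SimpleRNN(256, return_sequences=True),
                             merge_mode="sum")(x)   # forward + backward units
    out = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adagrad(),
                  loss="binary_crossentropy")
    return model
```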

5.2. Experimental Data

Figure 4 shows the procedure of data collection. In Figure 4(a), the volunteer sits in front of the desk, and four smartphones are placed on the desk at distances of 0.2 m, 0.4 m, 0.6 m, and 0.8 m from the margin of the desk, respectively (the distances from the volunteer's nose to the smartphones are far enough, reaching 0.6 m, 0.85 m, 1.0 m, and 1.2 m, resp.). The farther the distance, the lower the SNR, and the more difficult the labelling of the data. By collecting data in synchronization, the four smartphones share the same labels, while the data of the nearest one are easy to label. In Figure 4(b), the volunteer sits in front of the desk with four smartphones on the desk at a distance of 0.4 m from the margin of the desk, at four different angles (±30° and ±60°). We collect 10 volunteers' breathing data, each including 2-minute tests in the settings of Figures 4(a) and 4(b), respectively, and finally label them.

We find some differences among smartphones from different manufacturers. An amusing discovery concerns the iPhone 4s, which is much worse at collecting sound recordings like breathing, since its built-in microphone has a filter function that removes low-frequency noise before recording; this function was developed to improve the quality of voice conversations. We test smartphones from different manufacturers such as OnePlus, Huawei Honor, Mi, Meizu, Oppo, and Vivo and finally find that 4 Mi smartphones can collect the most intact data. Consequently, we choose 4 Mi smartphones of the same model, to remove unexpected hardware diversity from the experiments.

5.3. Results

Four metrics measure the performance of an algorithm in Table 1: TPR (true positive rate), TNR (true negative rate), WAR (whole accuracy rate), and TTR. There are two classes of samples in our data set: breathing frames (positive samples) and nonbreathing frames (negative samples). TPR is the recognition accuracy on positive samples, TNR is the recognition accuracy on negative samples, and WAR is the recognition accuracy on the whole data set. The breathing frames recognized by the deep learning model are used to compute TTR. Here, TTR is a measure of breathing rate, defined as $\mathrm{TTR} = 1 - |a - b|/b$, where $a$ is the breathing rate calculated after postprocessing and $b$ is the ground-truth breathing rate. Table 1 also lists the TPR and TNR at the four distances, respectively; it shows that the number of positive samples is quadruple the number of negative ones and that the recognition accuracy of negative samples is much higher than that of positive samples.
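A small sketch of the TTR computation and of counting breathing events as runs of positively labelled frames; the function names are illustrative, and the TTR formula follows the definition given above:

```python
import numpy as np

def count_events(labels):
    """Count runs of consecutive breathing-labelled frames."""
    padded = np.diff(np.concatenate(([0], labels, [0])))
    return int((padded == 1).sum())     # number of rising edges

def ttr(detected_rate, true_rate):
    """TTR = 1 - |a - b| / b for detected rate a and ground-truth rate b."""
    return 1.0 - abs(detected_rate - true_rate) / true_rate
```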

We can draw five conclusions from Table 1. First, the deep learning models exhibit advantages in precise recognition in comparison to SVM and LR, and the superiority grows as the SNR of the data decreases. Second, convolution performs well according to the results of our models and the baseline. Two-dimensional convolution is better than one-dimensional convolution, which demonstrates the feature-representation ability of convolution, and the frequency domain (two-dimensional) provides more information than the time domain (one-dimensional). Indeed, the convolutional layers bring the largest improvement in recognition accuracy. Third, the bidirectional recurrent layer is better than the unidirectional one, according to the results of DeepFilter 2 and DeepFilter 3; this means that not only history information but also future information can improve recognition accuracy. In practice, the improvement from the bidirectional recurrent layer is limited and smaller than that from convolution. Fourth, the results of the baseline, DeepFilter 1, DeepFilter 2, and DeepFilter 3 demonstrate that convolution performs well in both the time and frequency domains, and the performance on negative samples (TNR) is much better than that on positive samples in the deep learning models. Finally, DeepFilter 3 obtains the best result in most cases, especially for lower SNR data, and the results from DeepFilter 1 to DeepFilter 3 demonstrate that the convolutional layer and the recurrent layer do not conflict but boost each other in our problem.

In Table 1, there are significant differences between 40 cm and 60 cm with respect to recognition accuracy, since the SNR of breathing recordings decreases exponentially with increasing distance. TTR can directly indicate the performance of fine-grained breathing rate monitoring. As we see, the TTR of DeepFilter 3 is close to that of DeepFilter 2, and the TTR values of the 6 algorithms are lower than the other three metric values. This implies that some breathing events are separated by misclassified frames into parts of fewer than 5 frames each; since such short breathing events are removed by postprocessing, the TTR values fall below the other three metrics.

Table 3 lists four related vital sign monitoring methods: ApneaApp [17], FMBS [10], WiFi-RF [11], and DoppleSleep [13]. All four are contactless sleep monitoring systems that involve breathing frequency detection. The column "Modalities" lists the method each system uses. "RiP-L" and "RiP-U" are the lower and upper bounds of the accuracy of each system as reported in the corresponding papers, and "bpm" is the unit denoting the difference between the rate detected by the system and the actual rate. "RiO-TTR" is the result we reproduce for the four methods with our data on the metric TTR. Different from our work, WiFi-RF and DoppleSleep need extra devices to assist breathing monitoring, while our system only requires an off-the-shelf smartphone. ApneaApp uses a frequency-modulated continuous wave (FMCW) to capture the breathing frequency. We reproduced the method but obtained poor results (as shown in the first line of the last column of Table 3, meaning an accuracy of less than 0.2). The reason is that FMCW needs the device to transmit ultrasound, and the higher the ultrasonic frequency used, the better the results. Most smartphones only support ultrasound below 20 kHz, which cannot provide enough accuracy in practice (only 0.2 accuracy can be achieved in our reproduced scheme, much lower than the 0.98 reported in the paper). FMBS is the only method using voice recordings, and it is the most similar to our work. FMBS uses an earphone to reinforce the user's voice recordings during the night and adopts envelope detection to assist breathing detection. But it is a coarse-grained breathing detector, which only achieves a TTR value as low as 0.53 when run on our voice recordings. So far, there is no other work using the smartphone for fine-grained breathing rate monitoring from voice recordings. WiFi-RF and DoppleSleep use WiFi radio frequency and an extra device to capture breathing, respectively, which may achieve acceptable accuracy but are not suitable for breathing monitoring based on smartphones.

We can see in Table 1 that the TTR of DeepFilter is 0.77, which is better than the 0.53 of FMBS [10]. This means our deep learning algorithm can achieve performance comparable to that of several professional devices.

We also list the recognition accuracies at the four distances of Figure 4(a) and the four angles of Figure 4(b). As shown in Table 1, the accuracy decreases as the distance increases. An interesting phenomenon is that the accuracy declines dramatically at 80 cm, which may indicate a critical point of SNR; we will study it further in future work. The accuracy ratios for different angles are listed in Table 4, which shows that the angle has only a tiny influence on accuracy.

Polysomnography requires precision over 99%, which is the standard in sleep quality monitoring. In practice, DeepFilter can achieve 90% accuracy within 1 meter but falls below 80% at 2 meters. A 2-meter monitoring distance is sufficient for most fine-grained breathing rate monitoring applications, and in most cases the apps can work well within 1 meter of monitoring distance.

5.4. Realistic Test

Figure 5 shows the procedure of the realistic sleep test, and the results are shown in Table 5. The volunteer lies on the bed, the smartphone is placed at the side of the pillow, and the smartphone records the breathing of the volunteer while a camera records the procedure. The test lasts for 7 hours, and we finally collect enough dirty data, including snores, the sound of turning over, and other unknown sounds from outside. We choose three voice clips covering the three poses shown in Figure 5 and obtain relatively clean data (the environment is usually quiet during sleep, so high SNR data are easy to find). The three clips last for 23, 25, and 22 minutes, respectively. We labelled the data by hand and ran our model to validate its effectiveness. The results shown in Table 5 prove that our method achieves fine-grained breathing rate monitoring in realistic sleep. We will conduct more realistic experiments on smartphones and other mobile devices in the near future.

6. Conclusion

In this paper, we apply deep learning as a fine-grained breathing rate monitoring technique on smartphones for people's daily sleep quality monitoring. We propose DeepFilter, a bidirectional RNN stacked with convolutional layers and sped up by batch normalization, to perform this task. The desirable results of the experiments and the realistic sleep test prove that our model achieves professional-level accuracy, and that deep learning-based breathing rate monitoring apps on smartphones are promising. Our work also extends the use of deep learning to low SNR data recognition, which is inspiring for more data-processing applications.

In our future work, we will exploit more data to train DeepFilter, which implies adapting DeepFilter to more smartphones from different manufacturers. More robust postprocessing algorithms should also be developed, and we will try more deep learning models for our problem. Finally, we will deploy the approaches on smartphones and other mobile devices in the near future.

Conflicts of Interest

The authors declare that they have no conflicts of interest.