Abstract

Optical measurement can substantially reduce the required amount of labor and simplify the measurement process. Furthermore, optical measurement can provide full-field results for the target object without affecting its physical properties, such as stiffness, mass, or damping. The advent of consumer-grade depth cameras, such as the Microsoft Kinect, Intel RealSense, and ASUS Xtion, has attracted significant research attention owing to their availability and robustness in sampling depth information. This paper presents a method employing the Kinect V2 sensor and an artificial neural network for vibration frequency measurement. Experiments were conducted to verify the performance of the proposed method. The proposed method predicts the vibration frequency with acceptable accuracy compared to an industrial vibrometer, with the advantages of a contactless process and a simple pipeline.

1. Introduction

Vibration measurement and analysis are important tools for characterizing the physical properties of structures and machinery and for diagnosing their faults. Measurement and analysis results, such as vibration frequency, are important for the predictive maintenance of civil or mechanical structures.

Traditional sensors such as accelerometers, gyroscopes, strain gauges, inclinometers, and global positioning systems (GPS) have been widely used in vibration measurement. However, many conventional vibration measurement methods are both labor intensive and expensive owing to the complex wiring required for power supply and signal transmission, as well as the installation and deployment of sensors. In addition, since these types of sensors are physically attached to the target object, the physical properties of the object, such as stiffness, mass, or damping, may be altered, especially when the target object is relatively small compared to the sensor. Alternative noncontact measurement techniques, such as the laser Doppler vibrometer (LDV) [1–4] and optical methods including optical flow [5, 6], marker tracking [7–10], digital image correlation (DIC) [11–13], and stereovision [11], are also used in practice. The high cost of the equipment and strict requirements on the target surface limit the use of LDV. In contrast, optical methods for vibration frequency measurement have yielded promising results in laboratory and field experiments, providing data in both the temporal and spatial domains. However, optical methods require complicated image and signal analysis algorithms [14, 15] to obtain the vibration frequency, and lighting conditions are also critical in the measurement [8, 16].

The development of depth sensors has opened new opportunities for researchers to use depth information to give devices the capability to observe and detect real-world targets beyond human recognition; examples include high-accuracy object recognition and tracking [17], SLAM applications [18–20], high-security face recognition [21, 22], augmented reality [23], human posture recognition, and telemedicine [24–26]. In recent years, low-cost consumer-level depth sensing devices such as the Intel RealSense and Microsoft Kinect have received significant research attention, extending the range of applications of depth sensors. Vibration measurement is one such application [27, 28], and depth sensors are expected to play an increasingly important role in future kinematic measurement systems.

In this work, we propose a method that combines depth information acquired from the Microsoft Kinect v2 with an artificial neural network to predict the vibration frequency of a target; it is a further development of the method in [29]. This approach has the following advantages: (a) it is contactless and markerless; (b) preprocessing such as denoising or other signal enhancement is not required; (c) the vibration signal does not need to be extracted first, that is, the method predicts the vibration frequency directly, which keeps the pipeline simple and easy to deploy in real applications; and (d) the models are trained using purely artificial data, making the method scalable. Experiments were conducted to evaluate the validity and accuracy of the proposed method. The results obtained from the proposed method were compared with those from a conventional contact-type industrial vibrometer.

2. Materials and Methods

This section describes the pipeline implementation of the proposed method. The neural network devised for vibration frequency prediction using depth information acquired from Kinect V2, as well as its corresponding training procedure and preparation of dataset for the network, is introduced. The experimental method is presented at the end of the section.

2.1. Pipeline

First, we read out the metadata recorded by the Kinect v2 and decoded the data as depth information. The ROI for the measurement was selected from the first frame of the reconstructed depth image, as shown in Figure 1(a). Next, we read in the depth image sequence within the ROI from every frame (Figure 1(b)) and extracted the depth value of every pixel within the ROI separately along the time dimension to obtain W × H depth signals, as shown in Figure 1(c), where W and H are the width and height of the ROI (in pixels), respectively. Then, the depth signal of every pixel was fed directly into the trained network to determine the predicted vibration frequency of each pixel, as shown in Figure 1(d). A histogram of the predicted frequency distribution is plotted in Figure 1(e) to quantitatively evaluate the predicted vibration frequencies; it can be employed to obtain the overall prediction of the vibration frequency for the target ROI. In addition, we mapped the prediction result of each pixel back to its original spatial location in the ROI, as shown in Figure 1(f), for better visualization and verification.
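The per-pixel extraction and histogram step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(top, left, height, width)` ROI layout, and the `predict` callable (a stand-in for the trained network) are all assumptions, and the overall ROI frequency is taken here as the most frequent per-pixel prediction.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[Sequence[float]]   # one depth image, indexed as frame[row][col]
Roi = Tuple[int, int, int, int]     # hypothetical layout: (top, left, height, width)
Pixel = Tuple[int, int]


def pixel_time_series(frames: Sequence[Frame], roi: Roi) -> Dict[Pixel, List[float]]:
    """Collect the depth value of every ROI pixel along the time dimension."""
    top, left, height, width = roi
    series: Dict[Pixel, List[float]] = {}
    for r in range(top, top + height):
        for c in range(left, left + width):
            series[(r, c)] = [frame[r][c] for frame in frames]
    return series


def predict_roi_frequency(
    frames: Sequence[Frame],
    roi: Roi,
    predict: Callable[[List[float]], float],
):
    """Predict one frequency per pixel, then take the histogram mode for the ROI."""
    per_pixel = {px: predict(sig) for px, sig in pixel_time_series(frames, roi).items()}
    histogram = Counter(per_pixel.values())
    overall, _count = histogram.most_common(1)[0]
    return per_pixel, histogram, overall
```

The `per_pixel` dictionary keeps the spatial location of each prediction, so it can be drawn back onto the ROI for the kind of visualization shown in Figure 1(f).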

2.2. Network Architecture

In this work, we proposed an artificial neural network designed for vibration frequency prediction utilizing depth information. The input of the proposed network was a 1-dimensional vector with a length of 300, which is the product of the sampling rate of the Kinect depth sensor (30 Hz) and the input data duration (10 s). The network was constructed using eight 1-dimensional convolution layers, and every convolution layer was followed by a batch-normalization (BN) [30] layer and a rectified linear unit (ReLU) [31]. We mapped features to a higher dimension as encoding and compressed them back to their original dimension as decoding. As shown in Figure 2, the number of filters in each convolution layer increased from 2 to 16 and decreased from 16 to 2 with a step of 2 such that the number of features was kept symmetrical for all convolution layers. The kernel size was 11 for all convolutional layers, and we zero-padded both sides of the input by 5 points before every convolution layer so that the length was kept at 300. Fully connected layers were employed to capture activations from different parts of the input to 150 outputs, which is the number of possible frequency classes. The dropout [32] layer was followed by the first fully connected layer to prevent overfitting, and ReLU was also used as the activation function in all fully connected layers. The output of the fully connected layers was passed through the softmax layer, yielding probabilities associated with each possible frequency class. Skip connections [33, 34] were used in the proposed network architecture to speed up the optimization in the training stage and improve network performance.
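The length-preserving effect of the padding can be checked with a short calculation. The sketch below uses the standard output-length formula for a stride-1 convolution; the filter schedule `[2, 4, 8, 16, 16, 8, 4, 2]` is only an assumed reading of "from 2 to 16 and back to 2" for eight layers, and the function names are illustrative.

```python
def conv1d_output_length(length: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Output length of a 1-D convolution with symmetric zero padding."""
    return (length + 2 * padding - kernel_size) // stride + 1


INPUT_LENGTH = 30 * 10   # 30 Hz sampling rate x 10 s window = 300 samples
KERNEL_SIZE = 11
PADDING = 5

# Assumption: filter counts double up to 16 and then halve back to 2.
ASSUMED_FILTERS = [2, 4, 8, 16, 16, 8, 4, 2]


def layer_shapes(length=INPUT_LENGTH, filters=ASSUMED_FILTERS):
    """Return (channels, length) after each convolution layer."""
    shapes = []
    for channels in filters:
        length = conv1d_output_length(length, KERNEL_SIZE, PADDING)
        shapes.append((channels, length))
    return shapes
```

With a kernel of 11 and 5 zeros on each side, 300 + 2·5 − 11 + 1 = 300, so every layer keeps the sequence length regardless of the filter count.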

2.3. Dataset

The proposed network was trained using a large number of simulated depth signals that imitate real-world depth variation signals at a specific vibration frequency. The simulated signals were produced by a generator algorithm consisting of two parts, as shown in Figure 3: (1) generating a standard sine wave with a specific frequency and (2) adding Gaussian-distributed noise with varying standard deviation. The frequency of the simulated signals was in the range of 0–15 Hz with a step of 0.1 Hz, while the standard deviation of the added noise was randomly selected in the range of 0.5–2.2. For each frequency step, we generated 12,000 simulated signals with a length of 10 s at a sampling rate of 30 Hz; randomly selected examples of generated signals with frequencies of 3.6 Hz, 6.6 Hz, 9.6 Hz, and 13.6 Hz are plotted in Figure 4. The resulting dataset of 1,800,000 instances was used for training.
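A minimal sketch of such a generator is given below. It assumes a unit-amplitude, zero-phase sine wave and a uniform draw for the noise standard deviation; the paper states only the frequency range, the noise range, and the 10 s / 30 Hz window, so the amplitude, phase, sampling distribution, and function names are assumptions.

```python
import math
import random
from typing import List, Optional


def simulate_depth_signal(
    frequency_hz: float,
    noise_std: float,
    duration_s: float = 10.0,
    sampling_rate_hz: float = 30.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sine wave at frequency_hz plus zero-mean Gaussian noise (assumed unit amplitude)."""
    if rng is None:
        rng = random.Random()
    n_samples = int(round(duration_s * sampling_rate_hz))  # 300 for 10 s at 30 Hz
    return [
        math.sin(2.0 * math.pi * frequency_hz * i / sampling_rate_hz)
        + rng.gauss(0.0, noise_std)
        for i in range(n_samples)
    ]


def draw_noise_std(rng: random.Random, low: float = 0.5, high: float = 2.2) -> float:
    """Pick a noise standard deviation in the stated range (uniform draw is an assumption)."""
    return rng.uniform(low, high)
```

Passing a seeded `random.Random` makes the generated dataset reproducible, which is convenient when comparing training runs.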

2.4. Training

We trained the network from scratch using the negative log likelihood loss and the Adam optimizer [35], where the parameters β1 and β2 were set to 0.9 and 0.999, respectively. The input of the network was the generated simulation signals, while the ground truth was the corresponding vibration frequencies. The order of the generated simulation signals was shuffled before they were fed into the network. The learning rate was set to a fixed value. The training was implemented in the deep learning framework PyTorch [36] on a laptop with an Nvidia GTX 1060 GPU and usually yielded a good model within 24 h.
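For a single training signal, the loss described above can be written out as follows. The symbols are illustrative rather than taken from the paper: $z_k$ denotes the network output for frequency class $k$ before the softmax, $p_k$ the resulting class probability, and $y$ the index of the true frequency class among the 150 classes.

```latex
p_k = \frac{\exp(z_k)}{\sum_{j=1}^{150} \exp(z_j)}, \qquad
\mathcal{L} = -\log p_y
```

The loss is zero only when the network assigns probability 1 to the true class, and it grows without bound as $p_y$ approaches 0.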

2.5. Experimental Method

Three experiments were conducted to evaluate the performance of the proposed method: a verification test, a steel cantilever beam measurement, and a simply supported carbon plate measurement. All three were conducted under controlled laboratory conditions. The measurement targets were recorded using the Kinect v2. In all experiments except the verification test, a DongHua DH5906 industrial vibrometer was used as the reference for comparison. The sampling rate of the depth sensor was fixed at 30 Hz, and 10 s of metadata were recorded in each experiment. The distance between the measurement target and the Kinect v2 was about 50 cm in all experiments. In addition, the results of the proposed method are compared in Section 3.4 with FFT peak picking results obtained from the raw depth information of the Kinect v2.

2.5.1. Verification Test

To verify the performance of the proposed method, a verification test was conducted using controlled excitation in a vibration test system comprising an exciter (MB Dynamics MODAL 50), an arbitrary waveform generator (RIGOL DG1022), and an amplifier (MB Dynamics SL500VCF). A modular steel structure was used as the measurement target and was excited using precisely controlled vibration signals at different frequencies. The experimental setup is shown in Figure 5(a), and the Kinect field of view (FOV) of the RGB sensor and the visualization of the raw depth information in the ROI are shown in Figures 5(b) and 5(c), respectively. Sine signals were generated every 2.5 Hz between 0 and 15 Hz and passed through the amplifier to the exciter with minimum gain to excite the modular steel structure, and data were simultaneously recorded using the Kinect. A part of the left column of the steel structure was selected as the ROI; for better illustration, three views of the ROI are shown in Figure 6. Then, the proposed method was applied to the data recorded by the Kinect, and the result was compared with the controlled excitation frequency.

2.5.2. Cantilever Steel Beam

A cantilever steel beam experiment was conducted to investigate the performance of the proposed method on vibration produced by real excitation. A steel beam was fixed at one end, and a vibrometer was attached at the free end, as shown in Figures 7(a) and 8(a); the beam was struck at the free end using an impact hammer. The resultant vibration of the steel beam was recorded using the vibrometer and Kinect simultaneously. The sampling rate of the vibrometer was set to 30 Hz, consistent with the Kinect depth sensor. Two experimental cases were designed to examine the possibility of using the distance variation between the depth sensor and the test object and the depth variation signal at the edge of the test object. In case 1, the Kinect sensor was pointed in a direction parallel to the direction of vibration, while in case 2, the Kinect sensor was pointed in a direction perpendicular to the direction of vibration. In both experimental cases, the proposed method was applied to the data recorded by the Kinect, and the acceleration signals from the vibrometer were transformed to the frequency domain via the fast Fourier transform (FFT). The vibration frequency components were examined in the frequency domain and then compared with the result from the proposed method.

Experimental Case 1. The experimental setup for case 1 is shown in Figure 7(a). The FOVs of the RGB sensor and ROI raw depth information visualization are shown in Figures 7(b) and 7(c), respectively. Three views of the ROI in this experimental case are shown in Figure 9.

Experimental Case 2. The experimental setup for case 2 is shown in Figure 8(a). The FOV of the RGB sensor and ROI raw depth information visualization are shown in Figures 8(b) and 8(c), respectively. Three views of the ROI for measurement in this experimental case are shown in Figure 10.

2.5.3. Simply Supported Carbon Plate

A simply supported carbon plate experiment was also conducted for the real excitation scenario. The carbon plate was supported at both ends, while the vibrometer was placed at the midpoint of the plate, as shown in Figures 11(a) and 12(a). We struck the plate at the left support point using an impact hammer as excitation. The resulting vibrations were recorded using the vibrometer and Kinect simultaneously. The sampling rate of the vibrometer was set to 30 Hz. Two experimental cases were again designed to examine the possibility of using the distance variation between the depth sensor and the test object and the depth variation signals at the edge of the test object. In case 1, the Kinect sensor was pointed in a direction parallel to the direction of vibration, while in case 2, the Kinect sensor was pointed in a direction perpendicular to the direction of vibration. The acceleration signals from the vibrometer were transformed to the frequency domain using FFT and were compared with the results from the proposed method.

Experimental Case 1. The experimental setup for case 1 is shown in Figure 11(a). The FOV of the RGB sensor and ROI raw depth information visualization are shown in Figures 11(b) and 11(c), respectively. Three views of the ROI for measurement in this experimental case are shown in Figure 13.

Experimental Case 2. The experimental setup for case 2 is shown in Figure 12(a). The FOV of the RGB sensor and ROI raw depth information visualization are shown in Figures 12(b) and 12(c), respectively. Three views of the ROI for measurement in this experimental case are shown in Figure 14.

3. Results

3.1. Verification Test

We selected an excitation frequency of 5 Hz as an example; the histogram of the predicted frequency distribution is plotted in Figure 15(a), the predicted frequency distribution over the spatial dimension is visualized in Figure 15(b), and the pixels with a predicted value of 5 Hz are highlighted in Figure 15(c). The predicted frequency distribution histograms for the remaining four excitation frequencies of 2.5 Hz, 7.5 Hz, 10.0 Hz, and 12.5 Hz are plotted in Figure 16. The results for all excitation frequencies are summarized in Table 1.

The results of the verification test indicate that the proposed method can accurately predict the vibration frequency using the Kinect depth data.

3.2. Cantilever Steel Beam

Histograms of the predicted frequency distributions for case 1 and case 2 are plotted in Figures 17(a) and 18(a), respectively. The results are visualized at their original spatial positions in the ROI in Figures 17(b) and 18(b), respectively, while the pixels predicted as 9.4 Hz and 9.5 Hz are highlighted in Figures 17(c) and 18(c), respectively. The raw vibration signals of the cantilever steel beam recorded by the vibrometer were first normalized and plotted as time histories in Figures 19(a) and 20(a); the normalized power spectral densities (PSDs) obtained from FFT are plotted in Figures 19(b) and 20(b), respectively, so that their peak frequency components can be compared with the predicted results.

In both experimental cases, the proposed method successfully predicted the vibration frequency and closely matched the results from the vibrometer: 9.4 Hz (proposed method) versus 9.5 Hz (vibrometer) in case 1, and 9.5 Hz for both in case 2. In addition, in case 1 the pixels with the correct prediction came from almost the entire ROI, while in case 2 the pixels with the correct prediction were distributed only around the edges of the vibrometer and the steel beam.

3.2.1. Experimental Case 1

The cantilever steel beam experimental case 1 results from the proposed method and vibrometer are shown in Figures 17 and 18, respectively.

3.2.2. Experimental Case 2

The cantilever steel beam experimental case 2 results from the proposed method and vibrometer are shown in Figures 19 and 20, respectively.

3.3. Simply Supported Carbon Plate

The prediction result histograms of the simply supported carbon plate experimental case 1 and case 2 are shown in Figures 21(a) and 22(a), respectively, and the corresponding result visualizations are shown in Figures 21(b) and 22(b), respectively. Pixels predicted as 6.8 Hz are highlighted in Figures 21(c) and 22(c) for the two cases. Figures 23 and 24 show the results from the contact vibrometer; the normalized time histories of the acceleration signal for case 1 and case 2 are plotted in Figures 23(a) and 24(a), respectively, while the corresponding PSDs obtained by using FFT with peak picking are shown in Figures 23(b) and 24(b), respectively.

The results indicate that the proposed method can effectively predict the vibration frequency of the target object ROI. Furthermore, in case 1 almost all pixels in the ROI were predicted as 6.8 Hz, while in case 2 only the pixels around the edges of the test object and the vibrometer were predicted as 6.8 Hz.

3.3.1. Experimental Case 1

The simply supported carbon plate experimental case 1 results from the proposed method and vibrometer are shown in Figures 21 and 22, respectively.

3.3.2. Experimental Case 2

The simply supported carbon plate experimental case 2 results from the proposed method and vibrometer are shown in Figures 23 and 24, respectively.

3.4. FFT Peak Picking Comparison

To compare the proposed method with FFT peak picking on the raw depth information from the Kinect, the raw distance signal of each pixel was transformed to the frequency domain via FFT, and the peak of the spectrum was taken as the frequency prediction for that pixel. A histogram of the frequencies predicted via FFT was again used to quantitatively evaluate the predictions within the ROI, and the prediction with the highest count was taken as the overall vibration frequency for the target ROI.
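Per-pixel peak picking of this kind can be sketched as follows. To stay dependency-free, the sketch evaluates a direct discrete Fourier transform instead of an FFT routine; both give the same spectrum, only more slowly. Subtracting the mean and skipping the zero-frequency bin are assumptions made here, since the paper does not state how the constant depth offset was handled, and the function name is illustrative.

```python
import cmath
import math
from typing import Sequence


def dft_peak_frequency(signal: Sequence[float], sampling_rate_hz: float = 30.0) -> float:
    """Return the frequency (Hz) of the strongest nonzero-frequency DFT bin."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # assumption: remove the constant depth offset
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):         # bins up to the Nyquist frequency
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sampling_rate_hz / n  # bin spacing is sampling_rate / n
```

For a 300-sample window at 30 Hz the bins are 0.1 Hz apart, so a clean 6.8 Hz sine lands exactly on bin 68. Applying the function to every pixel signal and counting the results gives the histogram from which the overall ROI frequency is read.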

The comparison between the proposed method and FFT peak picking is summarized in Table 2. The simply supported carbon plate experimental case 1 is used as a representative example: the time history of the raw distance from the Kinect depth information of each pixel is shown in Figure 25(a), and the corresponding frequency-domain result obtained by FFT is shown in Figure 25(b). The histogram of the vibration frequency results is shown in Figure 26(a), the corresponding result visualization is shown in Figure 26(b), and the pixels predicted as 6.8 Hz via the FFT peak picking method are highlighted in Figure 26(c).

The frequency-domain results obtained via FFT and the pixel-level frequency predictions obtained by peak picking are dominated by trivial, noisy low-frequency components, as shown in Figures 25(b) and 26(a). Some pixels are still correctly predicted as 6.8 Hz by the FFT peak picking method, as shown in Figure 26(c). However, compared with the result of the proposed method shown in Figure 21(c), the proposed method is superior: almost all pixels within the ROI are predicted correctly, whereas the FFT peak picking method predicts correctly only for pixels in the top left area of the ROI. The prediction distribution histograms of the proposed method and the FFT peak picking method, shown in Figures 21(a) and 26(a), respectively, confirm this quantitatively.

4. Discussion

The results of the experiments conducted using different excitation sources and different test objects demonstrate the performance of the proposed method. The proposed method can utilize meta depth information acquired from Kinect V2 to predict the vibration frequency of a target ROI with minor errors. A significant finding of this study is that when the Kinect was pointed in a direction parallel to the vibration direction, the depth variation signals used by the proposed method came from the distance variation between the test object and the depth sensor of the Kinect, as demonstrated in the steel beam experimental case 1 and carbon plate experimental case 1. When the Kinect was pointed in a direction perpendicular to the vibration direction, the proposed method could still provide usable results; in this case, it used depth variation signals at the edge of the test object rather than distance variations between the depth sensor and the test object itself, as demonstrated in the steel beam experimental case 2 and carbon plate experimental case 2. These findings support the reliability and applicability of the proposed method for vibration frequency measurement.

Unlike traditional optical-based noncontact vibration measurements, we used the Kinect v2 and a feed-forward CNN to measure the vibration frequency directly, without additional signal processing or image processing algorithms. Furthermore, the proposed method is fast and easy to deploy in applications, as it requires neither the explicit extraction of vibration signals nor a separate denoising step for the noisy meta depth signals before they enter the proposed artificial neural network. In addition, the proposed network is trained entirely using simulation signals, which indicates that it can be easily scaled to a larger measurement range and higher measurement precision.

This method also has some drawbacks and limitations. Interference from sunlight can occur because the depth sensor of the Kinect V2 is based on infrared technology; therefore, the proposed method is limited to indoor applications. Other inherent drawbacks of the Kinect depth sensor are that the measurement distance range is restricted to 0.4–4.5 m and that the frequency measurement range is limited to within 15 Hz, since the sampling rate of the Kinect depth sensor is fixed at 30 Hz. Furthermore, the measurement precision of the proposed neural network is controlled by the network configuration and the dataset it was trained with, and the proposed network can only detect the resonant or peak frequency, while other frequency components go undetected. Like all traditional optical-based methods, the proposed method is also vulnerable to camera shake, variation in lighting conditions, and other types of electrical or mechanical noise.
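The 15 Hz ceiling follows from the Nyquist limit for the fixed sampling rate. Written out with illustrative symbols ($f_s$ for the sampling rate, $T$ for the recording length, $\Delta f$ for the spacing of FFT bins over one recording):

```latex
f_{\max} = \frac{f_s}{2} = \frac{30\ \text{Hz}}{2} = 15\ \text{Hz}, \qquad
\Delta f = \frac{1}{T} = \frac{1}{10\ \text{s}} = 0.1\ \text{Hz}
```

Vibration components above $f_{\max}$ would alias into the 0–15 Hz band rather than appear at their true frequency, so a faster depth sensor would be needed to measure them.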

The proposed method offers a new possibility for future research on optical-based vibration measurement. We can further utilize the feature extraction capability of deep neural network for optical-based vibration signal extraction and processing. A more advanced network architecture specifically designed for vibration measurement could be used in future work. The dataset of the signal generator algorithm can also be improved and can be incorporated and augmented with real sampled depth signals for better generalization.

5. Conclusions

In this paper, we proposed a method for vibration frequency measurement using the Kinect V2 and an artificial neural network. Experiments were conducted to evaluate the performance of the proposed method, and the results show that the proposed method can provide good vibration frequency measurement results compared to those from an industrial vibrometer.

This method is limited by the inherent drawbacks of the Kinect depth sensor and the architecture of the proposed network, as it cannot detect all the frequency components of the measurement target. In the future, we will redesign and further improve the network architecture, dataset preparation process, and workflow of the proposed method to address these limitations.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

We thank Professor Zhiwei Chen and Professor Ying Lei from the Department of Civil Engineering, Xiamen University, for their support of the experiments. This work was supported by the National Natural Science Foundation of China (11372074).