#### Abstract

Image-based measurement has received increasing attention because it can substantially reduce the cost of labor, measurement equipment, and installation. Instead of using optical flow, pattern tracking, or marker tracking to extract a displacement signal, this study proposes a novel noncontact machine learning-based system that directly predicts vibration frequency with high accuracy and good reliability from image sequences acquired by a single camera. The performance of the proposed method was demonstrated through experiments conducted in a laboratory and under real-field conditions, and the results were compared with those obtained using a contact sensor. The comparison with an industry-level vibration sensor in the frequency domain shows that the proposed method can predict the vibration frequency of the target object as accurately as the sensor, even under uncontrollable real-field conditions and without additional enhancement or extra signal processing techniques. However, only the principal vibration frequency of a measurement target is predicted, and the measurement range is limited by the trained model. If these limitations are resolved, the simple deployment and concise pipeline of this method make it a candidate for real engineering applications in mechanical or civil structural health monitoring.

#### 1. Introduction

Vibration signals contain key information for fault diagnosis and health condition monitoring in mechanical or civil structures, such as high-rise buildings, long-span bridges, transmission towers, electricity pylons, and high cranes. Because vibration signals provide critical information about structural dynamics, such as the mass properties, stiffness, and their corresponding distributions, their measurement and analysis are important. Modal parameters, the most important being the resonant and natural frequencies, obtained from long- or short-term monitoring can be used in design validation, early fault warning, and preventive maintenance of mechanical or civil structures.

Generally, vibration measurement techniques can be categorized as contact and noncontact methods. Contact methods use conventional sensors, such as accelerometers, strain gauges, velocity transducers, and global positioning systems (GPS) [1, 2]. Deploying contact sensors is labor-intensive and time-consuming, especially on large-scale or high-rise structures, where installing the sensors and wiring the power or data transmission cables is complicated and impractically expensive. Noncontact methods use different types of electromagnetic radiation to obtain vibration signals. For example, laser Doppler vibrometers (LDV) [3–6] use lasers, interferometry techniques [7, 8] use microwaves, and image-based methods [9–25] use cameras to capture visible light.

Image-based measurement methods have developed rapidly in recent decades with advances in both image sensor performance and image processing algorithms. Such methods are also called vision-based, photometry, or computer vision measurement methods and are mainly divided into three categories based on the type of measurement target: digital image correlation (DIC), target tracking, and nontarget methods. DIC utilizes the brightness variations of specific speckle patterns [26–31]; the patterns must be printed or projected on the target object prior to the measurement. The displacement of the region of interest is calculated from the correlation between the current and reference frame patterns. Target-tracking techniques require an optical target, such as an LED light or a high-contrast or highly reflective marker, mounted or printed on the measurement target [22, 32, 33]. The targets are found and tracked using an image processing algorithm to obtain a vibration signal. Nontarget methods require no optical target on the measurement target; instead, edge detection, feature detection, blob detection, template matching, and other computer vision techniques are employed to capture the natural features of the measurement target surface. Then, image processing and signal analysis algorithms are used to obtain vibration signals or modal parameters.

Recently, Liu and Yang [34] proposed an image-based nontarget vibration frequency measurement method that uses machine learning instead of image processing techniques that explicitly extract vibration signals. This method predicts vibration frequencies directly. However, it is not robust enough under noisy real-world conditions, and additional edge enhancement must be applied to obtain a good result. Following the concept of adopting machine learning in image-based measurement, this paper presents a new idea of using a confidence kernel with a redesigned neural network to achieve robust vibration frequency prediction. The proposed method has several advantages over the method by Liu and Yang [34] as well as other image-based methods. Compared with [34], it produces almost no trivial results at the pixel level, and the output result map is clearer and more accurate. In particular, neither the histogram of the predicted frequency distribution nor additional edge enhancement is required. Moreover, the redesigned network and dataset are more robust to noise and converge much faster. Compared with other image-based methods, the proposed method provides pixel-level vibration frequencies without explicitly extracting the vibration signal, and no additional image or signal processing algorithm is required. These advantages (e.g., nontarget, noncontact, low cost, concise algorithm, and easy deployment) could help popularize the method in real-field engineering applications.

#### 2. Methods

This section presents the proposed method and the experimental setup used to validate it. First, the artificial neural network and its confidence kernel are introduced, and the kernel’s algorithm is described in detail. Next, the dataset preparation and training procedure are described. Lastly, the pipeline of the proposed method and the experimental setup are presented.

##### 2.1. Neural Network Architecture

Here, we propose a neural network with a confidence kernel, carefully devised for image-based vibration frequency measurement. Figure 1 depicts the overall architecture of the proposed network, the confidence kernel, and the computational relationship between them. The main task of the proposed network is to classify each input signal into one of a set of classes, each of which corresponds to an exact predicted frequency value. The proposed network consists of sets of 1D convolution layers, and every convolution layer is followed by batch normalization (BN) [35] and a rectified linear unit (ReLU) [36]. Skip connections [37, 38] are expected to speed up training and improve performance and are well suited to a symmetrically designed network. We therefore added skip connections to our network, represented by the red “+” symbol in Figure 1(a). Two fully connected layers map all input neural nodes to a number of outputs equal to the measurement frequency range (MFR), which is the product of the frequency range and the reciprocal of the measurement frequency precision step. The fully connected layers serve as high-level reasoning in the neural network, correlating the activations from different parts of the signal. Because overfitting occurs easily in dense layers such as fully connected layers, dropout [39] was applied between these two layers to randomly set 50% of the neural nodes to zero. The MFR outputs of the last fully connected layer are passed to a softmax function, yielding the vibration frequency prediction. The pixel-level vibration frequency predictions were then placed back at their spatial positions in the original image to generate the initial unconfident result map. Finally, the confidence kernel was applied to the unconfident result map to obtain the final result map, as shown in Figure 1(b). Section 2.2 details the confidence kernel.

The input of our neural network is a 1D vector of size $N$, where $N$ is determined by the sampling rate $f_s$ and the duration $T$ of the original input video clip: $N = f_s \times T$. The input was filtered by eight 1D convolution layers with kernels of size 11. The number of filters gradually increased from 2 to 8 and then decreased from 8 to 2, as shown in Figure 1(a). In addition, we zero-padded the input of every 1D convolution by 5 samples on both sides to retain the length of the input signal.
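The architecture described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter schedule `[2, 4, 6, 8, 8, 6, 4, 2]`, the pairing of skip connections between mirrored layers, the hidden width of the first fully connected layer, and the class name `FrequencyNet` are all hypothetical choices that merely satisfy the constraints given in the text (eight layers, kernel size 11, padding 5, filters rising from 2 to 8 and falling back to 2, 50% dropout between two fully connected layers). The sketch returns log-probabilities so that it pairs with a negative log-likelihood loss.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    # 1D convolution (kernel 11, zero padding 5 keeps the length), then BN and ReLU.
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=11, padding=5),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
    )


class FrequencyNet(nn.Module):
    """Hypothetical sketch of a symmetric 1D CNN frequency classifier."""

    def __init__(self, n_samples=1000, mfr=500, hidden=256):
        super().__init__()
        widths = [2, 4, 6, 8, 8, 6, 4, 2]   # assumed filter schedule (2 -> 8 -> 2)
        chans = [1] + widths
        self.blocks = nn.ModuleList(
            [conv_block(chans[i], chans[i + 1]) for i in range(8)]
        )
        self.fc1 = nn.Linear(widths[-1] * n_samples, hidden)  # hidden size is assumed
        self.drop = nn.Dropout(p=0.5)                          # 50% dropout, as in the text
        self.fc2 = nn.Linear(hidden, mfr)                      # one output per frequency class

    def forward(self, x):
        # x: (batch, 1, n_samples) brightness signals normalized to [-1, 1]
        saved = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < 4:
                saved.append(x)          # keep first-half activations
            else:
                x = x + saved[7 - i]     # assumed skip connection to the mirrored layer
        x = torch.relu(self.fc1(x.flatten(1)))
        return torch.log_softmax(self.fc2(self.drop(x)), dim=1)


model = FrequencyNet().eval()
with torch.no_grad():
    log_probs = model(torch.zeros(3, 1, 1000))   # 10 s at 100 Hz gives N = 1000
predicted_hz = log_probs.argmax(dim=1) * 0.1     # class index times the 0.1 Hz step
```

With the defaults, a batch of three signals of length 1000 yields a `(3, 500)` tensor of log-probabilities, one column per 0.1 Hz class.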

##### 2.2. Confidence Kernel

The method of Liu and Yang [34] produced trivial pixel-level results across the whole output result map. These trivial results are random frequency predictions that are incorrect and irrelevant. Under complex measurement conditions, an additional edge enhancement algorithm had to be applied to obtain a reliable measurement result. In the current study, a confidence kernel was introduced and combined with the redesigned neural network to solve this problem.

As illustrated in Figure 2, the confidence kernel is a small square window whose side length is the confidence kernel size (CKS) and which is shifted by a stride over the whole input. For each kernel position, the $\mathrm{CKS}^2$ kernel elements were collected from the unconfident result map, and the number of occurrences of each result value $i$ in the kernel was counted as $n_i$. This study introduces the concept of a confidence value (CV), which for each result value is expressed as

$$\mathrm{CV}_i = \frac{n_i}{\mathrm{CKS}^2}, \tag{1}$$

where $\mathrm{CV}_i$ is compared with a tunable parameter, the confidence threshold (CT). If $\mathrm{CV}_i$ is larger than CT, result value $i$ is taken as the confidence kernel output; otherwise, the output is set to 0.

To demonstrate the decision-making process of the confidence kernel, consider the example in Figure 2. Here, the CKS and stride were set to 3 and 1, respectively, and the confidence threshold CT was set to 0.5. The first four confidence kernels are marked in different colors in Figure 2. The first confidence kernel, represented by the blue square, contains five distinct result values (5, 13, 16, 28, and 46 Hz) with 1, 2, 4, 1, and 1 occurrences, respectively; hence, the corresponding confidence values are $1/3^2$, $2/3^2$, $4/3^2$, $1/3^2$, and $1/3^2$, that is, 0.11, 0.22, 0.44, 0.11, and 0.11. The result of 16 Hz has the highest confidence value in the blue kernel, but this value is still smaller than CT. Thus, the results in the blue kernel are regarded as trivial, and the output of this kernel is set to 0. The kernel is then shifted by the stride of 1 pixel to obtain the second confidence kernel (marked in red). This kernel contains three distinct result values, namely, 5, 13, and 16 Hz, with confidence values of 0.33, 0.11, and 0.55, respectively. The highest confidence value, 0.55, is larger than CT. Therefore, the red kernel is regarded as reliable, and its output is 16 Hz. The same decision-making rule was applied to the third and fourth highlighted kernels in Figure 2, yielding outputs of 16 and 0 Hz, respectively. All subsequent confidence kernels follow the same rule. Each kernel output is written to the corresponding center position in the final result map, as shown in Figure 1(b).
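The decision rule in this example can be sketched in a few lines of NumPy. This is an illustrative reading of the description, with some assumptions: the function name `apply_confidence_kernel` is hypothetical, border pixels that are never the center of a kernel are simply left at 0, and only the most frequent value in each window is tested against CT (sufficient whenever CT is at least 0.5). The two 3 × 3 arrays reproduce the value counts of the blue and red kernels; their spatial arrangement is made up.

```python
from collections import Counter

import numpy as np


def apply_confidence_kernel(result_map, cks=3, stride=1, ct=0.5):
    """Keep a value only where it dominates its CKS x CKS neighborhood."""
    result_map = np.asarray(result_map)
    rows, cols = result_map.shape
    out = np.zeros_like(result_map)
    half = cks // 2
    for top in range(0, rows - cks + 1, stride):
        for left in range(0, cols - cks + 1, stride):
            window = result_map[top:top + cks, left:left + cks]
            # Most frequent result value in the window and its occurrence count.
            value, count = Counter(window.ravel().tolist()).most_common(1)[0]
            if count / cks**2 > ct:                  # confidence value vs. threshold
                out[top + half, left + half] = value  # write to the center position
    return out


# Same value counts as the blue kernel: 16 Hz occurs 4 times, 4/9 = 0.44 < 0.5.
blue = np.array([[5, 13, 13],
                 [16, 16, 16],
                 [16, 28, 46]])
# Same value counts as the red kernel: 16 Hz occurs 5 times, 5/9 = 0.56 > 0.5.
red = np.array([[5, 5, 5],
                [13, 16, 16],
                [16, 16, 16]])

blue_out = apply_confidence_kernel(blue)[1, 1]
red_out = apply_confidence_kernel(red)[1, 1]
```

Here `blue_out` is 0 and `red_out` is 16, matching the outputs stated for the first two kernels.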

##### 2.3. Dataset and Training

The proposed neural network was trained with simulated data. All the data used in the training and testing stages were generated by an algorithm using a random process. To imitate real-world vibration signals, we generated signals with a preset frequency and added random noise, as illustrated in Figure 3. The noise was sampled from a zero-mean Gaussian distribution with a standard deviation varying randomly between 0.5 and 2.2. The mathematical model of the simulated signals is defined in Equation (2), where $t$ is the discrete time value increasing from zero to the end of the signal duration in even steps of $1/f_s$; $f_s$ is the signal sampling rate; $f$ is the specific frequency of the simulated signal; and $A$ is the signal amplitude. Furthermore, GN is the Gaussian noise randomly sampled from a Gaussian distribution whose probability density function is given in Equation (3), where GN represents the noise amplitude and $\sigma$ is the standard deviation. All the simulated signals were normalized to the range −1 to 1 before training or testing the neural network.
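As a hedged reading of the definitions above, one form consistent with them is a pure sinusoid plus additive zero-mean Gaussian noise, with $t$ the discrete time, $f$ the preset frequency, $A$ the amplitude, and $\sigma$ the noise standard deviation. The choice of a sine with zero phase is an assumption made here for illustration; the density is the standard zero-mean Gaussian.

```latex
\begin{aligned}
x(t) &= A \sin(2\pi f t) + \mathrm{GN}, \qquad t = 0, \tfrac{1}{f_s}, \tfrac{2}{f_s}, \ldots \\
p(\mathrm{GN}) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{\mathrm{GN}^2}{2\sigma^2}\right)
\end{aligned}
```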

To improve the generalization capability of the proposed network, we simulated 8000 signals with different noise realizations for each frequency precision step. For example, for a vibration range of 0–50 Hz with a frequency precision step of 0.1 Hz, the MFR is 500, so 8000 × 500 = 4,000,000 signals were generated for training the proposed network.
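A dataset of this kind could be generated along the following lines. This is a sketch under assumptions, not the authors' generator: the sinusoidal signal model, the unit amplitude, the mapping of class `k` to frequency `k * step`, the peak-magnitude normalization, and the function names are all hypothetical. Only the noise range (0.5 to 2.2), the 10 s duration at 100 Hz, and the count of 8000 signals per class come from the text; the demo uses 8 per class to stay small.

```python
import numpy as np


def simulate_signal(freq, fs=100.0, duration=10.0, amplitude=1.0, rng=None):
    """Return one noisy sinusoid normalized to [-1, 1] (assumed signal model)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(int(fs * duration)) / fs        # even steps of 1/fs starting at zero
    sigma = rng.uniform(0.5, 2.2)                 # noise std range given in the text
    noise = rng.normal(0.0, sigma, size=t.size)   # zero-mean Gaussian noise
    signal = amplitude * np.sin(2 * np.pi * freq * t) + noise
    return signal / np.max(np.abs(signal))        # scale so the peak magnitude is 1


def make_dataset(f_max=50.0, step=0.1, per_class=8, seed=0):
    """Build (signals, labels); the text uses per_class=8000."""
    rng = np.random.default_rng(seed)
    mfr = round(f_max / step)                     # 50 / 0.1 gives 500 classes
    labels = np.repeat(np.arange(mfr), per_class)
    signals = np.stack([simulate_signal(k * step, rng=rng) for k in labels])
    return signals.astype(np.float32), labels


signals, labels = make_dataset()
```

With `per_class=8000`, the same call would produce the 4,000,000 training signals mentioned above, at a correspondingly larger memory cost.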

We used the negative log-likelihood loss and the adaptive moment estimation (Adam) [40] optimizer for training. The learning rate was kept fixed, and the batch size was 4096. Training was performed using PyTorch [41] on an NVIDIA GTX 1060 and usually converged to a good model within 8 h.
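The pairing of loss and optimizer can be illustrated with a short PyTorch sketch. Several things here are placeholders rather than details from the paper: the learning rate value is not given in the text, so `lr=1e-3` is an arbitrary stand-in; the one-layer model is a hypothetical substitute for the 1D CNN of Section 2.1; and the batch is random data. The point is only that `NLLLoss` expects log-probabilities, so the model ends in a log-softmax.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier (hypothetical); the real model is the 1D CNN of Section 2.1.
n_samples, mfr = 1000, 500
model = nn.Sequential(nn.Linear(n_samples, mfr), nn.LogSoftmax(dim=1))

criterion = nn.NLLLoss()                                    # expects log-probabilities
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr is a placeholder value


def train_step(signals, labels):
    """Run one fixed-learning-rate Adam update and return the batch loss."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(signals), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# One random mini-batch with the batch size stated in the text.
signals = torch.rand(4096, n_samples) * 2 - 1   # values in [-1, 1)
labels = torch.randint(0, mfr, (4096,))
losses = [train_step(signals, labels) for _ in range(3)]
```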

##### 2.4. Pipeline

After the region of interest (ROI) was selected from the original video recorded in the experiments, the brightness value of each pixel was read out along the time dimension and saved as an individual brightness variation signal. For instance, Figure 4(a) shows an ROI with a width of $W$ pixels and a height of $H$ pixels. With an input video of duration $T$ and sampling rate $f_s$, we obtain $W \times H$ readout signals of length $N$. All the signals were fed into the proposed neural network with the confidence kernel, directly yielding a pixel-level vibration frequency prediction result map, as shown in Figure 4(b).

Figure 4(c) shows a flowchart of the proposed method. From the video recording to the frequency prediction result map, the steps are concise and straightforward. Simply by reading out the brightness variation signals within the ROI and feeding them to the neural network with the confidence kernel, we directly obtain the frequency prediction result map without additional algorithms or signal processing techniques.
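The readout and reconstruction steps of this pipeline can be sketched with NumPy. The function names, the `(frames, rows, cols)` video layout, the per-signal min-max normalization, and the class-to-frequency mapping are assumptions for illustration; the network prediction itself is replaced by a constant stand-in, and the video is random data.

```python
import numpy as np


def read_out_signals(video, top, left, height, width):
    """Turn an ROI of a grayscale video (frames, rows, cols) into per-pixel signals."""
    roi = video[:, top:top + height, left:left + width].astype(np.float64)
    signals = roi.reshape(roi.shape[0], -1).T      # shape (height * width, N)
    lo = signals.min(axis=1, keepdims=True)
    hi = signals.max(axis=1, keepdims=True)
    half = np.where(hi > lo, (hi - lo) / 2.0, 1.0)  # avoid dividing by zero
    return (signals - (hi + lo) / 2.0) / half       # each signal scaled to [-1, 1]


def to_result_map(predicted_classes, height, width, step=0.1):
    """Place per-pixel class predictions back at their ROI positions, in Hz."""
    return (np.asarray(predicted_classes) * step).reshape(height, width)


rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(1000, 20, 30), dtype=np.uint8)  # 10 s at 100 Hz
signals = read_out_signals(video, top=5, left=10, height=6, width=8)

# Stand-in for the network output: every pixel predicted as class 127 (12.7 Hz).
fake_classes = np.full(signals.shape[0], 127)
result_map = to_result_map(fake_classes, height=6, width=8)
```

The resulting `result_map` plays the role of the unconfident result map, to which the confidence kernel would then be applied.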

##### 2.5. Experimental Methods

To enable comparison with previous work, the same experiments used by Liu and Yang [34] were adopted, including a verification test, a laboratory experiment, and a field experiment. In all the experiments, the image sequences were captured by an industrial camera with an Aptina MT9P031 image sensor for a duration of 10 s at a sampling rate of 100 Hz. In addition, an industrial vibrometer (DongHua5906) was used as the reference against which the vibration frequency predictions of the proposed method were compared. The frequency measurement range was 0–50 Hz, and the measurement precision was 0.1 Hz.

In this work, we used the same raw video clips used by Liu and Yang [34], and the vibration prediction results obtained using the proposed method were compared with those by Liu and Yang [34], without additional edge enhancement.

###### 2.5.1. Verification Test

The verification test was conducted using a vibration test system, consisting of a waveform generator (RIGOL DG1022), an amplifier (MB Dynamics SLV500VCF), and an exciter (MB Dynamics MODAL 50). A modular steel structure was excited by a standard sine signal from 0 to 50 Hz in steps of 5 Hz. The ROI was selected near the exciter with a resolution of pixels, as shown in Figure 5. The proposed method was applied to the video captured from this setup and compared with the result obtained by Liu and Yang [34].

###### 2.5.2. Laboratory Experiment

The first experiment was conducted in a laboratory with adequate lighting and a stable background behind the measurement target, as shown in Figure 6. A simply supported carbon plate was used as the measurement target, and the vibrometer was attached at its midpoint. The vibrations of the carbon plate, excited using an impact hammer on the left support, were recorded using the vibrometer and the industrial camera. The midpoint of the plate was selected as the ROI with a resolution of . The proposed method was applied to the video captured from this setup and compared with the result obtained in [34] and through the vibrometer.

###### 2.5.3. Field Experiment

The field experiment was conducted on Wuyuan Bridge, Xiamen. Because the vibration frequency of a bridge cable is an important parameter for estimating the tension force of the structural components and the mode shape, a bridge cable was selected as the test object. The excitation sources in the field experiment were randomly passing vehicles and pedestrians. The vibrometer was attached to the cable near the bridge fence for easy installation, and the ROI with a resolution of was selected on a higher region of the cable to avoid interference from passing pedestrians and vehicles, as shown in Figure 7. The proposed method was applied to the video captured from this setup and compared with the results obtained in [34] and through the vibrometer.

#### 3. Results

##### 3.1. Verification Test

The frequency prediction result maps for 15, 30, and 45 Hz, obtained in [34] and by using the proposed method, are shown side by side in Figure 8 for comparison. For quantitative comparison, Figure 9 shows the corresponding histograms of the predicted frequency distribution (pixel counts) from both methods. Table 1 summarizes the frequency measurement results for the nine excitation frequencies: 5, 10, 15, 20, 25, 30, 35, 40, and 45 Hz.

Compared with the result of Liu and Yang [34], the result map obtained using the proposed method is visually clean, with only a few trivial results, as shown in Figure 8(c); the map is less noisy than that in Figure 8(b). The pixel-level vibration measurement result is easy to read because the trivial results are removed and set to 0. Figure 9 quantitatively confirms that, unlike the method by Liu and Yang [34], the proposed method generates almost no trivial results, and more pixels were confidently predicted around the excitation frequency.
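A histogram of predicted frequencies like the one used for this comparison can be computed from a result map as sketched here. The function name and the small demo map are hypothetical, and the sketch assumes that a value of 0 marks a pixel rejected by the confidence kernel, so zeros are counted separately rather than as a 0 Hz prediction.

```python
import numpy as np


def frequency_histogram(result_map, step=0.1):
    """Count pixels per predicted frequency, skipping the 0 output of the kernel."""
    classes = np.rint(np.asarray(result_map) / step).astype(int)  # Hz to class index
    values, counts = np.unique(classes[classes != 0], return_counts=True)
    return {round(float(v) * step, 1): int(c) for v, c in zip(values, counts)}


# Hypothetical 3 x 4 result map in Hz (0 marks pixels rejected by the kernel).
demo_map = np.array([[0.0, 15.0, 15.0, 0.0],
                     [15.0, 15.0, 14.9, 0.0],
                     [0.0, 15.0, 0.0, 0.0]])
histogram = frequency_histogram(demo_map)
rejected_share = float(np.mean(demo_map == 0))
```

For this demo map, five pixels fall at 15.0 Hz, one at 14.9 Hz, and half of the pixels are rejected.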

##### 3.2. Laboratory Experiment

Figure 10 shows the frequency prediction result maps obtained using the method in [34] and the proposed method. The corresponding histograms are plotted in Figure 11. Figure 12 shows the vibration measurement result obtained through the industrial vibrometer; the normalized vibration signal is plotted in Figure 12(a), and the power spectral density (PSD) computed via fast Fourier transform (FFT) with peak picking is shown in Figure 12(b).
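For readers unfamiliar with the reference processing, FFT-based peak picking can be sketched as follows. This is a generic periodogram estimate, not the vibrometer's actual processing chain: the function name, the sampling rate, and the synthetic 12.7 Hz test signal are assumptions chosen for illustration.

```python
import numpy as np


def peak_frequency(signal, fs):
    """Return the frequency of the largest PSD peak of a real-valued signal."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                  # remove the DC component
    spectrum = np.fft.rfft(signal)
    psd = np.abs(spectrum) ** 2 / (fs * signal.size)  # simple periodogram estimate
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    return freqs[np.argmax(psd)]


# Synthetic check: a noisy 12.7 Hz sinusoid sampled at a hypothetical 100 Hz for 10 s.
fs = 100.0
t = np.arange(1000) / fs
rng = np.random.default_rng(0)
test_signal = np.sin(2 * np.pi * 12.7 * t) + rng.normal(0.0, 0.5, size=t.size)
f_peak = peak_frequency(test_signal, fs)
```

With a 10 s record, the frequency resolution of this estimate is 0.1 Hz; a longer record or a higher sampling rate would be needed to resolve finer values.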

Figure 10(b) shows that the map of the frequency prediction results obtained using the proposed method is almost free of trivial results. The histogram of the predicted frequency distribution in Figure 11(b) confirms this: the predicted frequencies are concentrated around a single value. In particular, the direct result from the proposed method is 12.7 Hz (Figure 11(b)), which is closer to the 12.75 Hz obtained through the vibrometer (Figure 12(b)) than the direct result obtained using the method by Liu and Yang [34], i.e., 12.4 Hz (Figure 11(a)).
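Expressed as relative deviations from the vibrometer reading, the laboratory comparison works out as follows. This is only arithmetic on the three values quoted above; assuming the output classes lie on multiples of the 0.1 Hz precision step, 12.7 Hz and 12.8 Hz are the closest values the network can report to 12.75 Hz.

```latex
\begin{aligned}
\text{proposed method:}\quad & \frac{|12.7 - 12.75|}{12.75} \approx 0.39\% \\
\text{method of [34]:}\quad & \frac{|12.4 - 12.75|}{12.75} \approx 2.75\%
\end{aligned}
```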

##### 3.3. Field Experiment

In the field experiment, the direct output result maps of the method in [34] and of the proposed method were compared, as shown in Figure 13. The corresponding histograms of the predicted frequency distributions are plotted in Figure 14. The vibration signals from the vibrometer are normalized and plotted in Figure 15(a); the corresponding PSD obtained via FFT with peak picking is plotted in Figure 15(b).

The real-field application environment is challenging for many image-based methods because of the presence of various noise sources and uncontrollable conditions. As shown in Figure 13(a), the direct result obtained using the method in [34] is noisy, and the histogram of its predicted frequency distribution in Figure 14(a) indicates that most pixels were predicted under 1 Hz; this is attributed to the uncontrollable lighting conditions and other types of mechanical or electrical noise. To solve this problem and obtain a reliable result, an additional edge enhancement process must be applied to their method [34]. The result of the proposed method in Figure 13(b) shows that, although not many pixels were output with confidence, the result is free of trivial results. In addition, noise under 1 Hz is not observed in the direct output of the prediction result map. The histogram of the predicted frequency distribution of the proposed method in Figure 14(b) also confirms that most pixels are predicted around 12.8 Hz, which is closer to the vibrometer result of 12.97 Hz (Figure 15(b)) than the result obtained in [34], 12.6 Hz (Figure 14(a)).
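The same comparison for the field experiment, again as plain arithmetic on the values quoted above, gives the following relative deviations from the vibrometer reading of 12.97 Hz.

```latex
\begin{aligned}
\text{proposed method:}\quad & \frac{|12.8 - 12.97|}{12.97} \approx 1.31\% \\
\text{method of [34]:}\quad & \frac{|12.6 - 12.97|}{12.97} \approx 2.85\%
\end{aligned}
```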

#### 4. Discussion

The experiments were conducted under different conditions, with different excitation sources and different test objects vibrating at various frequencies. The results demonstrate the following about the performance of the proposed method: (1) The proposed method can directly use video clips containing the target object to predict the vibration frequency within a small error range. (2) Under both ideal laboratory conditions and uncontrollable real-field conditions, the proposed method generates reliable result maps with few trivial results. Moreover, the proposed method was more robust than that of [34] in real-field conditions even when the direct output was used; no additional enhancement was required.

Compared with conventional image-based vibration frequency measurement methods, the proposed method does not explicitly extract vibration signals to obtain the vibration frequency. We directly obtained pixel-level vibration frequency results by using a feedforward neural network with a confidence kernel and no additional image or signal processing algorithm. Compared with the previous method [34], the measurement accuracy increased slightly owing to the novel network architecture and a better training dataset. In particular, the confidence kernel greatly improved the applicability and robustness of the method: only a few trivial results were generated, and the map of the pixel-level vibration frequency results was clear and accurate. Moreover, no statistical operation, such as a histogram, or edge enhancement [34] was required. Thus, the output result map is directly usable without any additional operation.

The major limitation of the proposed method is that if a different measurement range is required, a new model must be trained. Another important limitation is that only the principal frequency can be measured; if the measurement target is vibrating with more than one frequency component, all frequency components except the principal frequency component will be ignored. Moreover, the proposed method is vulnerable to lighting variation, camera shake, disturbances caused by background or foreground moving objects, and other types of electric or mechanical noise; these are common limitations of all image-based measurement methods.

Nevertheless, the presented work opens up many opportunities for future research. The adoption of machine learning in image-based engineering measurement is relatively new and can be explored in other types of measurements, such as rotation speed measurement and displacement measurement. However, our dataset is limited in that it does not contain any real vibration signals, and the signal simulation algorithm is very simple. A dataset augmented with real-world signals or a better simulation algorithm may further improve this method. More advanced neural network architectures or deep learning techniques can also be incorporated to improve the performance of this method.

#### 5. Conclusions

In summary, the authors presented a novel method to measure vibration frequency based on machine learning and a confidence kernel, using an industrial camera as the sensor. The performance of the measurement system in terms of reliability and accuracy was evaluated through experiments. The experimental results demonstrated that the proposed method can accurately and stably measure vibration frequency, with the advantages of noncontact and targetless operation, simple hardware, and a concise pipeline. Future work is expected to provide a flexible measurement range and the estimation of multiple vibration frequency components by using a single trained model.

#### Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Acknowledgments

This work was supported by the National Natural Science Foundation of China (11372074).