Mobile robots that operate in real-world environments interact with their surroundings and generate complex acoustic and vibration signals, which carry rich information about the terrain. This paper presents a new terrain classification framework that utilizes both acoustic and vibration signals resulting from the robot-terrain interaction. As an alternative to handcrafted domain-specific feature extraction, a two-stage feature selection method combining the ReliefF and mRMR algorithms was developed to select optimal feature subsets that carry more discriminative information. As different data sources can provide complementary information, a multiclassifier combination method was proposed that considers a priori knowledge and fuses predictions from five data sources: one acoustic data source and four vibration data sources. In this study, four conceptually different classifiers were employed to perform the classification, each with a different number of optimal features. Signals were collected using a tracked robot moving at three different speeds on six different terrains. The new framework improved the classification performance of all four classifiers using the newly developed optimal feature subsets. Greater improvement was observed for the robot traversing at lower speeds.

1. Introduction

Mobile robots are increasingly deployed in real-world environments, such as forestry, mining, rescue, site inspection, and space exploration [1]. Due to the huge variety of different scenarios that can be encountered in each operating environment, mobile robots must be prepared to deal with complex, unknown, and even dangerous terrains. While they can traverse hard flat ground safely at high speed, they may experience slippage, sinking, and embedding events in the face of loose slippery terrains [2, 3]. Other surfaces can be bumpy and rocky, which may result in damage to the robot [4]. Consequently, the terrain itself can become a hazard, referred to as a nongeometric hazard [5]. To achieve efficient and safe navigation, a mobile robot should adapt its driving style, control strategy, or path planning strategy to accommodate characteristics of the terrain.

Considerable effort has been devoted to terrain perception for mobile robots. At present, the most commonly used sensing modalities are cameras and LADARs. Vision-based methods are a powerful tool for perceiving the surrounding environment, in which texture or color information is utilized to define the terrain. However, they can be unreliable, as changes in appearance may be caused by factors such as illumination, weather, and camouflaging by leaves [6]. The LADAR-based method emphasizes segmenting terrain from obstacles rather than characterizing the terrain itself, and it can be affected by factors such as rain, snow, and smog [7]. Traditional terrain perception methods therefore have considerable limitations, and reliably sensing terrain properties remains a challenging task.

It is well known that human beings can capture information about terrain during walking by sensing it with their feet and by the sound of their footsteps, of particular importance in dark environments [8]. Likewise, acoustic and vibration responses resulting from robot-terrain interactions can be exploited for terrain characterization. In previous studies, vibration-based methods have performed well in classifying ground surfaces with different degrees of coarseness [9, 10]. However, acoustic-based methods are believed to be more suitable in distinguishing material types of different terrains [11]. Moreover, the acoustic-based method is highly sensitive to running water and can be used to alert a robot before driving into a stream. Tactile force/torque (F/T) sensors have been adopted for legged robots to account for the specific leg-terrain interaction [12].

Iagnemma et al. first proposed the vibration-based method for terrain classification [13]. The basic idea is to use vibrations induced in the wheels, the axes, or the body of the robot to classify the underlying terrain being traversed. The method was performed with a planetary rover, in which the PSD function was used to form the feature vector and PCA was used for dimensionality reduction [14]. Weiss et al. used vibration response collected by a cart with hard wheels to classify the interactions [15]. Four data representations including log-scaled PSD, FFT, and other statistical measures were used to perform seven-class experiments with SVM. An average accuracy of 91.7% was achieved with a combined feature vector; however, for gravel, poor accuracy of only 66.83% was obtained. Further to this, Weiss et al. tested the same method using a RWI ATRV-Jr robot and demonstrated that the SVM-based method outperformed other methods, including Brooks and Iagnemma’s method, PNN, kNN, Naïve Bayes, and J4.8 decision tree [2]. In addition, a combination of vision-based and vibration-based methods was reported to significantly improve the classification rates when compared to single sensor-based prediction performance [16]. Finally, Brooks and Iagnemma introduced a self-supervised method suitable for environments with unexpected appearance [5].

Although comparatively little research has been reported on acoustic-based methods, it is a growing area of interest. Ojeda et al. used a microphone to classify robot-terrain interactions; however, the study reported that sound performed poorly as the sole modality for terrain classification, except for grass [17]. Libby and Stentz employed an acoustic-based method to classify interactions with three terrain classes and three hazardous objects using a combined feature vector formed by several traditional features including spectral coefficients, moments, and temporal and spectral characteristics [11]. By applying a smoothing technique, an accuracy of 92% was obtained. More recently, Valada et al. proposed a deep learning-based terrain classification method [18]. Using a convolutional neural network to learn deep features from spectrograms of the acoustic signal, the method was reported to significantly outperform methods using traditional audio features. Christie and Kottege presented a real-time acoustics-based terrain classification system for a legged robot [8]. With a 32-dimensional feature vector combining spectral and temporal features, an accuracy of 92.9% was obtained; using a noise subtraction technique to remove servo noise, it was further increased to 95.1%. Note that the acoustic-based method is susceptible to background noise; however, owing to the complexity and great variation of unpredictable background noise, few methods for handling environmental noise have been reported apart from the servo noise subtraction method. A general way to evaluate the effect of noise on the classification rate is to test the robustness of the proposed method under different background noises at differing levels of SNR.

In this paper, a new terrain classification framework is presented to improve classification performance. There are two main contributions in our study. First, instead of relying on handcrafted domain-specific features, a two-stage feature selection method combining the ReliefF and mRMR algorithms was developed to select optimal yet compact feature subsets, taking both attribute weights and redundancy reduction into account. Moreover, the combined method is more computationally efficient than mRMR working alone. Second, a multiclassifier combination method was developed that fuses the predictions from five data sources. The predicted class is determined by integrating prior knowledge with the current classification results. The proposed framework has demonstrated promising performance.

2. Overview of the Terrain Classification Framework

The proposed framework involves the following steps:
(1) Data collection from tracked robot-terrain interaction
(2) Assigning labels to the prepared terrains
(3) Splitting the collected signal into short time windows
(4) Extracting features from each window
(5) Selecting optimal feature subsets using the two-stage feature selection method
(6) Training a classifier using the optimal feature subsets
(7) Predicting the class labels of these short windows
(8) Determining the terrain class by fusing the predictions from each classifier based on the prior knowledge of each data source

A schematic overview of the framework is shown in Figure 1.

2.1. Data Collection and Hand Labeling

In each experiment, the tracked robot was driven over six different types of terrain: brick, asphalt, low grass, firm soil, gravel, and soft soil, as illustrated in Figure 2. For each type of terrain, one location was considered in the experiment. A data acquisition instrument was used to collect the signal. Data was recorded and transmitted to a desktop computer via a router in real time, and further processing was performed offline. The data was converted into MAT-files and processed within the MATLAB environment. Each sequence was split into short windows of one-second duration, and every window was assigned the class label of the terrain on which it was recorded.
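The windowing and labeling step can be sketched as follows. This is a minimal illustration in Python rather than the authors' MATLAB code; the function names are hypothetical, and the windows are assumed to be consecutive and non-overlapping, which the text does not state explicitly.

```python
def split_into_windows(signal, sampling_rate_hz, window_seconds=1.0):
    """Split a 1-D sequence into consecutive, non-overlapping windows.

    Trailing samples that do not fill a whole window are dropped
    (an assumption; the paper does not say how remainders are handled).
    """
    window_len = int(round(sampling_rate_hz * window_seconds))
    if window_len <= 0:
        raise ValueError("window length must be at least one sample")
    n_windows = len(signal) // window_len
    return [signal[i * window_len:(i + 1) * window_len]
            for i in range(n_windows)]


def label_windows(windows, terrain_label):
    """Attach the same hand-assigned terrain label to every window."""
    return [(window, terrain_label) for window in windows]
```

For example, a 2.5-second acceleration record sampled at 1 kHz yields two labeled windows of 1000 samples each.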

2.2. Feature Extraction

Features were extracted from each window. FFT is the most commonly used feature extraction method for acceleration and is often used to perform transformation from time domain to frequency domain. Moreover, it is the foundation of many other features such as the MFCC and frequency characteristics. Thus, FFT was chosen as a basic feature candidate for both the acoustic and acceleration data. As the SNR is higher in regions where there is more power, only the lower part of the spectrum was used in this study. The truncation point was set to 200 Hz.
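A truncated spectrum of this kind can be sketched as below. With a one-second window the frequency resolution is 1 Hz, so keeping the bins from 1 Hz to 200 Hz gives a 200-dimensional vector. The sketch uses a direct DFT from the Python standard library for self-containment (an FFT routine would be used in practice); skipping the DC bin and normalizing the magnitude by the window length are assumptions, not details given in the text.

```python
import cmath
import math


def truncated_spectrum(window, sampling_rate_hz, cutoff_hz=200.0):
    """Magnitude spectrum of one window, keeping bins up to cutoff_hz.

    Bin k corresponds to k * sampling_rate_hz / len(window) Hz.
    The DC bin (k = 0) is skipped, which is an assumption.
    """
    n = len(window)
    resolution_hz = sampling_rate_hz / n
    n_bins = int(math.floor(cutoff_hz / resolution_hz + 1e-9))
    if n_bins > n // 2:
        raise ValueError("cutoff exceeds the Nyquist frequency")
    spectrum = []
    for k in range(1, n_bins + 1):
        # Direct DFT sum for bin k (O(n) per bin).
        acc = sum(sample * cmath.exp(-2j * math.pi * k * t / n)
                  for t, sample in enumerate(window))
        spectrum.append(abs(acc) / n)
    return spectrum
```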

MFCCs are perceptually based spectral features that have been successfully used in speech recognition; they map the linear frequency scale to a scale that resembles the frequency resolution of the human ear [19]. Whereas the majority of the spectral power of the acoustic signal resulting from robot-terrain interaction lies between 0 and 200 Hz, the speech signal has more distinctive information in the mid-to-high frequency ranges; therefore, a modified form with greater emphasis on lower frequencies should be developed for this application. The traditional relationship between Mel-frequency and normal frequency can be written as

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \quad 0 \le f \le 4000\ \mathrm{Hz}, \tag{1}$$

where $\mathrm{Mel}(f)$ represents the transformed frequency with the unit Mel, $f$ represents the linear frequency, and 4000 Hz is the Nyquist frequency.

Transforming (1) to a more general form, it becomes

$$\mathrm{Mel}(f) = a \log_{10}\left(1 + \frac{f}{b}\right), \tag{2}$$

where $a$ = 2595 and $b$ = 700 in the traditional form. Substituting $\mathrm{Mel}(f)$ = 2146 Mel and $f$ = 4000 Hz into (2) yields

$$a = \frac{2146}{\log_{10}\left(1 + 4000/b\right)}. \tag{3}$$

The best Mel mapping function curve can be obtained by tuning $a$ and $b$. According to (3), we changed the parameter $b$ from 10 to 1000 with an interval of 10, and 100 parameter sets were thus obtained. A tuned SVM classifier was used to perform the classification with the MFCC feature vector. Finally, the parameter set that delivered the best classification performance was identified and used for the MFCC. The classification accuracy as a function of $a$ and $b$ is shown in Figure 3, from which the best-performing parameter pair can be read. In this paper, 18-dimensional modified MFCCs were used and denoted as 18-dimensional MMFCCs.
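The parameter sweep can be illustrated with a short sketch. It assumes the generalized mapping has the form a·log10(1 + f/b), that the offset parameter (called `b` here) is the one swept from 10 to 1000, and that the scale parameter is derived from the constraint that 4000 Hz maps to 2146 Mel; the symbol and function names are illustrative choices.

```python
import math

NYQUIST_HZ = 4000.0       # Nyquist frequency for 8 kHz sampling
MEL_AT_NYQUIST = 2146.0   # Mel value of the traditional mapping at 4000 Hz


def scale_for_offset(b):
    """Scale a such that a * log10(1 + 4000 / b) equals 2146 Mel."""
    return MEL_AT_NYQUIST / math.log10(1.0 + NYQUIST_HZ / b)


def mel(f_hz, b):
    """Generalized Mel mapping with the scale tied to the offset b."""
    return scale_for_offset(b) * math.log10(1.0 + f_hz / b)


def candidate_parameter_sets(start=10, stop=1000, step=10):
    """All (a, b) pairs produced by sweeping b in steps of 10."""
    return [(scale_for_offset(b), b) for b in range(start, stop + 1, step)]
```

Setting `b = 700` gives a scale of about 2595, which recovers the traditional mapping, while smaller values of `b` allocate more of the Mel range to frequencies below 200 Hz. Each candidate pair would then be scored with the tuned classifier.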

The gianna and shape feature vector [11] is a combination of the gianna feature [20] and the shape feature [21]. The gianna feature includes three features extracted from the time domain (ZCR, STE, and entropy) and three features from the frequency domain (spectral centroid, spectral roll-off, and spectral flux). The ZCR indicates the number of times the signal crosses the zero axis per second. The STE is the sum of squares of the amplitudes. Abrupt energy changes in the signal are characterized by entropy. The entropy is calculated by $H = -\sum_{i=1}^{L} \sigma_i^2 \log_2 \sigma_i^2$, where $L$ denotes the number of subframes that a signal frame is divided into and $\sigma_i^2$ denotes the normalized energy of the $i$th subframe. In this work, a tuned SVM classifier was used to perform the classification with the entropy feature extracted using a range of values of $L$; the value of $L$ that gave the best performance was subsequently used for the entropy feature. Spectral roll-off is the frequency below which a certain percentage of the spectral energy lies; the percentage was empirically set to 80%. Spectral flux measures local spectral changes between successive frames and can be calculated by squaring and summing the differences between the spectra of successive frames. The shape feature consists of four scalar features characterizing the shape of the spectral distribution: spectral centroid, spectral standard deviation, spectral skewness, and spectral kurtosis.
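Four of these scalar features can be sketched as follows. The function names are hypothetical, the entropy follows the common energy-entropy definition (base-2 logarithm over normalized subframe energies), and counting a sample equal to zero as non-negative in the ZCR is an assumption.

```python
import math


def zero_crossing_rate(window, duration_s=1.0):
    """Number of sign changes per second within the window."""
    crossings = sum(1 for a, b in zip(window, window[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / duration_s


def short_time_energy(window):
    """Sum of squared amplitudes."""
    return sum(x * x for x in window)


def energy_entropy(window, n_subframes):
    """Entropy of the normalized subframe energies (in bits)."""
    sub_len = len(window) // n_subframes
    energies = [short_time_energy(window[i * sub_len:(i + 1) * sub_len])
                for i in range(n_subframes)]
    total = sum(energies)
    if total == 0:
        return 0.0
    shares = [e / total for e in energies]
    return -sum(p * math.log2(p) for p in shares if p > 0)


def spectral_rolloff(magnitudes, frequencies_hz, fraction=0.80):
    """Lowest frequency below which `fraction` of the spectral energy lies."""
    energies = [m * m for m in magnitudes]
    threshold = fraction * sum(energies)
    cumulative = 0.0
    for freq, energy in zip(frequencies_hz, energies):
        cumulative += energy
        if cumulative >= threshold:
            return freq
    return frequencies_hz[-1]
```

A window whose energy is spread evenly over the subframes gives the maximum entropy, whereas an abrupt burst confined to one subframe gives an entropy of zero.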

Compared to an acoustic signal, there are fewer feature vectors for acceleration. As mentioned previously, FFT was adopted as the basic feature candidate, and the truncation point was set to 200 Hz. By doing so, the lower part of the spectrum is used, resulting in a higher SNR. Based on FFT, five derived frequency domain indicators were calculated: MSF, RMSF, FC, RVF, and VF [22]. Eleven temporal indicators were also extracted, including the average value, maximum value, and RMS [23].

To summarize, each short window was taken as a dataset and processed to extract different feature vectors. The feature vector candidates for the acoustic set are listed as follows:
(i) 18-Dimensional MMFCCs
(ii) 200-Dimensional FFT
(iii) 9-Dimensional gianna and shape feature
(iv) 227-Dimensional feature vector formed by the above three feature vectors

The feature vector candidates for the acceleration set are listed as follows:
(i) 11-Dimensional temporal indicators
(ii) 200-Dimensional FFT
(iii) 5-Dimensional frequency characteristics
(iv) 216-Dimensional feature vector formed by the above three feature vectors

All features were normalized to a fixed range before feature selection and classification. Note that all code was implemented in the MATLAB environment.
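A per-feature min-max normalization can be sketched as below. The target interval [0, 1] and the choice to fit the bounds on the training rows only are assumptions made for this illustration; the function names are hypothetical.

```python
def fit_min_max(rows):
    """Per-feature (min, max) bounds computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, bounds, low=0.0, high=1.0):
    """Rescale every feature linearly so the fitted bounds map to [low, high].

    Constant features (max == min) are mapped to `low`.
    """
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, bounds):
            if hi == lo:
                new_row.append(low)
            else:
                new_row.append(low + (high - low) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled
```

Reusing the training bounds on the test windows keeps the two sets on the same scale, at the cost that test values may fall slightly outside the target interval.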

2.3. Feature Selection

More features increase the computation load and make the corresponding classification model more prone to overfitting. In classification problems, there may be hundreds of potential features that can be used to characterize a target object; however, noisy or irrelevant features provide little information and should be identified and removed. In this paper, a two-stage feature selection method is developed combining ReliefF [24] and mRMR [25]. The ReliefF algorithm estimates the quality of attributes even when there are dependencies between attributes, while the mRMR algorithm finds attributes that have the highest relevance to the target class and are maximally dissimilar from each other. Since the ReliefF algorithm is less computationally expensive than mRMR, it is performed in the first stage to select feature candidates and to reduce the computation load of the next stage. An importance value is calculated for each attribute to rank them, and the 100 most important attributes are selected to form a feature vector candidate. In the second stage, the mRMR algorithm is performed to reduce the redundancy of the feature candidates: the 100 selected attributes are ranked once again, and the final subset is selected based on the new ranking. Consequently, optimal yet compact feature subsets can be obtained to feed into the classification model.

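The ReliefF algorithm works as follows: first, an instance $R_i$ is randomly selected from the training set; then its $k$ nearest neighbors are searched for in the same class and in each of the different classes, which are called the nearest hits $H_j$ and the nearest misses $M_j(C)$, respectively. Based on $R_i$, $H_j$, and $M_j(C)$, the importance estimation $W[A]$ is updated for every attribute $A$ in the set of all attributes. The pseudocode is given in Algorithm 1.

Algorithm 1: ReliefF.
Input: feature vectors and class labels
Output: the vector $W$ of estimations of the importance values of attributes
set all weights $W[A] := 0$;
for $i := 1$ to $m$ do
  randomly select an instance $R_i$;
  search $k$ nearest hits $H_j$;
  for each class $C \neq \mathrm{class}(R_i)$ do
    find $k$ nearest misses $M_j(C)$ from class $C$;
  for $A := 1$ to $n$ do
    $W[A] := W[A] - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, H_j)}{m k} + \sum_{C \neq \mathrm{class}(R_i)} \frac{P(C)}{1 - P(\mathrm{class}(R_i))} \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, M_j(C))}{m k}$;

Referring to the pseudocode, $P(C)$ is the prior probability of class $C$, estimated from the training set. The function $\mathrm{diff}(A, I_1, I_2)$ calculates the difference between the values of the attribute $A$ for two instances, $I_1$ and $I_2$. The variable $n$ is the number of attributes in the feature vector. The number of iterations $m$ and the number of nearest neighbors $k$ are both set by the user.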
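The first selection stage can be sketched in plain Python as follows. This is an illustrative implementation of the standard ReliefF update, not the authors' MATLAB code; the function names, the Manhattan distance over range-scaled attributes, and the fixed random seed are assumptions.

```python
import random
from collections import Counter


def relieff(X, y, n_iterations, n_neighbors, seed=0):
    """Estimate one importance weight per attribute with ReliefF."""
    rng = random.Random(seed)
    n_samples, n_attrs = len(X), len(X[0])
    counts = Counter(y)
    prior = {c: counts[c] / n_samples for c in counts}
    # Value range of each attribute, used to scale differences into [0, 1].
    spans = []
    for a in range(n_attrs):
        col = [row[a] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / spans[a]

    def distance(i, j):
        return sum(diff(a, i, j) for a in range(n_attrs))

    def nearest(i, cls):
        candidates = [j for j in range(n_samples) if j != i and y[j] == cls]
        candidates.sort(key=lambda j: distance(i, j))
        return candidates[:n_neighbors]

    weights = [0.0] * n_attrs
    for _ in range(n_iterations):
        i = rng.randrange(n_samples)
        hits = nearest(i, y[i])
        misses = {c: nearest(i, c) for c in prior if c != y[i]}
        for a in range(n_attrs):
            hit_term = sum(diff(a, i, h) for h in hits)
            miss_term = 0.0
            for c, neighbours in misses.items():
                scale = prior[c] / (1.0 - prior[y[i]])
                miss_term += scale * sum(diff(a, i, m) for m in neighbours)
            # Reward separation from misses, penalize separation from hits.
            weights[a] += (miss_term - hit_term) / (n_iterations * n_neighbors)
    return weights


def top_attributes(weights, n_keep):
    """Indices of the n_keep attributes with the largest weights."""
    order = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
    return order[:n_keep]
```

In the framework described here, `top_attributes(weights, 100)` would produce the candidate set handed to the second stage.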

The mRMR algorithm employs Max-Relevance and Min-Redundancy criteria. Max-Relevance aims to find a feature set $S$ satisfying

$$\max D(S, c), \quad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c),$$

where $I$ is the mutual information value, $x_i$ is an individual feature, and $c$ is the class label. $D$ is the mean value of all mutual information values.

Features selected using the Max-Relevance criterion could have rich redundancy; therefore, Min-Redundancy is performed in the next step. The Min-Redundancy condition can be expressed by

$$\min R(S), \quad R = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j).$$

Finally, the two criteria can be combined by the operator $\Phi$:

$$\max \Phi(D, R), \quad \Phi = D - R.$$
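The second stage can be sketched as a greedy ranking under the difference form of the combined criterion (relevance minus mean redundancy). The sketch assumes the features have already been discretized, since mutual information is estimated here from simple co-occurrence counts; the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mrmr_rank(columns, labels, n_select):
    """Greedy mRMR ranking; `columns` holds one sequence per feature."""
    relevance = [mutual_information(col, labels) for col in columns]
    # Start with the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda i: relevance[i])]
    while len(selected) < min(n_select, len(columns)):
        best, best_score = None, -math.inf
        for i in range(len(columns)):
            if i in selected:
                continue
            redundancy = sum(mutual_information(columns[i], columns[j])
                             for j in selected) / len(selected)
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

A feature that duplicates an already selected one is ranked behind an equally relevant but independent feature, which is the redundancy reduction the second stage is meant to provide.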

2.4. Classification

For comparison and experimental validation, four conceptually different classifiers were used to perform the classification. The principles of each classifier are briefly described as follows.

(1) k-Nearest Neighbors. The kNN algorithm is perhaps the simplest and most intuitive classifier [26]. It maintains good performance when there are extremely irregular decision boundaries. It is a nonparametric identification method that does not need prior probability and conditional probability density functions; however, it is computationally intensive for problems with large training sets. When classifying an unknown instance, the distance of its feature vector $\mathbf{x}$ to all training vectors is calculated, and the $k$ training vectors closest to $\mathbf{x}$ are selected. The predicted class is then defined as the most frequently occurring class among the $k$ training vectors. Euclidean distance was utilized in this work; for feature vectors $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, the Euclidean distance takes the form

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}.$$
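A minimal sketch of this decision rule is given below; the function names are illustrative, and ties between classes are resolved by whichever class `Counter.most_common` returns first, which is an implementation detail rather than part of the method.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_X, train_y, query, k):
    """Majority class among the k training vectors closest to `query`."""
    neighbours = sorted(zip(train_X, train_y),
                        key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Every prediction scans the whole training set, which is why the classification time of kNN grows with the number of training samples.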

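(2) Naïve Bayes Classifier. Due to its fast and easy implementation, the Naïve Bayes classifier is widely used in pattern classification problems [27]. The basic idea is to assign an instance to the class $c$ that has the highest probability given its corresponding attributes $a_1, a_2, \ldots, a_n$. Under the assumption that the attributes are conditionally independent given the class, it can be expressed as

$$c^{*} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(a_i \mid c).$$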
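The rule can be sketched for continuous features as follows. Modeling each class-conditional attribute with a Gaussian is an assumption of this sketch (the text does not say which likelihood model was used), as are the function names and the small variance floor that avoids division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Per class: log prior, per-attribute means, per-attribute variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Class with the highest log posterior for feature vector x."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
        return log_prior + log_likelihood
    return max(model, key=log_posterior)
```

Working with log probabilities turns the product over attributes into a sum and avoids numerical underflow for long feature vectors.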

(3) Support Vector Machine. The basic idea of SVM [28] is to construct a hyperplane that maximizes the margin between the closest points of different classes. The method is less prone to overfitting than other methods and therefore performs particularly well for classification problems with small datasets. Two parameters need to be tuned for the SVM model, $C$ and $\gamma$. The parameter $C$ controls the penalty on classification errors; higher values of $C$ penalize errors more severely. The data is mapped into a higher-dimensional space using a kernel function, in which a linear hyperplane is constructed to separate the data; this corresponds to a nonlinear decision boundary in the original space. In this paper, the radial basis function kernel was used, whose width is governed by $\gamma$. The parameters $C$ and $\gamma$ were tuned using grid search and cross validation. The SVM algorithm was implemented with the LIBSVM software package [29].

(4) Random Forests. Random forest can tackle problems with high-dimensional feature vectors [30]. It is a combination of tree classifiers. Each tree classifier is built using a random vector sampled independently from the input vector. Finally, a unit vote is obtained from each tree classifier, and the input vector is assigned the most frequently occurring class label. A random forest consists of $n_{\mathrm{tree}}$ tree predictors, where $n_{\mathrm{tree}}$ can be set to any value; here, $n_{\mathrm{tree}}$ was empirically set to 500.

2.5. Multiclassifier Combination

As different sensors can provide complementary information, a multiclassifier combination method can be developed to improve classification performance. Voting principles are perhaps the most general and useful multiclassifier combination methods, aiming to make a consensus by fusing opinions from individual classifiers [31]. The voting principle can be expressed as follows.

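Assume that there is a pattern space $Z$ that consists of $M$ mutually exclusive sets: $Z = C_1 \cup C_2 \cup \cdots \cup C_M$, with $C_i$, $i \in \Lambda = \{1, 2, \ldots, M\}$, representing the classes. In this case, six different terrains are presented; thus, $M = 6$. Assume that $x$ is a sample from $Z$; a classifier (denoted by $e$) is used to assign a label $j$ to $x$, $j \in \Lambda$, which means that $x$ belongs to class $C_j$. The above event can be denoted by $e(x) = j$. Assuming $K$ individual classifiers $e_k$, $k = 1, \ldots, K$, an input sample $x$ is assigned a label $j_k$ by each classifier; then an event $e_k(x) = j_k$ is produced. The $K$ events are used to establish an integrated classifier $E$, and, finally, a definitive label $j$ is given to the input sample $x$; namely, $E(x) = j$, $j \in \Lambda$.

The event $e_k(x) = j_k$ can be expressed as

$$T_k(x \in C_i) = \begin{cases} 1, & e_k(x) = i,\ i \in \Lambda, \\ 0, & \text{otherwise}. \end{cases}$$

A commonly used voting rule based on majority is given by

$$E(x) = j \quad \text{if } T_E(x \in C_j) = \max_{i \in \Lambda} T_E(x \in C_i),$$

where $T_E(x \in C_i) = \sum_{k=1}^{K} T_k(x \in C_i)$.

In this study, the confusion matrix [32] was utilized to describe the errors for each classifier as follows:

$$PT_k = \begin{pmatrix} n_{11}^{(k)} & n_{12}^{(k)} & \cdots & n_{1M}^{(k)} \\ n_{21}^{(k)} & n_{22}^{(k)} & \cdots & n_{2M}^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ n_{M1}^{(k)} & n_{M2}^{(k)} & \cdots & n_{MM}^{(k)} \end{pmatrix},$$

where $i, j = 1, \ldots, M$. The entry $n_{ij}^{(k)}$ denotes that $n_{ij}^{(k)}$ samples belonging to class $C_i$ are classified into class $C_j$ by classifier $e_k$, where $i$ denotes class $C_i$ and $j$ denotes the event $e_k(x) = j$.

The number of test samples is given by

$$N^{(k)} = \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij}^{(k)}.$$

The number of samples belonging to each class $C_i$ is

$$n_{i\cdot}^{(k)} = \sum_{j=1}^{M} n_{ij}^{(k)}, \quad i = 1, \ldots, M.$$

The number of samples that are classified into class $C_j$ is given by

$$n_{\cdot j}^{(k)} = \sum_{i=1}^{M} n_{ij}^{(k)}, \quad j = 1, \ldots, M.$$

Under the occurrence of $e_k(x) = j$, the probability that a sample comes from class $C_i$ can be given by

$$P(x \in C_i \mid e_k(x) = j) = \frac{n_{ij}^{(k)}}{n_{\cdot j}^{(k)}}.$$

The confusion matrix is believed to be able to reflect the performance of $e_k$. A sample confusion matrix is shown in Figure 4. In this paper, the confusion matrix was employed as the prior knowledge from each data source, and each dataset corresponds to an individual classifier; thus, the total number of votes for class $C_i$ is given by

$$T_E(x \in C_i) = \sum_{k=1}^{K} P(x \in C_i \mid e_k(x) = j_k).$$

Finally, the voting principle based on prior knowledge takes the form

$$E(x) = j \quad \text{if } T_E(x \in C_j) = \max_{i \in \Lambda} T_E(x \in C_i).$$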
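The prior-knowledge voting rule can be sketched as follows: each classifier's prediction is converted, through the columns of its confusion matrix, into a probability for every class, and these probabilities are summed across classifiers. The function names and the uniform fallback for a label that a classifier never predicted on the reference data are assumptions of this sketch.

```python
def posterior_from_confusion(confusion):
    """post[j][i] = P(true class i | classifier predicted j).

    confusion[i][j] counts class-i samples that were labeled j.
    """
    n_classes = len(confusion)
    post = []
    for j in range(n_classes):
        column_total = sum(confusion[i][j] for i in range(n_classes))
        if column_total == 0:
            # Label j never predicted: fall back to a uniform distribution.
            post.append([1.0 / n_classes] * n_classes)
        else:
            post.append([confusion[i][j] / column_total
                         for i in range(n_classes)])
    return post


def fuse_predictions(confusions, predictions):
    """Sum the class probabilities implied by each classifier's prediction."""
    n_classes = len(confusions[0])
    votes = [0.0] * n_classes
    for confusion, predicted in zip(confusions, predictions):
        post = posterior_from_confusion(confusion)
        for i in range(n_classes):
            votes[i] += post[predicted][i]
    winner = max(range(n_classes), key=lambda i: votes[i])
    return winner, votes
```

With three two-class classifiers whose confusion matrices are `[[9, 1], [1, 9]]`, `[[5, 5], [5, 5]]`, and `[[6, 4], [4, 6]]` and whose predictions are 0, 1, and 1, a plain majority vote would choose class 1, whereas the weighted votes are 1.8 for class 0 and 1.2 for class 1, so the reliable first classifier prevails.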

3. Experimental System

A data acquisition system was built to record signals. It consisted of a 24-bit data acquisition instrument, a router, a computer, and three different sensors, as shown in Figure 5.

The acoustic signal was measured using an acoustic pressure sensor placed close to the first road wheel. To reduce the influence of background noise, it was pointed downward perpendicular to ground surface. For shock absorption, a bracket and damping foam block were used. A single-axis accelerometer was mounted on the axis of the first road wheel to collect acceleration data along vertical direction. Vibrations induced in the centroid position of the robot were measured by a triaxial accelerometer along three perpendicular directions, as depicted by the coordinate system shown in Figure 6. It was mounted on the bottom of robot. As shown in Figure 6, the instrumented robot has dimension of 1.3 m × 0.75 m × 0.38 m and weight of approximately 75 kg. Similar studies have demonstrated that driving speeds can significantly affect the generation of sound and vibration. Here, a wireless joystick was used to roughly maintain control of the speed and 3 different speeds were considered: 0.4 m/s, 0.8 m/s, and 2 m/s. In the experiments, the sampling rates for the acoustic and acceleration signals were set to 8 kHz and 1 kHz, respectively.

To preserve symmetry, 80 samples were chosen for each class, such that a total of 480 samples were obtained. Prepared samples were separated into two sets of equal numbers, namely, the training sets and testing sets. Representative examples of the signal token from five data sources at 2 m/s are presented in Figure 7. The horizontal axis on top represents the six different types of terrains. Each terrain type was sampled for a duration of 2 seconds and is shown separated by a dotted line.

4. Results and Discussion

The results of each trial generated a confusion matrix as depicted in Figure 4. For compactness, each confusion matrix is condensed into a single number, called the accuracy [33]. The accuracy can be calculated by averaging the true positive rates across six classes and, in this paper, it was employed as the measure of performance. Results obtained using the individual data source and handcrafted feature vector at 2 m/s are listed in Table 1.
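Condensing a confusion matrix into this accuracy measure can be sketched as below, assuming rows index the true class and columns the predicted class; the function name is hypothetical, and skipping classes without test samples is an assumption.

```python
def mean_true_positive_rate(confusion):
    """Average of the per-class true positive rates (rows = true class)."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total:  # skip classes without any test samples
            rates.append(row[i] / total)
    return sum(rates) / len(rates)
```

When every class has the same number of test samples, as in the balanced split used here, this value coincides with the overall fraction of correctly classified windows.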

For the acoustic-based method, accuracies between 53.8% and 89.6% were achieved. The best result in terms of accuracy was obtained using MMFCC and SVM, whereas the worst result was observed using gianna and shape and kNN. Interestingly, the results obtained for gianna and shape in this study are contrary to the conclusions reported in [11], in which a wheeled robot was used. In this study, the structural noise resulting from the track system vibration is believed to be the cause of these differences, because structural noise can behave differently from the robot-terrain interaction sound, and a wheeled robot would not produce as much structural noise as a tracked robot. Therefore, we believe it is reasonable that the gianna and shape features did not perform as well here as in the previous paper. Nonetheless, structural noise can also be considered part of the robot-terrain interaction sound, because the terrain being traversed causes the track system vibrations. Considering all the classifiers, MMFCC outperforms the other two acoustic feature vectors. For the acceleration-based method, FFT performed better than the other two feature vectors, and the best result was obtained with FFT and SVM (see Table 1). Due to their poor performance, the temporal and frequency characteristics features were removed from further experiments.

Referring to Table 1, for each classifier, the confusion matrices corresponding to the highest accuracy of each data source are employed as prior knowledge, as shown in Figures 4 and 8. It can be observed that different data sources have different distributions for the six terrains. Consequently, it is reasonable to believe that different data sources can complement each other and improve classification performance.

Figure 9 shows classification accuracies obtained using the proposed framework as a function of the number of optimal features. Based on the results, better performance is more likely to be achieved at higher speeds. When travelling at higher speeds, the robot can generate stronger interactions with the terrain, which results in acoustic and acceleration signals of higher magnitudes. The signals in turn lead to clearer terrain signatures. In addition, since the window size is constant for different speeds, a higher speed leads to a longer travel distance such that a larger amount of information is captured. Representative signal tokens at different speeds are shown in Figure 10.

From the classifier standpoint, SVM and RF clearly outperformed the other two simpler classifiers; however, it should be pointed out that they are relatively computationally expensive. The highest accuracy of 99.6% was achieved using RF with 90 and 100 optimal features at 2 m/s. Additionally, accuracies given by RF at 0.8 m/s were above 95%, except for the trial with 10 optimal features. However, when traversing at 0.4 m/s, all the accuracies given by SVM were above 85% except for the trial with 10 optimal features, while most of the accuracies given by RF were lower than 85%. The worst performance was given by kNN, with no accuracy above 95%; moreover, most of its accuracies at 0.4 m/s dropped below 80%. The results given by NB showed the least fluctuation in response to the number of optimal features: the ranges of its accuracies at 2 m/s, 0.8 m/s, and 0.4 m/s were 95.4%~97.5%, 87.1%~91.3%, and 77.5%~82.5%, respectively. The margins between the accuracies at different speeds are large and very clear. In conclusion, more optimal features do not always give better accuracy, which indicates that the most useful features should be determined in order to improve classification performance.

Tables 2–4 show the degrees of improvement achieved using the proposed framework. Based on the excellent performance of MMFCC and FFT, they will be adopted in the future as the benchmark feature vectors for acoustic-based and vibration-based methods, respectively. In this study, the classification performance of all classifiers was improved. Moreover, greater improvements were achieved for the robot traversing at lower speeds. As explained previously, signals collected at lower speeds have lower magnitudes, and, consequently, the corresponding terrain signatures are more likely to be affected by noise. Nonetheless, in comparison with the traditional methods, our proposed terrain classification framework succeeded in extracting discriminative information hidden in weak signals.

The purpose of terrain classification is to improve robot control; thus, in addition to classification accuracy, the classification time is another important factor for real-time implementation. Generally speaking, algorithms with shorter classification times are better suited to running online. Figure 11 compares the classification times of the four different classifiers on a single sample, measured on a computer with an Intel Core i5-6300HQ CPU and 3.89 GB RAM. The classification times are the average values of 10 runs for each classifier. It can be observed that classification times increase approximately linearly with the number of optimal features. The classification times obtained with NB increase more rapidly with the number of optimal features than those of the other classifiers, followed by kNN, SVM, and RF. From the standpoint of classification time, this study suggests that SVM is the best approach. However, the training times required by SVM can be up to several minutes, because the grid search method used to tune the parameters is quite time-consuming. Although the training can be done offline and is less important than classification time, it should also be considered. For larger training sets, the hours or even days of training time required by SVM would be unacceptable. While RF also suffers from long training times, it is a hundred times faster than SVM during the training phase but a hundred times slower than NB. Unlike the other classifiers, kNN must use each training sample for online classification, and thus its classification time largely depends on the size of the training set. For large training sets, kNN can become too slow to implement online. When there are fewer than 30 optimal features, NB performs better than kNN and RF in terms of classification time. In general, it is hard to determine which algorithm is the best in terms of computation time, because it is affected by several factors such as the size of the training set, the number of optimal features, and the offline training time.

5. Conclusions

In this paper, a new terrain classification framework was presented. The experiments were carried out with a tracked robot on six different terrains. Multiple sensors were employed to collect signals, and in total five data sources were used. A two-stage feature selection method was proposed to obtain optimal feature subsets, and a multiclassifier combination method considering prior knowledge was developed. Finally, four conceptually different classifiers were employed to perform the classification.

The results showed that the new framework improves classification performance with optimal feature subsets for all of the classifiers used. Only a small number of features effectively contribute to classification, which demonstrates the necessity of the feature selection operation. The different distributions of the confusion matrices resulting from the five data sources revealed that complementary information can be obtained from the classifier combination. In addition, greater improvements are achieved for signals collected at lower speeds, which means that our approach can extract discriminative information hidden in weak signals. The accuracies tend to increase at higher speeds, as higher speeds lead to stronger signals and longer travel distances. Regarding real-time properties, the classification times increase approximately linearly with the number of optimal features. Since the computation time is affected by several factors such as the size of the training set, the number of optimal features, and the offline training time, it is difficult to determine the best algorithm in relation to computation time. In this study, the SVM was found to be the best approach in terms of classification time. In comparison to traditional methods, this work suggests that the new framework could handle more complex terrain and increase the probability of detecting danger in advance owing to the presence of the acoustic modality. Another advantage is that the proprioceptive sensors used in this study cost much less than tactile sensors, cameras, and LADARs. In future studies, additional types of hazards such as marshland, desert, and streams should be considered. To provide further variation, different locations should be considered within each terrain type.


Abbreviations

LADAR: LAser Detection And Ranging
PSD: Power spectral density
PCA: Principal component analysis
FFT: Fast Fourier transform
SVM: Support vector machine
PNN: Probabilistic neural network
kNN: k-Nearest neighbors
NB: Naïve Bayes
SNR: Signal-to-noise ratio
MFCC: Mel-frequency cepstrum coefficient
MMFCC: Modified Mel-frequency cepstrum coefficient
ZCR: Zero crossing rate
STE: Short time energy
MSF: Mean square frequency
RMSF: Root mean square frequency
FC: Frequency center
RVF: Root variance frequency
VF: Variance frequency
RMS: Root mean square
LIBSVM: Library for support vector machines
RF: Random forests

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The authors acknowledge the support of the National Natural Science Foundation of China (Grant no. U1564210).