Abstract

Land transportation is a major source of noise in cities; it disrupts the natural balance and contributes to physiological and mental illnesses as well as occupational accidents. The objective of this research was to estimate the sound pressure at land terminals in the city of Jaen, Peru, using data mining algorithms. Six terminals in the city were monitored during 2019 using a class 1 sound level meter. An exploratory analysis was performed on the collected variables that influence terminal noise (minimum and maximum sound pressure level, number of light and heavy vehicles, and equivalent sound pressure level), and the variables were arranged into three groups for use with the data mining algorithms. Three algorithms were used, namely, artificial neural network, linear regression, and M5Rules, using the free software Weka. Considering all variables, the M5Rules method performed best, because its mean absolute error (0.7462), root mean square error (1.0575), and uncertainty (0.09) were the lowest obtained. However, for the two remaining groups of variables, the linear regression model showed the lowest mean absolute error and root mean square error, in addition to coefficients of determination close to one. The algorithms perform well when estimating the sound pressure of the terminals in the city of Jaen.

1. Introduction

Noise is a pollutant defined as the emission of energy originating from a set of airborne vibratory phenomena that, when perceived by the auditory system, can cause discomfort or ear injuries [1]. Transportation and accelerated population growth are the main causes of noise in cities, breaking the natural balance and harming people [2]. Noise negatively affects health, bringing with it physiological and mental illnesses [3]. It also has implications for society, for example, an impact on real estate prices and occupational accidents [4, 5]. Studies have correlated traffic noise variables with respiratory mortality [6]. Therefore, urban local governments should develop noise action plans; in this regard, geographic information on urban characteristics and noise levels is available in some places through smartphone applications [7, 8].

Vehicular traffic variables (flow, type of vehicle, etc.) explain much of the variability in noise levels [9], and noise mapping models can therefore be used. Geostatistical methods have been applied to predict and assess the distribution of noise in study areas, producing probability maps of noise levels for control and monitoring in cities [10]. Simplified methods are becoming a real need for the acoustic community in developing countries, because they make it possible to generate cost-effective traffic noise maps of a small city [11].

Models developed after the 1960s were not intended to predict single-vehicle levels but to predict the equivalent continuous level for traffic over a period. The commonly used models FHWA STAMINA, FHWA TNM, MITHRA, CoRTN, RLS90, STL-86, and ASJ-1993 have been analysed [12]. The European standard, common noise assessment methods for EU member states (hereafter CNOSSOS), Nord2000, and the traffic noise exposure (TRANEX) model based on the UK methodology have also been compared in terms of their source and propagation characteristics and by analysing the estimated noise [13]. Álvarez Rodenbeek and Suárez Silva [14] note that road traffic is becoming the main source of noise pollution in Chilean cities; they therefore validated mathematical models such as SP-48 and SP-96, the Swiss STL-86, the English CoRTN, the one developed by CONAMA01, and the Sánchez model calibrated in Valdivia to predict road traffic noise.

A noise prediction model has been developed using multiple regression analysis [15]. Machine learning techniques have been employed to accurately predict road traffic noise; the techniques applied were decision tree regression, support vector machine, and artificial neural network (ANN) [16]. Artificial neural networks have been applied for traffic noise prediction, with results showing that the ANN approach predicts the traffic noise level better than the other statistical methods considered [17]. In addition, the emotional artificial neural network has been presented as a new-generation neural network method for modelling road traffic noise [18]. To predict the noise level, an adaptive neurofuzzy inference system was developed and compared in detail with conventional soft-computing techniques such as ANNs, generalized linear models, random forests, decision trees, and support vector machines [19].

Public transportation in Peru is growing rapidly; in 2019, Lima alone ranked seventh in the world traffic ranking, with 57% congestion on its roads [20]. In Huancayo, Arana Velarde [21] determined that there is a relationship between the growth of urban mobility and the population increase. The effects of this growth fall on the population [22]. Therefore, it is necessary to model traffic noise for proper management in the city of Jaen, where motor vehicle activity is more intense in certain areas, such as stops for buses, cars, and motorcycle cabs. These places generate noise levels that exceed those established in Peru’s environmental quality standards [23, 24].

One of the relevant topics is the attention to and design of activities based on science modelling and real situations [25]. Support vector machine (SVM), multivariate adaptive regression splines (MARS), and random forest (RF) techniques have been applied to estimate the densimetric Froude number approximation for the incipient movement of riprap stones [26]. Robust artificial intelligence models, namely, MARS, gene-expression programming (GEP), and the M5 model tree, were employed to extract a precise formulation for estimating the pipe break rate [27]. In another study, standardized precipitation evaporation rate values were formulated for various climates using three robust artificial intelligence models, namely, GEP, model tree (MT), and MARS [28]. Models to estimate traffic noise are adequate tools to guide mitigation measures [29]. Although traffic noise models exist worldwide, they fail to generalize because local conditions, e.g., vehicle type and climate, vary from one location to another [30].

The research is justified because it is necessary to know the sound pressure level generated by the activity of land terminals in the city of Jaen, since it is a determinant of the quality of life of the inhabitants of the city, given the effects it has on their health and well-being.

In this context, the objective of the work was to estimate the sound pressure level of six land terminals located in the city of Jaen by means of data mining algorithms. The models obtained consider the conditions of the study area; they will be made available to institutions or persons interested in evaluating noise pollution so that appropriate actions can be taken to mitigate this problem.

2. Materials and Methods

The methodology consisted of four steps (Figure 1) as follows: (A) Environmental noise monitoring in six land terminals in the city of Jaen, Peru. Once the data were collected, they were cleaned and integrated. (B) Exploratory analysis of the data set to understand its nature and behavior. (C) Grouping of variables that influence the sound pressure level (LAeqt). (D) Use and evaluation of the behavior of data mining algorithms.

2.1. Environmental Noise Monitoring

Data were collected at six terminals in the city of Jaen (Figure 2), and the characteristics of the study area were determined. Sampling was nonprobabilistic; the land terminals were chosen on the basis of their vehicular flow and activities. Sound pressure levels were measured during daytime hours, five days a week, from Thursday to Monday; monitoring started on Monday, November 25 and ended on Sunday, December 22, 2019. The measurement schedule comprised three shifts as follows: morning shift from 07:01 to 09:30 hours, midday shift from 12:00 to 15:00 hours, and afternoon shift from 18:30 to 21:00 hours.

Sol del Norte was located on Alfonso Villanueva Pinillos Street, in front of the Peru-Ecuador binational park, with exits for light vehicles; there was fluid vehicular traffic during the three measurement shifts. Malca was located between Pedro Cornejo Neyra and Alfredo Bastos streets, with exits for light vehicles and interprovincial buses; there was fluid vehicular traffic during the three measurement shifts, in which cars, motorcycle cabs, and linear motorcycles were differentiated. Señor de Huamantanga was located on Mesones Muro Avenue, with interprovincial bus departures; there was fluid vehicular traffic in the morning and midday shifts, whereas the afternoon shift was discarded because there were no bus departures and bus arrivals were variable, most of them falling outside the measurement schedule. Crucero Jaen and Turismo Fernández were located on Mesones Muro Avenue, with interprovincial bus departures; there was fluid vehicular traffic during the three measurement shifts, in which cars, motorcycle cabs, linear motorcycles, and heavy vehicles were differentiated. Tetsur was located between Pedro Cornejo Neyra and La Marina streets, with interprovincial bus departures; there was fluid vehicular traffic in the midday and afternoon shifts, whereas the morning shift was discarded because activities began at 10:00 hours, which was outside the measurement schedule; at this terminal, cars, motorcycle cabs, linear motorcycles, and heavy vehicles were differentiated. Troya was located on Mariscal Castilla Street, with exits for light vehicles; there was fluid vehicular traffic in the morning and midday shifts, whereas no measurement was performed in the afternoon shift because activities ended at 18:00 hours; at this terminal, cars, motorcycle cabs, and linear motorcycles were differentiated.

Eight numerical variables were measured at the six land terminals in the city of Jaen as shown in Table 1.
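For readers unfamiliar with the equivalent sound pressure level, the following minimal Python sketch shows the standard energy average of a series of equal-duration level readings in decibels. The sound level meter reports this quantity directly, so the sketch is illustrative only; the function name `equivalent_level` and the sample readings are hypothetical and are not taken from the study data.

```python
import math

def equivalent_level(levels_db):
    """Energy-average a list of equal-duration sound levels (dB).

    Each level is converted to a relative energy, the energies are
    averaged, and the mean is converted back to decibels.
    """
    if not levels_db:
        raise ValueError("at least one reading is required")
    mean_energy = sum(10 ** (level / 10) for level in levels_db) / len(levels_db)
    return 10 * math.log10(mean_energy)

# Hypothetical readings, for illustration only.
steady = equivalent_level([60.0, 60.0])
mixed = equivalent_level([60.0, 70.0])
```

Because the average is taken over energies rather than decibel values, the louder reading dominates: the equivalent level of 60 dB and 70 dB is well above their arithmetic mean of 65 dB.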

2.2. Grouping of Variables

The seven independent variables described in Table 1 were grouped into three schemes as follows: Scheme 1 is composed of all variables. Scheme 2 is made up of […], […], motorcycle cabs, linear motorcycles, automobiles, and heavy vehicles. Finally, Scheme 3 is made up of Lmin, […], motorcycle cabs, linear motorcycles, and automobiles.

2.3. Exploratory Data Analysis

Table 2 shows the central tendency and variance statistics for the dependent variable at the six land terminals.

2.4. Data Mining Methods

An ANN is a computational model that emulates the biological neuronal system in information processing [31]. The multilayer perceptron (MLP) is a type of ANN whose architecture consists of several layers of interconnected nodes or neurons, each of which is connected to all the neurons in the next layer. The input layer is where the data are presented to the neural network, while the output layer contains the response of the neural network [32]. Linear regressions are statistical models in which the dependent variable is modelled as a linear function of the parameters [33]. The M5Rules algorithm generates a decision list for regression problems using a divide-and-conquer approach [34]. Other models were not addressed in this study, such as gene expression programming (GEP), a type of evolutionary computation (EC) algorithm based on genotype/phenotype, which has also been used in petroleum engineering for viscosity estimation [35, 36]. The multivariate adaptive regression splines (MARS) model, unlike the GEP model, is little used in health issues; it is a nonparametric modelling method that is developed systematically and automatically, without the assumptions that traditional regression models must meet [37].
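As an illustration of the linear regression approach, the following minimal Python sketch fits a multiple linear regression by ordinary least squares and scores it by k-fold cross-validation. It is not the Weka workflow used in the study: the data are synthetic, the four predictors stand in for hypothetical vehicle counts, and the choice of 10 folds and all function names are assumptions.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Evaluate the fitted linear model on the rows of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def cross_val_mae(X, y, k=10, seed=0):
    """Mean absolute error over k held-out folds (k = 10 is an assumption)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    abs_errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_linear(X[train], y[train])
        abs_errors.append(np.abs(predict_linear(coef, X[test]) - y[test]))
    return float(np.mean(np.concatenate(abs_errors)))

# Synthetic data, for illustration only: four hypothetical vehicle counts
# and a noise level generated from made-up coefficients.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 50, size=(120, 4)).astype(float)
true_coef = np.array([60.0, 0.10, 0.08, 0.05, 0.20])
y_demo = true_coef[0] + X_demo @ true_coef[1:] + rng.normal(0.0, 0.5, 120)

fitted = fit_linear(X_demo, y_demo)
cv_mae = cross_val_mae(X_demo, y_demo, k=10)
```

Each observation is predicted by a model that never saw it during fitting, so the cross-validated error is a fairer estimate of out-of-sample behavior than the error on the training data.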

2.5. Evaluation of Model Performance

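Model performance was assessed through cross-validation, using the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination ($R^2$). These values were calculated by means of equations (1)–(3) [38]:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|P_i - O_i\right| \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2} \tag{2}$$

$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(O_i - \bar{O}\right)\left(P_i - \bar{P}\right)\right]^2}{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2 \, \sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2} \tag{3}$$

where $P_i$ is the i-th value produced by the model, $O_i$ is the i-th value produced by measurement, $n$ is the number of observations, $\bar{O}$ is the mean of the observed values, and $\bar{P}$ is the mean of the values produced by the model.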
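The three performance measures can be sketched in a few lines of Python. This is a minimal illustration under the usual definitions (the coefficient of determination is taken as the squared correlation between observed and modelled values, since the text refers to the means of both); the function names and the four sample values are hypothetical and are not study data.

```python
import math

def mae(observed, predicted):
    """Mean absolute error between measured and modelled values."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error between measured and modelled values."""
    squared = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def r_squared(observed, predicted):
    """Squared correlation between measured and modelled values."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Hypothetical levels in dB, for illustration only.
observed = [70.0, 72.0, 74.0, 76.0]
predicted = [71.0, 71.0, 75.0, 75.0]

demo_mae = mae(observed, predicted)
demo_rmse = rmse(observed, predicted)
demo_r2 = r_squared(observed, predicted)
```

In this toy case every prediction is off by exactly 1 dB, so the MAE and RMSE coincide; with unequal errors the RMSE exceeds the MAE because it weights large deviations more heavily.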

2.6. Performance of Uncertainty and Reliability

The objective of uncertainty analysis is to bound the expected range in which the true value of the result of an experiment lies. This estimated range takes the form of an interval. One uncertainty analysis procedure is the uncertainty at a 95% confidence level (U95). The value of U95 associated with the result of a given experiment can be interpreted as follows: if the experiment is repeated many times, the true value of its result will lie within the stated uncertainty range approximately 95 times out of 100. The value of U95 is given by the following equation [39]:

Reliability analysis is a statistical method for measuring the overall consistency of a model. It determines whether a suggested model achieves a permissible level of performance. In general, a metric for the reliability analysis is defined by the following equation [39]:

$$\text{Reliability} = \frac{100\%}{n}\sum_{i=1}^{n} k_i$$

where $n$ is the number of observations and $k_i$ is obtained through two steps. First, the relative average error (RAE) is defined as a vector whose i-th component is given by the following equation, in which $P_i$ and $O_i$ are the i-th modelled and measured values:

$$\mathrm{RAE}_i = \left|\frac{P_i - O_i}{O_i}\right|$$

Next, if $\mathrm{RAE}_i \le \Delta$, then $k_i = 1$; otherwise, $k_i = 0$, where $\Delta$ is the threshold value of the parameter. In other words, $\sum k_i$ is the number of times the value of $\mathrm{RAE}_i$ is less than or equal to that of $\Delta$. The optimum value of $\Delta$ based on Chinese standards is 0.2, or equivalently 20%.
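The two-step reliability calculation can be sketched as follows. This is a minimal illustration that assumes the reliability is the percentage of observations whose relative error does not exceed the threshold; the function name, argument names, and sample values are hypothetical, and only the threshold of 0.2 comes from the text.

```python
def reliability(observed, predicted, threshold=0.2):
    """Percentage of observations whose relative error is within the threshold.

    Assumes every observed value is nonzero, which holds for sound
    levels expressed in decibels in this setting.
    """
    flags = []
    for o, p in zip(observed, predicted):
        relative_error = abs(p - o) / abs(o)   # step 1: relative error
        flags.append(1 if relative_error <= threshold else 0)  # step 2: compare
    return 100.0 * sum(flags) / len(flags)

# Hypothetical values, for illustration only: the last prediction is
# off by 30% and is the only one that fails the 20% threshold.
observed = [70.0, 72.0, 74.0, 50.0]
predicted = [71.0, 71.0, 75.0, 65.0]
demo_reliability = reliability(observed, predicted)
```

A reliability of 100% therefore means that every prediction fell within 20% of its measured value, which is a fairly permissive criterion for levels measured in decibels.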

3. Results

For scheme 1, the algorithm that showed the best performance in terms of the statistics was M5Rules. However, for schemes 2 and 3, the linear regression model showed the best performance in estimating sound pressure. In addition, the U95 and reliability indices were calculated (Table 3). Table 3 shows that, for scheme 1, the linear regression and M5Rules algorithms presented the lowest uncertainty at a confidence level of 95% (0.09). In addition, ANN and linear regression for scheme 1 reached the highest level of reliability (100%).

The model obtained for scheme 1 using the M5Rules algorithm consisted of five rules, as shown in Table 4. In addition, for schemes 2 and 3, two multiple linear regression models were obtained, as shown in equations (7) and (8), respectively, where M is the number of motorcycle cabs, ML is the number of linear motorcycles, A is the number of automobiles, and VP is the number of heavy vehicles.
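A rule-based model of this kind is an ordered decision list: the rules are tried in sequence and the first rule whose condition holds supplies a linear model for the prediction. The minimal Python sketch below shows that evaluation logic only. The two rules, their thresholds, and their coefficients are entirely hypothetical and do not reproduce the five rules of Table 4; the variable names `Lmin` and `Lmax` are likewise used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    """One entry of a decision list: a condition plus a linear model."""
    condition: Callable[[Dict[str, float]], bool]
    intercept: float
    coefficients: Dict[str, float]

    def predict(self, x: Dict[str, float]) -> float:
        return self.intercept + sum(c * x[name] for name, c in self.coefficients.items())

def predict_with_rules(rules: List[Rule], x: Dict[str, float]) -> float:
    """Apply the first rule whose condition matches the input."""
    for rule in rules:
        if rule.condition(x):
            return rule.predict(x)
    raise ValueError("no rule matched; the last rule should be a catch-all")

# HYPOTHETICAL rules and coefficients, for illustration only.
rules = [
    Rule(lambda x: x["Lmax"] <= 80.0, 20.0, {"Lmin": 0.5, "Lmax": 0.25}),
    Rule(lambda x: True, 30.0, {"Lmin": 0.25, "Lmax": 0.25}),  # catch-all
]

quiet = predict_with_rules(rules, {"Lmin": 60.0, "Lmax": 76.0})
loud = predict_with_rules(rules, {"Lmin": 64.0, "Lmax": 90.0})
```

Because the order of the rules matters, the final rule is usually an unconditional default that covers every input not captured by the earlier ones.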

4. Discussion

It is important to consider noise pollution in order to achieve sustainable cities [40]. Sound pressure levels from vehicles in urban environments are becoming a more pressing problem in cities of developing countries, owing to the lack of knowledge and low interest of public administrations in this problem [41]. The most immediate solution lies in reducing the flow of public transportation vehicles, as well as updating the vehicle fleet [42]. Vehicle reorganization can contribute to the improvement of air quality and the health of the resident population, as well as to the reduction of environmental pollution [43]. Likewise, the Peruvian Congress has implemented strategies to minimize air pollution problems. Through the Organic Law of Municipalities (Ley 27972), it was agreed to monitor signaling systems, land terminals, and pedestrian traffic [44].

Linear regression models can indicate predictive accuracy for any environmental factor [45]. Simple and multiple linear regression models have been adjusted to the conditions of an urban area, where they depend on the number of vehicles and the equivalent noise [46]. In another study, the mean square error shows that the proposed regression model predicts sound pressure accurately [47]. The present study shows that the estimation of sound pressure (mean absolute error of 0.7462) was efficient.

Data mining has diverse applications, such as estimating the behavior of chemical elements [48, 49]; it makes it possible to build predictive models for decision-makers [50] and to identify the factors that influence the variables, which helps to interpret the behavior of the three models. Therefore, the noise generated by the vehicle fleet needs to be characterized through economical techniques, such as data mining [51]. Data mining algorithms make it possible to identify urban reality, for example, the collective behavior of citizens. These results help to solve urban problems in transportation, environment, and public safety [52]. The Random Forest algorithm allows data to be filtered out and values adjusted to reality to be obtained [53]. In addition, the Random Forest model has been reported as the most appropriate; it is widely used to estimate noise and has been applied in telecommunications and in the activation of circuits with simple voice recognition, among others [54].

The lowest mean absolute error was obtained for scheme 1. The R2 for scheme 1 was 0.8310, which is higher than the values reported for noise levels measured in London (R2: 0.56–0.73) [55]. In this sense, the method was used to validate the accuracy of the correlation and to compare the sound pressure level [56]. The […] yielded the lowest error on all the data evaluated, as stated by [49]. Also, the mean absolute error shows the difference between modeled and observed values in absolute terms [57].

5. Conclusions

The sound pressure at six land terminals located in the city of Jaen was estimated using data mining algorithms. Considering all the variables collected, the M5Rules method presented the best performance, because its mean absolute error (0.7462) was lower than those of the artificial neural network and multiple linear regression algorithms. However, for schemes 2 and 3, the linear regression model showed the lowest mean absolute error and coefficients of determination close to one, which indicates better behavior. It is concluded that data mining techniques can be used to estimate the sound pressure of the terminals in the city of Jaen, Peru.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank the Professional School of Forestry and Environmental Engineering of the National University of Jaen for the guidance given to the researchers in the development of their research for their professional degree.