Earthquakes have been a leading cause of death from natural disasters over the last 50 years; they have claimed thousands of lives, devastated property, and had a negative economic impact worldwide. In this paper, a novel Ensemble Earthquake Prediction Method (EEPM) is proposed and implemented to produce a strong learner (ensemble method) with better prediction accuracy, less variance, and fewer errors. Parameter data, which are continuous in nature, were collected from two countries, India and Nepal, over five years, and surveyors' data (precursors), which are categorical in nature, were collected from three countries, India, Nepal, and Kenya, over five years for specific earthquake-prone regions. The preprocessed data set is generated by combining the parameter and precursor data. EEPM focuses on detecting the early signs of an earthquake more accurately and on finding the probability of occurrence of an earthquake in the specified region, i.e., on better prediction and robustness. EEPM produced less variance and less error than individual machine learning methods, as well as a better accuracy, 87.8%, than state-of-the-art ensemble methods. Earthquake prediction can alert not only the public but also the relevant organizations to the likely range of magnitude and the dynamics of occurrence of an earthquake.

1. Introduction

Earthquake prediction means the concise forecasting of the time, size, and location of an impending earthquake, for example from isotopic and geochemical precursors of earthquakes and volcanic eruptions. Earlier researchers have stated that earthquake forecasts in different regions take factors such as local geological conditions, ground motion, and animal behaviour into account when assessing seismic casualties. Past studies, notably in the USA, Japan, China, and Israel, have searched for earthquake precursors through regional studies based on parametric features. Researchers in India, the USA, Pakistan, Nepal, and elsewhere, on the other hand, have investigated earthquake prediction using parameters, on the assumption that regional factors can be filtered out and that general information about earthquake patterns can be extracted from the parameters. When the energy stored in elastically strained rocks is suddenly released, it causes intense ground shaking in the area near the source of the earthquake and propagates through the Earth as elastic waves called seismic waves. Earthquakes can be generated by many factors, such as sudden volume changes in minerals, sudden slippage along faults, ground motion, bomb blasts, volcanic eruptions, heavy rainfall, rock bed material, regional tectonics, and altitude, as noted by meteorologists, seismologists, and geologists. For this work, 80 systematically selected, high-quality, peer-reviewed research papers that apply data mining methods to earthquake prediction were thoroughly read and analysed. In these papers, different data mining methods are applied to the data sets to predict the probability of occurrence of an earthquake.

In this research paper, the parameters and precursors of earthquake-related data are collected from different sources and then combined into a single preprocessed data set that has all the features required for prediction. The categorical data are converted to numerical data so that a common numerical data set format with the necessary attributes is obtained. This preprocessed data set can be used by different individual methods for analysis and regression. Features are selected using the Pearson correlation coefficient, a suitable data splitting ratio is chosen, and each contributing method is fitted and trained so as to minimize training errors by reducing dimensionality and increasing computational efficiency. The data set is applied to different data mining methods, namely KNN, SVM, XGBoost, decision tree, and random forest, to generate the results of the individual methods, which are then compared using performance measures such as R², adjusted R², variance, Mean Square Error (MSE), and Root Mean Square Error (RMSE); the individual methods show predictions with high variance and errors [1]. A novel ensemble method, the Ensemble Earthquake Prediction Method (EEPM), which uses boosting over a single contributing method, is framed to combine a set of weak learners into a strong learner in each iteration so that the ensemble generalizes to new data sets. In boosting, the base learners are generated sequentially by the individual methods in such a way that the weak learners are eliminated after each iteration. The present base learner is always more effective than the previous one, because a decision tree with a single feature is fitted to the training data set by choosing the right decision at each split of the tree. The predictions from the trees are combined using a random forest, with overlapping results averaged for regression. The accuracy of the resulting ensemble method is compared with that of previously developed ensemble methods for predicting the probability of occurrence of an earthquake.

2. Literature Survey

The proper selection of data sets plays an important role: a few researchers used geological observations and historical data of a particular region or country so that the attributes are justified and have a strong relationship with seismic activity. The papers in the literature deal with the selection of parameters and precursors from different data sets in order to derive a logically sound preprocessed data set with independent and dependent attributes. The preprocessed data thus become a new, stronger data set to which different techniques such as regression, KNN, SVM, decision tree, and random forest are applied [2, 3].

Earthquake forecasting in a specific region has historically been based on identifying fault characteristics or parameters, such as depth, magnitude, length, latitude, longitude, and time, and precursors, such as the movement of animals, changes in trees and plants, and changes in temperature, pressure, and radon gas, to estimate the occurrence of an earthquake. The magnitude of an earthquake is estimated using all available fault parameters and precursors simultaneously, and the chance of inconsistent estimates is excluded by applying data mining techniques for forecasting [4].

2.1. Comparison of Research Findings with Parameters, Precursors, and Previous Ensemble Methods

The literature shows that many researchers have attempted to predict earthquakes by observing multiple parameters based on observational data and deriving patterns and relationships, while others have used multiple precursors and observed changes in the seismicity patterns of a region. It has been observed that data mining techniques deliver better prediction accuracy for short-term and midterm earthquakes than for large earthquakes. The analysis of the literature using parametric data is shown in Table 1, and that using precursor data is shown in Table 2. Analyses of earlier ensemble methods are shown in Table 3.

Many researchers have worked on ensemble methods in different application areas. In one work, a system is designed with an ensemble method that combines different data mining and network techniques to detect an unexpected voltage leak from electric devices into a house in order to save lives and resources [17]. In another work, the proposed model achieves integrity by embedding a security feature, the Elliptic Curve Digital Signature Algorithm (ECDSA), for the predicted area of water bodies, which helps to secure the key and the detected water bodies during transmission over a channel; the ensemble method combines data mining techniques such as networking, XGBoost, and random forest [18]. In a further work, the effectiveness of the proposed model is tested with various machine learning algorithms, such as random forest, k-nearest neighbour, and decision tree, on an actual IoT-based data set, and the ensemble method shows better accuracy [19].

3. Sources of Data Collection

A uniform, nonredundant earthquake catalogue has been compiled. The parameter data were collected from two countries, India and Nepal, over five years. The precursor data were collected over five years from people aged 18 to 75 who had actually experienced earthquakes in three countries, India, Nepal, and Kenya. Some respondents had experienced an earthquake more than once. The parametric data have no null records and no outliers, as the database was cleaned by the data source. The earthquake catalogue covers four zones, East, West, South, and North, of the two countries, as listed in Table 4.

3.1. Data Source of Parameter Data

(1) The database was provided by the Meteorology Department of India and Disaster Research (India Today)
(2) The National Centre for Medium Weather Forecasting is located at Sector-50, Noida
(3) The India Meteorology Department, located at Block M, Lodhi Road, Delhi, extended its support in making the historical data available

Many sources were used, as the historical data were collected from two countries. A few are listed below (source links):
https://en.wikipedia.org/wiki/Geology_of_India
https://www.ngdc.noaa.gov/nndc/struts/form?t=101650&s=1&d=1
https://en.wikipedia.org/wiki/National_Geophysical_Data_Center
https://www.indiatoday.in/diu/story/300-disasters-80-000-deaths-100-crore-affected-india-s-two-decade-tryst-with-natural-calamities-1767202-2021-02-08

3.2. Data Source of Precursor Data

The precursor data were collected through a Google Form survey; respondents who had experienced an earthquake more than once in their lifetime were also included. Duplicate responses were removed from the survey data, and the cleaned database was used for its relevance to real-time earthquake events (refer to Table 5).

3.3. Data Set Description

For the parameter data, the input attributes are X1 to X8 and the target values are Y1 to Y8. The target values are numeric. The critical parameter, magnitude, which is related to the time interval between the arrival of the P-waves and S-waves, is the dependent attribute, and the other parameters are independent attributes. The details of the attributes of the parameter data set are tabulated in Table 6.

For the precursors, the input attributes are P1 to P10 and the target values are Q1 to Q10, as tabulated in Table 7. The target values are converted to numeric values. Each categorical survey answer is denoted by a number; for temperature, for example, very high is represented by 1, high by 2, low by 3, very low by 4, and not sure by 5. Magnitude is the dependent attribute, and the others are independent attributes.
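The conversion of categorical survey answers into numbers can be sketched as follows. This is a minimal illustration that assumes the five-level scale described above; the name `encode_response` and the sample answers are hypothetical and do not come from the survey itself.

```python
# Ordinal codes for the five survey response levels (assumed reading of the scale).
RESPONSE_CODES = {
    "very high": 1,
    "high": 2,
    "low": 3,
    "very low": 4,
    "not sure": 5,
}


def encode_response(answer: str) -> int:
    """Map one categorical survey answer to its numeric code."""
    key = answer.strip().lower()
    if key not in RESPONSE_CODES:
        raise ValueError(f"unknown survey response: {answer!r}")
    return RESPONSE_CODES[key]


# Hypothetical answers for the temperature precursor.
answers = ["Very High", "low", " not sure "]
encoded = [encode_response(a) for a in answers]
print(encoded)
```

Rejecting unknown answers, rather than silently mapping them to a default code, keeps spelling variants in the survey from entering the numeric data set unnoticed.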

3.4. Preprocessor Data

A preprocessed data set is generated by combining the parameter and precursor data sets. The parameters are collected from two countries, India and Nepal, and the precursors are collected from three countries, India, Nepal, and Kenya. A graphical representation makes it easy to identify the different elements of the process and to understand the interrelationships among the steps. The sequential steps for developing the preprocessed data are put in a logical order, as shown in Figure 1, so that data mining techniques can later be applied to predict the occurrence of earthquakes.

3.5. Data Integration

Data preprocessing involves operations that organise the data, in rule-based and database-driven applications, into a proper format for better interpretation in the data mining process. Data integration is the part of data preprocessing in which the parameters and precursors are combined; standardizing and normalizing routines are then applied to transform the data into a preferred, consistent format. A reduced representation of the data is obtained by reducing the number of attributes, the number of attribute values, and the number of tuples, while still producing the same or similar analytical results.
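The integration and standardization described above can be sketched in a few lines. This is an illustration under stated assumptions: the records, the join on `location` and `date`, and the use of z-scores are hypothetical choices for the example, since the text does not specify the join key or the normalizing routine.

```python
from statistics import mean, pstdev

# Hypothetical parameter records (catalogue data) keyed by location and date.
parameters = [
    {"location": "A", "date": "2020-01-05", "magnitude": 4.3, "depth": 10.0},
    {"location": "B", "date": "2020-02-11", "magnitude": 5.1, "depth": 35.0},
    {"location": "C", "date": "2020-03-20", "magnitude": 4.7, "depth": 22.0},
]
# Hypothetical precursor records (already encoded survey answers).
precursors = [
    {"location": "A", "date": "2020-01-05", "animal_behaviour": 1},
    {"location": "B", "date": "2020-02-11", "animal_behaviour": 3},
    {"location": "D", "date": "2020-04-01", "animal_behaviour": 2},
]


def integrate(params, precs, keys=("location", "date")):
    """Inner-join the two record lists on the shared key attributes."""
    index = {tuple(p[k] for k in keys): p for p in precs}
    merged = []
    for row in params:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


def standardize(rows, column):
    """Return z-scores for one numeric column (population standard deviation)."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


combined = integrate(parameters, precursors)
depth_z = standardize(combined, "depth")
print(len(combined), depth_z)
```

Only events present in both sources survive the inner join here; whether unmatched records should instead be kept with missing values is a design decision the text leaves open.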

3.6. Feature Selection

After data integration, the attributes are selected and extracted to build the new preprocessed data set; the strength of the linear association between two variables is measured by applying the Pearson coefficient. Feature selection on the preprocessed data relates the independent attributes to the dependent attribute.

Some attributes are not very significant for model construction and prediction, yet they raise the dimensionality of the feature set, which makes the data harder to analyse, requires more training time, and makes the decision more complex. Fewer attributes can give better accuracy; thus, less significant attributes are not included in the data set. The attributes that were selected are logically justified and explained in Table 8.

Pearson's correlation coefficient: here, the correlation between continuous variables in the statistical and surveyor data is obtained with Pearson's correlation coefficient, using the method of covariance. The correlation relationships are defined below:
(i) A significant relationship is built on the numeric data, and the character data are encoded to numeric for a stronger relationship and correlation
(ii) If the correlation coefficient is high, the attributes are strongly correlated, and one of them can be discarded
(iii) If the correlation coefficient is 0, the attributes are uncorrelated, and if it is negative, one attribute discourages the other; i.e., if the value of one attribute increases, the value of the other decreases
(iv) The correlation coefficient is measured on a scale that varies between "+1" and "-1"
(v) When one variable increases and the other variable also increases, the correlation is positive; when one decreases and the other increases, it is negative. Complete absence of correlation is represented by 0

The values of all correlation attributes are shown in Figure 2.
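The rules above can be illustrated with a short computation of the Pearson coefficient and a check for redundant attribute pairs. This is a sketch only: the column values, the names `pearson` and `redundant_pairs`, and the 0.9 threshold are assumptions made for the example, not values taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def redundant_pairs(columns, threshold=0.9):
    """List attribute pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) > threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical attribute columns, for illustration only.
data = {
    "depth": [10.0, 20.0, 30.0, 40.0],
    "depth_copy": [11.0, 21.0, 31.0, 41.0],  # shifted copy of depth
    "temperature": [15.0, 9.0, 14.0, 10.0],
}
print(redundant_pairs(data))
```

For each pair reported as redundant, one attribute can be dropped as in rule (ii); which of the two to keep is a modelling decision that this sketch does not make.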

4. Implementing Existing Techniques for Earthquake Prediction Using Regression

Supervised machine learning techniques, namely SVM, KNN, XGBoost, decision tree, and random forest, are applied to the data set to generate the results. Mathematical equations for the supervised learning techniques are shown in Table 9. The most satisfactory result is generated by the random forest technique, as random forest splits the nodes of the preprocessed data and then selects the split that results in homogeneous subnodes. In KNN, a relationship between pressure and animal behaviour is observed; in most cases, time varies within a similar range. In SVM on depth with cross-validation, there is a relation between falling of leaves and temperature, although the monitored times differ (morning and evening). Applying XGBoost shows that the relationship between atmospheric pressure and temperature lies within a close range of depth for the magnitudes of occurrence and across the different ranges of depth. The decision tree findings suggest a strong relationship between the possibility of occurrence of an earthquake within given ranges of magnitude and temperature and animal behaviour. The subnodes created in random forest, such as temperature, atmospheric pressure, location, and magnitude, show a relationship with falling of leaves, water movement in water bodies, pressure, and temperature for specific ranges of magnitude. The graphs generated by the five techniques showed more or less similar results: when the attributes are compared across the five techniques, the contributing factors lie in close proximity, which also reflects the accuracy of the findings of each individual technique. Once the mapping is done with linear regression on the preprocessed data, generating relationships and correlations among the attributes, a consolidated magnitude range between 4.1 and 5.14 is found in most cases. The findings of the implementation of the existing techniques are shown in Table 10.

5. Novel Ensemble Earthquake Prediction Method

Many earlier research works on earthquake prediction were carried out on parameters only, i.e., on historical data, and many others on precursors only, i.e., on survey data, by implementing one or more data mining methods. Only a few research works combine both parameters and precursors and use multiple data mining techniques. This scarcity of research on combined data motivated the generation of a preprocessed data set that combines parameters and precursors. Combining the data sets is of interest for predicting the possibility of occurrence of an earthquake using regression and is expected to produce better results. Because the results generated by the individual techniques have high variance and errors and are restricted to linear data sets, as discussed in Section 4, an ensemble method is developed that combines the individual methods effectively so that the prediction not only has low variance and low errors but also has better accuracy than earlier ensemble methods. The new ensemble method is intended to be precise and robust, with fewer training errors and lower dimensionality, and to be usable on nonlinear data sets in the future. It operates on two visible components: one finds the average squared error of the individual models, and the other quantifies the prediction through the interaction of the individual predictions generated by the individual techniques.

5.1. Work Flow of Novel Ensemble Earthquake Prediction Model (EEPM)

The novel Ensemble Earthquake Prediction Method (EEPM) is described stepwise, in the following steps: data collection, preprocessing of data, data splitting, implementation, and performance measure evaluation. The work flow of EEPM is shown in Figure 8.

5.2. Working of EEPM
5.2.1. Step 1: Data Collection

The parameters, namely location, date, time, magnitude, depth, temperature, longitude, and latitude, are collected for two countries, India and Nepal, over five years (refer to Table 1). The precursors, namely location, date, time, temperature, atmospheric pressure, water movement, animal behaviour, falling of leaves, and rainfall, are collected over five years from people aged 18 to 75 who actually experienced earthquakes in three countries, India, Nepal, and Kenya, as shown in Table 2. The details of data collection are explained in Section 3.

5.2.2. Step 2: Preprocessor Data

A preprocessed data set that is accurate, consistent, and complete is generated by combining the parameters collected from two countries, India and Nepal, with the precursors from three countries, India, Nepal, and Kenya; it is used by the different techniques, as shown in Figure 2 and explained above in Section 4. The process of combining parameters and precursors is shown in Figure 1. Here, parameters X1 and Y1 and precursors P1 and Q1 are dependent data, which are combined to obtain the preprocessed data set used by the individual techniques.

5.2.3. Step 3: Splitting Data

The data set is split into two parts, training data and test data. The training data are the substrate for estimating parameters, comparing models, and all other activities required to reach a final algorithm. The split assigns 70% of the data to the training set, and the remaining 30%, the test set, is used for a final, unbiased assessment of the algorithm's performance.
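A 70/30 split of this kind can be sketched as below. The function name `train_test_split`, the fixed random seed, and the stand-in records are assumptions made for illustration; the text does not state whether the records were shuffled before splitting.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # stand-in for 100 preprocessed records
train, test = train_test_split(records)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so that different methods are compared on exactly the same training and test records.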

5.2.4. Step 4: Framing of Ensemble Earthquake Prediction Model (EEPM)

Boosting is a combination of algorithms in which weak learners are converted into a strong learner over multiple iterations. At the end of each iteration, the new model yields a stronger learner with lower bias by focusing its effort on the observations that are most difficult to fit, which also lowers variance. The ensemble model is a weighted sum of weak learners. The preprocessed data set is initialized, and an equal weight is assigned to each of the data points. Boosting continues until a satisfactory result is obtained. The steps are as follows (refer to Figure 9).

(1) (a) On the training data set $\{(x_i, y_i)\}_{i=1}^{n}$, a base model is built to predict the observations. Given a differentiable loss function $L(y, F(x))$ and the number of iterations $M$, the model is initialized with the constant value $\gamma$ (predicted value) for which the loss function is minimum:

$$F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

(b) As the target column is continuous, a squared-error loss function is used:

$$L(y_i, F(x_i)) = \tfrac{1}{2}\,\bigl(y_i - F(x_i)\bigr)^2$$

Here, $x_i$ is the vector of input variables, defined from X and P; $y_i$ is the output (observed) variable, from Y and Q; $L$ is the loss function; $\gamma$ is the predicted value; "argmin" is the argument of the minimum; $M$ is the number of iterations; $m$ is the index of the DT (decision tree) built in an iteration ($m = 1$ means the first DT and $m = M$ the last DT); $n$ is the number of records; $F_{m-1}$ is the previous model; $r_{im}$ is the pseudoresidual generated for the $m$-th DT; and $h_m(x)$ is the DT fitted to the residuals.

Initialize the probabilities of the distribution as $1/n$, where $n$ is the number of data points. Given the data points $(x_i, y_i)$, the aim is to find the function $F(x)$ that produces an output almost equal to $y$. In real scenarios, however, there is a difference between the predicted output $F(x_i)$ and the actual output $y_i$. This difference is called the residual, $y_i - F(x_i)$. In gradient boosting, another model is then trained on the data points $(x_i, y_i - F(x_i))$. The target variable is updated in this way for each training model up to the final model, and the target variable for model $m$ is represented as $y_i - F_{m-1}(x_i)$.

An algorithm is fitted on the training data using the respective probabilities. The pseudoresiduals are then found so that a new model can be fitted on them. To update the model, the loss function is calculated.

The new model is added to the older model, and the next iteration is continued.

(2) For $m = 1$ to $M$:

(a) The pseudoresiduals are calculated for all records $i = 1, \dots, n$:

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}} = y_i - F_{m-1}(x_i)$$

All data points are from the same preprocessed data set, and the data are converted to numeric.

(b) A DT $h_m(x)$ is fitted to the pseudoresiduals, and the output value of each leaf of the DT is calculated in terms of the residuals by taking the average of the values in the leaf (refer to the variable definitions above).

(3) The multiplier $\gamma_m$ is computed by solving the following optimization problem:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)$$

(4) The model is framed: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$

(5) Output $F_M(x)$
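The boosting steps above can be sketched with a generic gradient boosting loop. This is an illustration under stated assumptions, not the EEPM implementation: it uses squared-error loss, depth-1 decision trees (stumps) on a single feature, and a fixed `learning_rate` in place of a per-iteration multiplier; the data values and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all inputs identical: predict the mean residual
        mean_r = sum(residuals) / len(residuals)
        return lambda x: mean_r
    _, threshold, lmean, rmean = best
    # Each leaf outputs the average residual of the records it contains.
    return lambda x: lmean if x <= threshold else rmean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Build an additive model of stumps fitted to successive residuals."""
    f0 = sum(ys) / len(ys)  # constant that minimises squared-error loss
    stumps = []
    preds = [f0] * len(ys)
    for _ in range(n_rounds):
        # Pseudoresiduals for squared-error loss: observed minus predicted.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def model(x):
        return f0 + learning_rate * sum(s(x) for s in stumps)

    return model


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical single-feature data: an encoded precursor versus magnitude.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [4.1, 4.2, 4.3, 4.4, 5.0, 5.1, 5.3, 5.4]
model = gradient_boost(xs, ys)
mse_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mse(ys, [model(x) for x in xs])
print(mse_baseline, mse_boosted)
```

Each round fits a new stump to what the current model still gets wrong, so the training error of the combined model falls below that of the constant starting model. A small learning rate trades more iterations for smoother updates.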

EEPM produced the following findings: the probability of occurrence of an earthquake is higher in the morning, and abnormal animal behaviour and falling of leaves are predominant, together with a rise in water movement, mostly at cold temperatures. Boosting uses a gradient algorithm that fits many models on samples of the preprocessed data set; it fits many different techniques using another method and learns how best to combine the predictions of the different decision trees and the random forest, including any extra tree or trees. The training models, denoted TM1, TM2, and so on up to the final model, are constructed by applying data mining techniques such as the linear regression model, a comprehensive technique that combines different algorithms, generates graphs, and establishes relationships, yielding forecasting models TF1, TF2, and so on (refer to Figure 10). The generalization ability of an ensemble is usually significantly better than that of a single learner, which makes ensemble methods very attractive; the ensemble model considered here is shown in Figure 9.

5.2.5. Step 5: Comparative Analysis of EEPM with Different Individual Methods Based on Linear Regression

The comparative analysis of EEPM with the existing techniques is based on performance measures, namely R², adjusted R², Mean Square Error (MSE), Root Mean Square Error (RMSE), and variance. In this paper, these measures are computed for the individual regression-based methods to determine the correctness of their results, and they are compared with those of EEPM to show how the ensemble method improves on the individual methods. The R² value is 0.64 for SVM, 0.76 for KNN, 0.74 for XGBoost, 0.80 for decision tree, and 0.82 for random forest; EEPM has the maximum value, 0.88. The values of the other measures, adjusted R², variance, MSE, and RMSE, are also calculated (refer to Table 11). The graphical analysis of the results is shown in Figure 11.
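The performance measures listed above can be computed as in the following sketch. It uses the standard definitions of R², adjusted R², MSE, and RMSE; the variance measure is omitted because the text does not define how it is computed. The function name `regression_metrics` and the sample values are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, MSE, and RMSE for a set of predictions."""
    n = len(y_true)
    if n != len(y_pred) or n - n_features - 1 <= 0:
        raise ValueError("need matching lengths and n > n_features + 1")
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"r2": r2, "adjusted_r2": adj_r2, "mse": mse, "rmse": sqrt(mse)}


# Hypothetical observed and predicted magnitudes for five test records.
observed = [4.0, 4.5, 5.0, 5.5, 6.0]
predicted = [4.1, 4.4, 5.2, 5.4, 5.9]
metrics = regression_metrics(observed, predicted, n_features=2)
print(metrics)
```

Because RMSE is the square root of MSE, reported pairs of the two values can be checked against each other; adjusted R² is never larger than R² when at least one predictor is used.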

5.3. Comparative Analysis of EEPM with Earlier Ensemble Methods

Most previous ensemble methods were carried out using either regression or classification and were typically built by stacking, which ensembles the best predictions from multiple well-performing machine learning methods. All previous ensemble methods are compared in Table 12 on different measures. EEPM, in contrast, uses boosting, which transforms weak learning methods into strong ones. Accuracy of prediction is the key measure used for the comparison with the previous ensemble methods, from which a conclusion is drawn (refer to Figure 12).

The individual techniques do not capture the patterns and relationships in the data accurately and show high bias and high variance on the training and test data sets, whereas the ensemble method captures the patterns and relationships in the training data set with low bias error and low variance. By integrating different methods through the combined preprocessed data (parameters and precursors) and by combining many weak learners, EEPM generates one strong learner. The relationship of pressure and temperature with animal behaviour is the most prominent, and earthquakes are more likely to occur in the morning in cold regions, within a specific magnitude range between 4.1 and 5.8. For EEPM, R² is 0.88, adjusted R² is 0.85, variance is 0.19, MSE is 0.20, and RMSE is 0.44, with an accuracy of 87.8%; it is therefore concluded that EEPM can produce better forecasts, with better accuracy, for earthquake-prone areas. It is also concluded that seismologists, meteorologists, and institutions related to earthquake studies should carry out regular checks in earthquake-prone areas during the morning, during falls in temperature and pressure, and when unusual animal movement or unusual behaviour of trees and water bodies is observed, as this can increase the possibility of forecasting an earthquake.

6. Contribution of Work

Given the great threat of earthquakes in earthquake-prone locations, it is very important to develop an effective system of risk assessment, and the prevention of the negative effects of earthquakes is a necessity. Forecasting this natural disaster is admittedly problematic, but applying ML methods to a proper data set and then combining them in an ensemble gives an opportunity to predict the possible location of an earthquake and its range of magnitude by observing external features such as unusual animal movement, falling of leaves, and rapid rises and falls in temperature and pressure, though without determining the precise date and time of the disaster. Nevertheless, it is possible to forecast an earthquake within a few months, for instance, which makes people better prepared for the disaster. Taking all of the above into account, it is possible to conclude that earthquakes occur regularly in the IUB area and therefore represent a serious threat to the people living there. EEPM may encourage seismologists and researchers to apply new technologies, other ML techniques, and different ensemble methods, using this work as a base, to obtain more accurate results.

The novelty of this research is the genuine data set recorded at the locations where earthquakes actually occurred. Data from people who experienced earthquakes were also recorded, and a combined data set was prepared from both. This novel data set is applied to ML methods for individual prediction, and an ensemble technique is then applied over the individual techniques to obtain a better and more accurate prediction.

A limitation of the study is that claims of breakthroughs in finding reliable precursors have so far failed to withstand scrutiny. The occurrence of an earthquake is highly sensitive to unmeasurable fine details of the state of the Earth in a large volume, not just in the immediate vicinity of the hypocentre.

7. Conclusion and Future Scope

The research concludes that seismologists, meteorologists, and institutions related to earthquake studies should carry out regular checks in earthquake-prone areas during the morning, during falls in temperature and pressure, and when unusual animal movement or unusual behaviour of trees and water bodies is observed, as this can increase the possibility of forecasting an earthquake.

The preprocessed data set generated here can be used by other data mining methods for regression as well as classification. It can also be extended with one or more attributes to obtain better results and predictions.

The novel EEPM can help researchers, seismologists, and meteorological departments to understand the application of different data mining methods individually as well as the power of ensemble methods for better and more accurate prediction. EEPM can be strengthened by applying classification as well as regression, by combining more individual data mining methods, more iterations, and multiple decision trees, and by using nonlinear data. EEPM can also be extended to use stacking and to apply more individual methods, for example in neural networks and other systems, starting from its decentralized origin.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

I would like to express my gratitude to my supervisor Dr. Prinima Gupta and cosupervisor Prof. (Dr) Felix Musua for their expert advice and encouragement. I would also like to thank Mr. Vobbani Venkateswarlu, Mr. Atharva Kulkarni, Mr. Nikhil Sahu, and Mr. Saikat Das for their honest support and cooperation.