#### Abstract

To reduce investment risk, the evaluation standards for transmission line project investment planning have become stricter, which places higher demands on predicting a reasonable cost level for such projects. This paper combines principal component analysis (PCA) with the least squares support vector machine (LSSVM) to establish a point prediction model for transmission line project cost. Based on an analysis of the point prediction error, kernel density estimation (KDE) is introduced to estimate the probability density function of the error. Cost intervals are then derived for different confidence levels, which gives the reasonable level of transmission line project cost. The results show that the coverage rate of the cost prediction interval at the 85% confidence level is 88.57%. This indicates that the model is reliable and can provide a sound basis for evaluating transmission line project investment planning.

#### 1. Introduction

With the rapid development of the national economy, the demand for electric power is increasing. Transmission line projects are an important part of power grid construction, and rational evaluation of their investment planning is an important part of cost control. At present, the cost control line [1] and the general cost [2] are the criteria most often used for evaluation in the power industry. Both give a single specific value, which makes the evaluation results less compatible. Therefore, to evaluate investment planning reasonably and determine the reasonable cost level of transmission line projects, reasonable cost intervals should be given on the basis of the specific cost control lines.

There are many factors affecting the transmission line projects’ cost, which have the characteristics of randomness and instability. However, the general point prediction results cannot represent the variability of the transmission line project cost, and the information provided by them is often insufficient to meet the requirements of investment decision-making, which brings risks to the decision-making work. If the deterministic point prediction results can be given, and the fluctuation range of cost can be described at the same time, it will be helpful for power enterprises to make more reasonable investment decisions and make more reasonable investment planning evaluation. Therefore, the purpose of this paper is to study the interval prediction method of transmission line project cost, get the fluctuation interval of the cost at a certain confidence level, and obtain the prediction results in the sense of probability. The results of probabilistic prediction can provide more valuable information to decision makers, help them better grasp the changes of data, and also help power enterprises to make investment planning, risk analysis, and reliability evaluation.

With the development of machine learning and intelligent algorithms, research on project cost point prediction has advanced rapidly, and prediction accuracy has improved greatly. Ji and Abourizk constructed a special absorbing Markov chain that accounts for the uncertainty of quality-induced rework and stochastically modeled the manufacturing process of building products to estimate and control quality-induced rework cost [3]. Lesniak and Juszczyk established a regression model based on an artificial neural network to estimate field management cost quickly and reliably [4]. Bhargava et al. introduced a risk-based polynomial model and Monte Carlo simulation to predict whether a project will follow a specific cost increase path during its development phase and reach a given level of cost deviation severity [5]. Lesniak and Zima proposed a case-based reasoning (CBR) method for estimating the construction cost of sports venues, using historical data and sustainable development criteria to estimate the initial cost of construction projects [6]. Juszczyk et al. put forward a neural network method for predicting the construction cost of stadiums and evaluated its prediction quality and accuracy [7].

The support vector machine (SVM), proposed by Vapnik [8], is an effective method based on statistical learning theory. Rather than minimizing the training error under the empirical risk minimization principle, the algorithm follows the structural risk minimization principle and minimizes an upper bound on the generalization error, so that the global optimal solution can be obtained in theory [9, 10]. The least squares support vector machine (LSSVM) replaces the inequality constraints with equality constraints, which reduces computational complexity and improves efficiency [11, 12] and gives it clear advantages in multifactor prediction. Liang et al. proposed a hybrid model based on the wavelet transform (WT) and LSSVM and optimized it with an improved cuckoo search (CS) to achieve accurate load forecasting [13]. Kang et al. proposed a hybrid of ensemble empirical mode decomposition (EEMD) and LSSVM to improve the accuracy of short-term wind speed prediction [14].

Research on engineering cost prediction methods has developed rapidly, and many scholars have studied the prediction of electric power engineering cost in depth. Kong et al. established a cost prediction model for transmission and transformation projects in which an SVM is used to solve the regression equation and the resulting model predicts the cost [15]. Wang took the comprehensive cost index of transmission and transformation as the basis of project investment decision-making and explored a Markov chain model for predicting project investment cost [16]. Lu et al. took full account of the characteristics of each cost subitem, forecast the subitems separately with different methods, and summed the results to obtain the total cost [17]. Wang et al. established the EEMD-BP model, used the BP algorithm to forecast the trend components, and obtained the final prediction by combining the predicted trend components with the fluctuation intervals [18]. Yi et al. evaluated the global sensitivity of the input variables and proposed a neural network method for predicting transmission line project cost based on a feedforward backpropagation multilayer perceptron structure [19]. Wang et al. established the REGR-WNN prediction model, compared it with the separate REGR and WNN models, and found that the combined model had higher prediction accuracy [20].

Some scholars have studied interval prediction. Grounds et al. argued that prediction intervals, which give a range of values with a specified probability, can improve decision-making compared with point prediction, so interval prediction may bring important benefits [21]. Samal and Tripathy used a nonparametric kernel density method to describe wind speed measurements that did not fit parametric distributions and used the chi-square test and the Kolmogorov–Smirnov goodness-of-fit test to evaluate its applicability [22]. When predicting the real-time thermal rating (RTTR) of overhead lines, Fan et al. estimated the instantaneous RTTR and then used kernel density estimation to describe the maximum allowable temperature over a period of time [23]. Amara et al. discussed the influence of temperature on the nonlinear behavior of power demand and proposed an adaptive conditional density estimation (ACDE) method based on kernel density estimation (KDE) to improve the accuracy of load forecasting [24].

#### 2. Research Method

##### 2.1. Principal Component Analysis (PCA)

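Principal component analysis (PCA) transforms a set of possibly correlated variables into a set of linearly uncorrelated variables by an orthogonal transformation. The transformed variables are called principal components, and they eliminate the multicollinearity in the data. PCA can be divided into the following steps:

(1) The original index data form a *p*-dimensional random vector $x = (x_1, x_2, \ldots, x_p)^T$. The *n* samples $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, $i = 1, \ldots, n$, are standardized:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p,$$

where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$ and $s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$, and the normalized matrix *Z* is obtained.

(2) The sample correlation coefficient matrix is calculated from the normalized matrix:

$$R = [r_{ij}]_{p \times p} = \frac{Z^T Z}{n - 1}.$$

(3) The characteristic equation $|R - \lambda I_p| = 0$ of *R* is solved to obtain *p* characteristic roots $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. The number of principal components *m* is then chosen so that the cumulative contribution rate of the principal components exceeds 85%:

$$\frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{p} \lambda_j} \ge 0.85.$$

For each $\lambda_j$, $j = 1, \ldots, m$, the unit eigenvector $b_j$ is obtained by solving the system of equations $R b = \lambda_j b$.

(4) The standardized index variables are transformed into the principal components:

$$U_{ij} = z_i^T b_j, \quad j = 1, \ldots, m,$$

where $z_i$ is the *i*-th row of *Z* written as a column vector.

(5) PCA thus concentrates the *p* observed variables into *m* new, mutually uncorrelated variables, as follows.

Before PCA, each sample is described by the *p* original indexes $(x_{i1}, x_{i2}, \ldots, x_{ip})$.

After PCA, each sample is described by the *m* principal components $(U_{i1}, U_{i2}, \ldots, U_{im})$, with $m \le p$.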
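The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the SPSS procedure used in this paper: the function name `pca_by_correlation`, the synthetic data, and the use of `numpy.linalg.eigh` are assumptions made for the example.

```python
import numpy as np

def pca_by_correlation(X, threshold=0.85):
    """Return (scores, eigenvalues, m) for an n x p data matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Step 1: standardize each column with the sample standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: sample correlation matrix R = Z^T Z / (n - 1).
    R = Z.T @ Z / (n - 1)
    # Step 3: eigen-decomposition of the symmetric matrix R.
    # eigh returns eigenvalues in ascending order, so sort them descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Smallest m whose cumulative contribution rate reaches the threshold.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    m = min(int(np.searchsorted(cumulative, threshold)) + 1, p)
    # Step 4: principal component scores U = Z B_m.
    scores = Z @ eigvecs[:, :m]
    return scores, eigvals, m

# Hypothetical data: 4 indexes, where two are near-copies of the other two.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.01 * rng.normal(size=50),
    base[:, 1], base[:, 1] + 0.01 * rng.normal(size=50),
])
scores, eigvals, m = pca_by_correlation(X)
print(m, scores.shape)
```

Because the four synthetic indexes carry only two independent sources of variation, two components are enough to pass the 85% threshold, and the resulting score columns are uncorrelated with each other.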

##### 2.2. Least Squares Support Vector Machine (LSSVM)

Although the traditional support vector machine (SVM) avoids becoming trapped in a local optimum, it solves a quadratic programming problem, so its running time grows considerably when the data set is large. Suykens therefore proposed the least squares support vector machine (LSSVM) as an improvement on the SVM [25]. LSSVM solves a system of linear equations rather than a quadratic programming problem, which simplifies the calculation and reduces the solving time effectively. As a result, it has been applied more and more widely. The LSSVM regression algorithm is described as follows:

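Consider the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ represents the *m*-dimensional input vector and $y_i \in \mathbb{R}$ represents the output corresponding to $x_i$. The regression model is constructed from the nonlinear mapping function $\varphi(\cdot)$ as follows:

$$f(x) = w^T \varphi(x) + b,$$

where $w$ represents the weight vector and $b$ represents the offset. Taking both the model complexity and the fitting error into account, the objective function of the LSSVM algorithm is as follows:

$$\min_{w, b, \xi} J(w, \xi) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2.$$

The constraint condition is

$$y_i = w^T \varphi(x_i) + b + \xi_i, \quad i = 1, \ldots, n,$$

where $\xi_i$ represents the regression error and $\gamma$ represents the penalty coefficient. The function of $\gamma$ is to adjust the error: the larger the value of $\gamma$, the smaller the corresponding regression error will be. To optimize the model parameters, the Lagrange function is

$$L(w, b, \xi, \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2 - \sum_{i=1}^{n} \alpha_i \left[ w^T \varphi(x_i) + b + \xi_i - y_i \right],$$

where $\alpha_i$ denotes the Lagrange multiplier. According to the KKT conditions, the following results are obtained:

$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n} \alpha_i \varphi(x_i), \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i = 0,$$

$$\frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \alpha_i = \gamma \xi_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow w^T \varphi(x_i) + b + \xi_i - y_i = 0.$$

After eliminating $w$ and $\xi_i$, the final system of linear equations is obtained from the Mercer condition as follows:

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix},$$

where $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$, $\mathbf{1} = (1, \ldots, 1)^T$, $\alpha = (\alpha_1, \ldots, \alpha_n)^T$, $y = (y_1, \ldots, y_n)^T$, and $I$ is the $n \times n$ identity matrix. The LSSVM regression function is obtained by solving this system:

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i) + b.$$

In this paper, the radial basis function is chosen as the kernel function:

$$K(x, x_i) = \exp\left( -\frac{\| x - x_i \|^2}{2 \sigma^2} \right).$$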
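To make the linear-system formulation concrete, the following NumPy sketch fits an LSSVM regressor with an RBF kernel by solving the $(n+1) \times (n+1)$ system directly. It is an illustration under assumptions rather than the implementation used in this paper: the function names, the toy sine data, and the values `gamma=100.0` and `sigma2=0.5` are all hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    # K(a, b) = exp(-||a - b||^2 / (2 * sigma2)) for every pair of rows.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def lssvm_fit(X, y, gamma, sigma2):
    """Solve [[0, 1^T], [1, Omega + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # offset b, multipliers alpha

def lssvm_predict(X_train, b, alpha, sigma2, X_new):
    # f(x) = sum_i alpha_i K(x, x_i) + b
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# Toy data (hypothetical): one input feature, a smooth target.
gamma, sigma2 = 100.0, 0.5
X_train = np.linspace(0.0, 2.0 * np.pi, 30).reshape(-1, 1)
y_train = np.sin(X_train).ravel()
b, alpha = lssvm_fit(X_train, y_train, gamma, sigma2)
y_fit = lssvm_predict(X_train, b, alpha, sigma2, X_train)
print(float(np.abs(y_fit - y_train).max()))
```

The solution satisfies the KKT relations directly: the multipliers sum to zero, and each training residual equals $\alpha_i / \gamma$, which is why a larger penalty coefficient gives a smaller training error.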

##### 2.3. Interval Prediction Theory Based on Kernel Density Estimation (KDE)

In point prediction of transmission line project cost, the predicted value $\hat{y}$ is only an approximation of the real value $y$, and the probability that the two are exactly equal is very small. To ensure the reliability of investment decision-making, it is necessary to estimate a range around the predicted value and the credibility (confidence level) with which that range covers the real value. Such a range is generally expressed as an interval, known as a confidence interval. When the number of samples and the confidence level remain unchanged, a shorter confidence interval means a more precise interval estimate. Confidence intervals are either two-sided, $[\theta_L, \theta_U]$, or one-sided, $(-\infty, \theta_U]$ and $[\theta_L, +\infty)$. The confidence interval calculated in this paper is two-sided and is defined as follows.

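Let $X_1, X_2, \ldots, X_n$ be a sample of the population $X$. The random interval $[\theta_L, \theta_U]$ formed by the statistics $\theta_L$ and $\theta_U$ is called an interval estimate of $\theta$. For a given confidence level $1 - \alpha$, it satisfies the following equation:

$$P(\theta_L \le \theta \le \theta_U) = 1 - \alpha,$$

where $\theta_L$ and $\theta_U$ are the lower and upper confidence limits of the error values, respectively. The interval $[\theta_L, \theta_U]$ is called the confidence interval at the confidence level $1 - \alpha$. This paper uses the equal-tail confidence interval:

$$P(\theta < \theta_L) = P(\theta > \theta_U) = \frac{\alpha}{2}.$$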

The probability density estimation problem is to estimate the probability density function of a population from samples. The usual approaches are parametric, semiparametric, and nonparametric estimation. Nonparametric density estimation uses only the sample data as the basis for estimating the probability density function. It needs no prior assumption about the form of the sample distribution and can handle arbitrary density distributions. Commonly used nonparametric methods include the histogram density estimation method and the kernel density estimation method. Although the histogram method is conceptually simple and easy to use, its result is discontinuous; that is, the density estimate drops to 0 suddenly at the boundary of a region, and its efficiency is low. Kernel density estimation (KDE), proposed by Parzen and also known as Parzen window estimation, is a very effective nonparametric density estimation method [26]. Its general expression is as follows:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right),$$

where $n$ is the total number of samples, $h$ is the bandwidth or smoothing parameter, $X_i$ is the given sample, and $K(\cdot)$ is the kernel function, satisfying the following conditions:

$$K(u) \ge 0, \qquad \int_{-\infty}^{+\infty} K(u)\, du = 1.$$

KDE can be regarded as a superposition of kernel functions centered on each observed sample point. Its performance depends on the choice of the kernel function and the window width (bandwidth). If the window width is too large, some features of the distribution are concealed, and the excessive averaging gives the estimator a large bias; if the window width is too small, the whole estimate, especially the tail, fluctuates strongly, which increases the variance.
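The effect of the window width can be seen in a small standard-library sketch of a Gaussian-kernel density estimate. The sample values and the two bandwidths are hypothetical and chosen only to make the contrast visible.

```python
import math

def gaussian_kernel(u):
    # Standard normal density: nonnegative and integrates to 1.
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde_pdf(x, samples, h):
    # f_hat(x) = 1 / (n h) * sum_i K((x - X_i) / h)
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

# Hypothetical relative errors of a point prediction model.
errors = [-0.12, -0.05, -0.02, 0.0, 0.01, 0.03, 0.04, 0.09, 0.15]

# Density in the gap between the two leftmost samples (-0.12 and -0.05).
x_gap = -0.085
narrow = kde_pdf(x_gap, errors, h=0.005)  # too small: density collapses between samples
wide = kde_pdf(x_gap, errors, h=0.05)     # larger: gap is smoothed over

# Riemann sum of the estimated density over [-1, 1].
total = sum(kde_pdf(-1.0 + k * 0.001, errors, 0.05) for k in range(2001)) * 0.001
print(narrow, wide, total)
```

With the very small bandwidth the estimate is essentially zero between neighboring samples, so it is dominated by individual points, while the larger bandwidth yields a smooth curve. In both cases the estimate remains a valid density.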

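This paper uses the relative error to define the deviation between the predicted value and the actual value of construction cost:

$$e = \frac{\hat{y} - y}{y} \times 100\%,$$

where $\hat{y}$ is the predicted value and $y$ is the actual value.

The error probability density function $\hat{f}(e)$ can be obtained by nonparametric kernel density estimation, and the cumulative probability distribution function of the relative error (treated as a random variable) can then be obtained by integration as follows:

$$F(e) = \int_{-\infty}^{e} \hat{f}(t)\, dt.$$

According to the cumulative probability distribution function of the error and the point prediction value of the sample, the confidence interval with confidence level $1 - \alpha$ can be obtained as follows:

$$\left[ \frac{\hat{y}}{1 + e_U}, \; \frac{\hat{y}}{1 + e_L} \right],$$

where $e_L = F^{-1}(\alpha / 2)$, $e_U = F^{-1}(1 - \alpha / 2)$, and $F^{-1}$ is the inverse function of $F$.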
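A minimal sketch of this interval construction follows. It rests on several assumptions that are not taken from the paper: the relative error is taken as $e = (\hat{y} - y)/y$, so that $y = \hat{y}/(1 + e)$; the kernel is Gaussian, so the KDE cumulative distribution is an average of normal distribution functions; and the quantiles are found by bisection instead of the cubic spline interpolation used in Section 4.3. The error values, the bandwidth, and the predicted cost of 100 are hypothetical.

```python
import math

def kde_cdf(e, errors, h):
    # CDF of a Gaussian-kernel KDE: average of normal CDFs centered on each sample.
    return sum(0.5 * (1.0 + math.erf((e - ei) / (h * math.sqrt(2.0))))
               for ei in errors) / len(errors)

def kde_quantile(p, errors, h, tol=1e-10):
    # Invert the monotone CDF by bisection on a bracket that contains all the mass.
    lo = min(errors) - 10.0 * h
    hi = max(errors) + 10.0 * h
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, errors, h) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cost_interval(y_hat, errors, h, confidence):
    # Equal-tail interval: alpha / 2 of the error mass in each tail.
    alpha = 1.0 - confidence
    e_low = kde_quantile(alpha / 2.0, errors, h)
    e_up = kde_quantile(1.0 - alpha / 2.0, errors, h)
    # Assumed error definition e = (y_hat - y) / y, hence y = y_hat / (1 + e).
    return y_hat / (1.0 + e_up), y_hat / (1.0 + e_low)

# Hypothetical relative errors from a validation set.
errors = [-0.12, -0.08, -0.05, -0.03, -0.01, 0.0, 0.02, 0.04, 0.06, 0.09, 0.13]
h = 0.03
low85, up85 = cost_interval(100.0, errors, h, 0.85)
low95, up95 = cost_interval(100.0, errors, h, 0.95)
print((low85, up85), (low95, up95))
```

Raising the confidence level moves both error quantiles outward, so the 95% interval contains the 85% interval.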

The main research ideas and methods of this paper are shown in Figure 1. Firstly, this paper considers many impact indexes of transmission line project cost and extracts and screens the indexes. Secondly, PCA is used to reduce the dimension of the original index, and the corresponding principal component is used as the input variable of the prediction model. Then, a point prediction model for the transmission line project is constructed based on LSSVM. Finally, the probability density function of the prediction model error is obtained by KDE, and then the cost interval at a certain confidence level is obtained.

#### 3. Selection of Cost Indexes and Data Source

##### 3.1. Analysis of Influencing Factors of Cost and Screening of Indexes

The purpose of this paper is to study the interval prediction method of transmission line project cost and to guide the investment decision of transmission line project. The engineering characteristics of transmission line projects play a decisive role in project cost. Therefore, in order to predict the transmission line cost, it is necessary to comprehensively analyze the factors affecting the cost, and select the engineering characteristics that have a greater impact on the transmission line cost as the indexes affecting the transmission line cost, combining with the practical experience of transmission line project.

Transmission line project can be divided into six major units, namely, foundation project, tower project, erecting engineering, grounding project, annex project, and auxiliary project. Therefore, when identifying the influencing indexes of transmission line project cost, this paper firstly analyzes the six major units of the project and identifies their main influencing indexes. For the cost of each unit project, combined with the engineering practice experience, the engineering characteristics related to the site where the project is located, large quantities of works and high price of materials and other indexes which have a greater impact on the cost are selected as the analysis object. After classifying and summarizing, the influencing indexes of the overall cost are obtained, and the specific identification process is shown in Figure 2.

In Figure 2, some indexes have an impact on different unit project cost, so the cost impact indexes in Figure 2 are summarized and divided into natural indexes, technical indexes, and economic indexes as shown in Table 1.

##### 3.2. Sample Data Source

This paper collects the actual data of transmission line project for model empirical analysis. According to the settlement data of transmission line projects completed and put into operation in Xinjiang by State Grid Corporation in 2017, 140 representative projects under 110 kV voltage level are selected. The original data samples in this paper are obtained through index processing [27], as shown in Table 2.

#### 4. Construction of Cost Interval and Empirical Calculation

##### 4.1. Dimension Reduction of Cost Impact Indexes

Multicollinearity may exist among the variables, so this paper applies principal component analysis to the sample data to remove the influence of correlation among variables and to reduce the number of variables at the same time. Before principal component analysis, the KMO test and Bartlett's test of sphericity were used to examine the correlation between variables. The KMO value of the 13 indexes is 0.679, and the significance level of Bartlett's test of sphericity is far less than 0.01. The test results show that principal component analysis is feasible.
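For readers who want to reproduce these two checks outside SPSS, the sketch below computes the overall KMO measure and Bartlett's sphericity statistic from a data matrix using their standard textbook definitions. The function name and the synthetic one-factor data are assumptions made for illustration. The *p*-value is not computed; the statistic can be compared with a chi-square critical value for the returned degrees of freedom.

```python
import numpy as np

def kmo_and_bartlett(X):
    """Return (KMO, Bartlett chi-square statistic, degrees of freedom)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Partial correlations from the inverse correlation matrix:
    # a_ij = -S_ij / sqrt(S_ii * S_jj), with S = R^{-1}.
    S = np.linalg.inv(R)
    d = np.sqrt(np.diag(S))
    partial = -S / np.outer(d, d)
    off_diagonal = ~np.eye(p, dtype=bool)
    r2 = (R[off_diagonal] ** 2).sum()
    a2 = (partial[off_diagonal] ** 2).sum()
    kmo = r2 / (r2 + a2)
    # Bartlett: chi2 = -(n - 1 - (2p + 5) / 6) * ln|R|, df = p (p - 1) / 2.
    _, logdet = np.linalg.slogdet(R)
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return kmo, chi2, dof

# Synthetic data (hypothetical): 4 indexes driven by one common factor.
rng = np.random.default_rng(1)
factor = rng.normal(size=(200, 1))
X = factor + 0.5 * rng.normal(size=(200, 4))
kmo, chi2, dof = kmo_and_bartlett(X)
print(kmo, chi2, dof)
```

A KMO value well above 0.5 and a Bartlett statistic far beyond the critical value both indicate that the indexes are correlated enough for PCA to be worthwhile.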

After principal component analysis, aggregated indexes that can express most of the information instead of the original indicators are selected. These aggregated indexes have almost the same function as the original indexes and can meet the needs of analysis. In addition, the selection of these aggregated indexes reduces the number of input indexes of the prediction model, reduces the complexity of the prediction model, and improves the running speed of the model. This paper argues that the aggregated indexes with cumulative variance over 85% can express the vast majority of the overall information. After SPSS dimensionality reduction, Table 3 gives the percentage and cumulative percentage of the total information explained by the corresponding principal components. It can be concluded that 87.699% of the total information can be explained by the first seven principal components, which is over 85%. Therefore, it is considered that extracting the seven principal components can better explain the information contained in the original variables.

In order to establish the expression between the principal component and 13 cost impact indexes, the component score coefficient matrix is obtained by SPSS calculation as shown in Table 4.

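According to this matrix, the expression between the seven principal components and the cost impact indexes can be obtained. Taking the first principal component as an example,

$$F_1 = \sum_{j=1}^{13} a_{1j} Z_j,$$

where $a_{1j}$ is the score coefficient of the first principal component for the *j*-th index in Table 4 and $Z_j$ is the standardized value of that index.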

##### 4.2. Establishment and Training of LSSVM Point Prediction Model

The penalty coefficient of the model parameter is set as , and the kernel function parameter is set as to establish the LSSVM cost point prediction model. The input set of the point prediction model is composed of seven principal components which are processed by PCA, and the unit length cost is selected as the output variable. The LSSVM point prediction model is trained with the first 70 samples, and the latter 70 samples are used to verify the point prediction model, and the corresponding prediction values are obtained. The real values are sorted from small to large, and the predicted values and real values, as well as the corresponding relative errors, are obtained as shown in Figure 3.

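The point prediction model is the basis of the construction cost interval. This paper uses the mean absolute percentage error (MAPE), root-mean-square error (RMSE), mean prediction error (MRE), and determination coefficient (*R*^{2}) to evaluate the point prediction model. The calculation formulas of each index are as follows:

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2},$$

$$\mathrm{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{y_{\max}} \times 100\%, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $\hat{y}_i$ represents the predicted value of project cost, $y_i$ represents the actual value of project cost, $n$ represents the number of samples, and $y_{\max}$ and $\bar{y}$ represent the maximum and average of the actual project cost values. The MAPE, RMSE, MRE, and *R*^{2} values of the LSSVM model are 8.59%, 4.99, 4.94%, and 88.67%, respectively. It can be concluded that the LSSVM point prediction model has high prediction accuracy and can be used to establish the cost interval.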
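The four indexes can be computed with a short helper such as the one below. It is a sketch on made-up numbers; in particular, MRE is assumed here to be the mean absolute error normalized by the maximum actual value, which is an interpretation suggested by the mention of the maximum actual value in the text and not a confirmed definition.

```python
import math

def evaluate(y_true, y_pred):
    n = len(y_true)
    abs_err = [abs(p - t) for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in abs_err)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        # Mean absolute percentage error, as a fraction.
        "MAPE": sum(e / t for e, t in zip(abs_err, y_true)) / n,
        # Root-mean-square error, in the units of the cost itself.
        "RMSE": math.sqrt(ss_res / n),
        # Assumed definition: mean absolute error relative to the maximum actual value.
        "MRE": sum(abs_err) / (n * max(y_true)),
        # Coefficient of determination.
        "R2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical actual and predicted unit-length costs.
y_true = [50.0, 60.0, 80.0, 100.0]
y_pred = [55.0, 57.0, 84.0, 95.0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Because MRE divides by the largest actual value while MAPE divides by each actual value, MRE is never larger than MAPE when all costs are positive.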

##### 4.3. Establishment of Cost Interval

On the basis of point prediction, the probability density of the prediction error is obtained by nonparametric kernel density estimation from the relative error of the point prediction. The cumulative probability distribution curve of the prediction error is then fitted by cubic spline interpolation, and the $\alpha/2$ and $1 - \alpha/2$ quantiles are found. In this paper, the optimal window width is given automatically by MATLAB, and the kernel function is the Gaussian kernel function:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right).$$

The estimated error probability curve is shown in Figure 4.

Figure 4 shows that the kernel density curve agrees well with the frequency histogram and retains its internal features, and in the tail it agrees well with the normal distribution, with little interference. Cubic spline interpolation is used to fit the probability distribution of the relative prediction error, and the corresponding $\alpha/2$ and $1 - \alpha/2$ quantiles are found. The confidence intervals of the predicted values are calculated at confidence levels of 95%, 90%, 85%, and 80%, as shown in Table 5 and Figure 5.

**Figure 5**, panels (a) to (d): confidence intervals of the predicted values at the four confidence levels.

This paper uses the prediction interval coverage probability (PICP) to evaluate the reliability of the prediction interval. The formula is as follows:

$$\mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} c_i,$$

where $N$ is the total number of samples and $c_i$ is a Boolean quantity. If the actual value falls within the prediction interval, $c_i = 1$; otherwise, $c_i = 0$. When PICP = 1, all the actual values fall within the predicted interval, and the reliability is the highest. If reliability alone were pursued, the endpoints of the prediction interval could simply be pushed out to the boundary values, but the prediction interval obtained in this way would lose its practical significance. Therefore, to obtain an effective prediction interval, PICP should be as close as possible to the preset confidence level. If PICP is far below the confidence level, the predicted interval is invalid and needs to be reconstructed. PICP at each confidence level is shown in Table 6.
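The coverage calculation is simple enough to state directly in code. The following sketch uses hypothetical values; the function name `picp` and the sample numbers are not from the paper.

```python
def picp(y_true, lower, upper):
    # c_i = 1 when the actual value lies inside [lower_i, upper_i], else 0.
    hits = [1 if lo <= y <= up else 0 for y, lo, up in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)

# Hypothetical actual costs and interval bounds for four projects.
y_true = [10.0, 12.0, 15.0, 20.0]
lower = [9.0, 12.5, 14.0, 18.0]
upper = [11.0, 13.0, 16.0, 19.0]
coverage = picp(y_true, lower, upper)
print(coverage)
```

For a validation set of 70 samples, a coverage of 88.57% corresponds to 62 actual values falling inside their intervals, since 62/70 ≈ 0.8857.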

Table 6 shows that the PICP at the 85% confidence level is the closest to its nominal confidence level. The interval prediction model is therefore highly reliable at the 85% confidence level: it maintains prediction accuracy without the interval becoming so wide that it loses practical significance. The cost interval at the 85% confidence level should therefore be selected.

#### 5. Conclusion

The investment planning of transmission line project is of great significance to improve the investment benefit of the power grid project. Strengthening the evaluation of investment planning is an important means to determine investment rationally. Therefore, it is necessary to change the cost prediction from point prediction to interval prediction in order to improve the compatibility and reliability of investment planning evaluation. This paper proposes a reasonable cost-level prediction model based on PCA, LSSVM, and KDE. The PCA is used to screen and reduce the dimension of transmission line project cost data, and the principal component which can basically describe the factors affecting the cost is obtained as the input set of the prediction model. Using the theoretically mature LSSVM point prediction model, the nonlinear mapping between transmission line engineering characteristics and the cost is determined, and the model is trained and predicted. The error of the point prediction model is analyzed, and the probability density function of error is estimated by the KDE method. The corresponding cost interval is obtained according to different confidence levels. Finally, the accuracy and reliability of the interval prediction model are verified by calculating the PICP of different confidence level cost intervals. The results show that the PICP of the cost interval prediction model based on PCA-LSSVM-KDE is 88.57% at 85% confidence level, which not only guarantees sufficient accuracy but also has strong practical significance.

In summary, the reasonable cost-level prediction model based on PCA-LSSVM-KDE proposed in this paper offers good compatibility and reliability for transmission line project cost prediction. The model has strong practical significance and can be applied effectively in the investment planning and evaluation of transmission line projects.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This study was supported by the National Natural Science Foundation of China (NSFC) (71501071), Beijing Social Science Fund (16YJC064), and Fundamental Research Funds for the Central Universities (2017MS059 and 2018ZD14).