Abstract

With the advent of the big data era, computing has spread to all walks of life, and the finance and taxation sector is no exception. The current taxation system is large and complex, and different tax types are linked to different economic indicators at a deep level, so tax forecasting requires a separate forecasting analysis for each tax type. This paper selects three tax types that account for a large proportion of tax revenue, namely business tax, corporate income tax, and personal income tax, and studies their prediction using a multi-source big data fusion method. Experiments are conducted with the taxation data of Beijing from 1995 to 2020, and the three tax types are predicted for 2018 to 2020. The results show that the forecasts deviate from the real tax data by less than 14%, indicating the effectiveness of the proposed method.

1. Introduction

Tax forecasting is the first step in budgeting not only for the Chinese government but also for other governments around the world and is an important tool for government financial management. A large number of relevant studies have been conducted in developed countries around the world, and a relatively mature tax forecasting system has been developed. The tax risk management system established by the Organisation for Economic Cooperation and Development [1] is to analyse the many factors affecting taxation and the impact of each factor on taxation, so as to best manage taxation in advance. The core idea of risk management is also the forecasting of taxes, predicting future taxes through current factors.

The US tax system has undergone hundreds of years of modification and refinement and is now very well established [2]. The US is also the first country to practice tax revenue forecasting. Tax forecasting has become a prerequisite for the US Congress to discuss changes in tax legislation and fiscal spending arrangements and plays a key role in ensuring the stability of government tax and spending policies [3]. Some areas of the US use time series to forecast tax revenues and provide a basis for changes to the fiscal system. For example, New Jersey’s tax budget is based on the results of seasonally adjusted time series analysis, using seasonal exponential smoothing and seasonally adjusted ARIMA models [4, 5]. The application of such models removes the non-stationarity of the time series and provides a relatively high degree of predictive accuracy. Other models, such as neural networks and support vector machines, are also used for prediction, but there are relatively few practical applications of these algorithms.

The Government of Canada has developed a large set of macroeconomic analysis models using estimated parameters and predictive analysis methods. The models contain hundreds of equations and the parameters are estimated using OLS estimation [6] and LS estimation [7] error correction models, which consist mainly of economic models and fiscal models. The tax forecasting model allows for forecasting and analysis of labour tax [8], corporate income tax, and personal income tax. The Spanish National Tax Administration has developed a forecasting function between taxes and GDP, based on macroeconomics [9]. By inputting a variety of economic indicators into the forecasting model, the finance department can obtain tax forecast data, while using tax return data, actual tax revenue data, and net income from taxes for tax forecasting and in-depth analysis of the economic factors affecting tax revenue [10].

The Italian tax authorities [11] have developed more than 500 macroeconomic econometric forecasting models. The economic indicators and forecasting models can be selected according to the type of tax, such as the relationship between operating surplus and corporate income tax and the relationship between employee income and personal income tax. The models take into account not only conventional economic factors but also unexpected factors such as policy effects and natural disasters, which are adjusted to eliminate such effects when making forecasts [12]. The model uses an adaptive dynamic model, which takes into account new influences during the forecasting process and keeps the model in a constant state of transformation to obtain the most accurate forecasts.

Sun fits a linear model to multi-source fusion data [13]; tests on a small-sample medical dataset show that the method improves the accuracy of the coefficient estimates of the linear regression model, and a test for fungibility is also proposed. Because datasets from different sources suffer from distribution bias and non-uniform measurement standards, it is not possible to integrate all the data directly [14]. A number of algorithms have been proposed in existing research to address distribution bias, but in the adaptive literature, the focus has been more on using specific algorithms to address discrepancies between datasets from different sources.

Some scholars have used graphical model methods to propose hypothesis tests for whether distribution bias exists between different datasets, and hence whether such data can be fused; they have also considered the classification of fused datasets using support vector machines on Alzheimer’s disease datasets [15]. In recent years, with the continuous improvement of data collection techniques, research on large-scale multi-source data fusion has become a major hotspot. When each data source stores data with a large sample size, processing data from all sensors incurs high computational and storage costs. In this paper, we propose a subsampling approach to the problem of fusing large samples of multi-source data, in which only a highly representative portion of the data is extracted instead of the large-scale full sample. Zheng [16] presented the statistical theory of leverage score importance sampling and illustrated the feasibility and effectiveness of the method through theoretical analysis and extensive random simulation studies.

China’s research on tax revenue forecasting began in the 1980s and can be divided into the following three stages. The first stage is to count the data and present them using some simple charts and graphs [17, 18], focusing on the operational analysis of the correlation between tax revenue and various economic indicators. In the second stage, theoretical methods of econometrics were introduced, and with the help of scientific theories, some simple modelling algorithms were used one after another. Regression analysis began to be applied to the study of tax revenue forecasting. In the third stage, from the late 1990s onwards, more and more analytical models were applied to fiscal and tax revenue forecasting [19], and more attention was paid to the research of model algorithms. The forecasting accuracy of the models became higher and higher, and the forecasting results were correspondingly more scientific and accurate.

Taxes are the main source of fiscal revenue in China, and each year the government formulates a tax plan based on the economic development of the previous year [20]. Tax forecasts are the basis on which the government prepares the next fiscal budget. In order to develop a scientific tax plan, local governments have started to actively work on tax forecasting. The economic indicators chosen in this paper include GDP, added value of tertiary industry, fixed asset investment, industrial input, investment in real estate development, total import and export, industrial added value, industrial electricity consumption, total retail sales of social consumer goods, sales area of real estate development, and total consumer price index. The economic indicators related to each tax type differ, and they need to be analysed operationally and verified using technical tools. This paper selects three taxes that account for a large share of tax revenue for predictive analysis, namely, business tax, corporate income tax, and personal income tax, and uses a multi-source big data fusion approach as the basis for the predictive study.

2. Requirement Analysis

2.1. Business Requirements

At the national level, fiscal revenue has an important impact on the operation of the national economy and the stable development of society. The first step in monitoring fiscal revenue is to be clear about its sources so that it can be monitored at source. Previously, forecasts of fiscal revenue were mainly judged by experience, but China’s economic development fluctuates and is guided by new policies every year [21]. Taxes are generated in economic development and are inevitably linked to economic indicators, so once the relationship between taxes and economic indicators is clarified, tax forecasts can be made from those indicators.

Taxes that account for a large proportion of tax revenue in the tax system include business tax, corporate income tax, and personal income tax [22]. As can be seen from the chart analysing the proportion of taxes in local taxation in Beijing in 2020, business tax accounts for 31%, while corporate income tax and personal income tax account for 10% and 28%, respectively. These three major taxes account for close to 70% of the revenue, so it is important to make forecasts for them. The specific distribution of each tax category in 2020 for the Beijing Local Taxation Bureau is given in Table 1.

The share of these different taxes in the total value of all taxes is shown in Figure 1.

Different taxes require separate forecasting [23]. There are two starting points for forecasting taxes. The first is to explore the laws governing the behaviour of the tax itself, which is valid only if the current environment is relatively stable, so that the tax continues to follow those laws. In fact, China’s economic policies are adjusted every year, and although the overall economic operating environment is relatively stable, local adjustments are made. This paper forecasts business tax, corporate income tax, and personal income tax separately based on the fusion of multiple sources of big data. Because every forecasting method has its own advantages and disadvantages, a comprehensive forecasting analysis using multiple forecasting models makes it possible to select the models with relatively high forecasting accuracy. In order to obtain accurate forecasting results, the data collection environment is particularly important, and the datasets themselves are obtained from government websites such as those of the Beijing Municipal Bureau of Statistics and the National Bureau of Statistics.

2.2. Functional Requirements

On the basis of meeting the business requirements, the application platform in this paper includes two major functional modules, namely, regression model management and prediction data management. Regression model management covers the use of trained models for tax forecasting; i.e., each output model, such as a linear regression, can be applied separately to make forecasts. Prediction data management handles the tax data predicted by the regression model: it calculates the error between the real data and the predicted data and removes the tax data with large prediction errors, so that the regression model can forecast those data again.
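The error check performed by the prediction data module can be sketched as follows. This is a minimal illustration rather than the platform's implementation: the function names, the record layout, and the 0.14 tolerance (borrowed from the 14% bound quoted in the abstract) are assumptions, and the figures are made up.

```python
def relative_error(actual, predicted):
    """Absolute relative deviation of a forecast from the actual value."""
    return abs(predicted - actual) / abs(actual)


def flag_large_errors(records, threshold=0.14):
    """Split (year, actual, predicted) records into kept and flagged lists.

    `threshold` is an assumed tolerance, not a setting taken from the paper.
    """
    kept, flagged = [], []
    for year, actual, predicted in records:
        err = relative_error(actual, predicted)
        target = flagged if err > threshold else kept
        target.append((year, round(err, 4)))
    return kept, flagged


# Made-up figures for illustration only (not the paper's data).
records = [(2018, 100.0, 95.0), (2019, 110.0, 130.0), (2020, 120.0, 118.0)]
kept, flagged = flag_large_errors(records)
print("kept:", kept)
print("flagged for re-forecasting:", flagged)
```

Records that end up in `flagged` are the ones that would be handed back to the regression model for re-forecasting.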

2.3. Understanding of Requirements
2.3.1. Business Tax

Analysing the relationship between business tax and the related economic indicators supports accurate prediction and analysis of business tax. Once the relevant economic indicators are identified, annual forecasting analysis is carried out with the multi-source data fusion method, and the forecasting results are evaluated to determine the most suitable model, so that the trained model can be used for the next stage of tax forecasting.

The tax subjects of business tax [24] are units and individuals engaged in the provision of taxable services, the transfer of intangible assets, and the sale of immovable property. The tax is based on turnover, transfer amount, and sales, and the scope of the levy covers the construction industry and the tertiary industry. Based on business experience, the following economic indicators are tentatively determined: social fixed asset investment, GDP, total retail sales of social consumer goods, and sales area of commercial properties. First, the correlation between these economic indicators and business tax is examined from the data, and the collinearity between the indicators is analysed, in order to select the indicators that have a more obvious impact on business tax.
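The screening step described above, correlation with the tax followed by a collinearity check among the indicators, can be sketched as follows. The source does not say which collinearity diagnostic is used; the variance inflation factor (VIF) below is one common choice and is an assumption here, as are the function names and the synthetic data.

```python
import numpy as np


def pearson_with_target(X, y):
    """Pearson correlation of each column of X with the target y."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic stand-ins for three indicators and a tax series (not the paper's data).
rng = np.random.default_rng(0)
n = 23  # one row per year, e.g. 1995-2017
gdp = np.linspace(1.0, 10.0, n) + rng.normal(0, 0.1, n)
retail = 0.5 * gdp + rng.normal(0, 0.05, n)  # nearly collinear with gdp
area = rng.normal(5.0, 1.0, n)               # unrelated indicator
X = np.column_stack([gdp, retail, area])
tax = 2.0 * gdp + rng.normal(0, 0.2, n)

corr = pearson_with_target(X, tax)
vif = variance_inflation_factors(X)
print("correlation with tax:", np.round(corr, 3))
print("VIF per indicator:   ", np.round(vif, 1))
```

An indicator with a high correlation but a very large VIF carries much the same information as another indicator, so only one of the pair would normally be kept.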

2.3.2. Corporate Income Tax

This subsection analyses the relationship between corporate income tax and the economic indicators that affect it and provides a forecast analysis of corporate income tax. Corporate income tax [25] is a tax levied on the income from the production and operation of enterprises and other production units. It is levied on the income of taxpayers, which includes sales income, labour income, property income, interest income, etc.

The following economic indicators are determined based on business experience: GDP, total output value of industrial enterprises above designated size, added value of tertiary industry, area of commercial properties sold, total import and export, and RMB loan balance of financial institutions. The correlation between these economic indicators and corporate income tax is first examined from the data, and the collinearity between the indicators is analysed, in order to select the indicators that have a more obvious impact on corporate income tax. Forecast analysis of corporate income tax is then carried out and the forecast results are evaluated to determine the most suitable model. At the same time, a monthly time series forecasting analysis of corporate income tax is conducted to explore its quarterly pattern.
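One simple way to expose a quarterly pattern in a monthly series is to aggregate the months into quarters and compute each quarter's average share of the annual total. The sketch below illustrates that idea only; it is not the time series model used in the study, and the function name and the monthly figures are hypothetical.

```python
from collections import defaultdict


def quarterly_shares(monthly):
    """monthly maps (year, month) to an amount.

    Returns the average share of each quarter in its year's total.
    """
    by_year_quarter = defaultdict(float)
    by_year = defaultdict(float)
    for (year, month), amount in monthly.items():
        quarter = (month - 1) // 3 + 1
        by_year_quarter[(year, quarter)] += amount
        by_year[year] += amount
    shares = defaultdict(list)
    for (year, quarter), total in by_year_quarter.items():
        shares[quarter].append(total / by_year[year])
    return {q: sum(v) / len(v) for q, v in sorted(shares.items())}


# Made-up monthly receipts with a peak in the first month of each quarter.
pattern = [3, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1]
monthly = {
    (year, month): pattern[month - 1] * (1 + 0.1 * (year - 2015))
    for year in (2015, 2016)
    for month in range(1, 13)
}
shares = quarterly_shares(monthly)
print(shares)
```

Because each quarter is expressed as a share of its own year, overall growth from one year to the next does not distort the seasonal profile.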

2.3.3. Personal Income Tax

Personal income tax is a tax levied by the taxing authority on the legal income of natural persons in the country. Forecasting personal income tax first requires studying the relationship between personal income tax and the economic indicators that affect it. The following economic indicators were determined based on business experience: GDP and per capita disposable income of urban residents. The correlation between these economic indicators and personal income tax is first analysed from the data, together with the collinearity between the indicators, in order to select the indicators that have a greater impact on personal income tax and finally establish the relationship between personal income tax and those indicators.

3. Model Algorithms

3.1. Multi-Source Data Fungibility Test

The question considered here is whether fusing multiple sources of data helps to improve the coefficient estimates of a linear regression model. The single-source linear regression model can be expressed as

$$y = X\beta + \varepsilon, \tag{1}$$

where $y$ is the $n$-dimensional response vector, $X$ is the $n \times p$ design matrix, $\beta$ is the $p$-dimensional vector of true coefficients, and $\varepsilon$ is the $n$-dimensional noise vector, assumed to be $N(0, \sigma^2 I_n)$. The least squares estimate of $\beta$ is

$$\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2. \tag{2}$$

Assume that there are $K$ data sources; $X_k$ denotes the $n_k \times p$ design matrix for the $k$th data source, and $y_k$ denotes the corresponding $n_k$-dimensional response vector. If the underlying relationship between the predictors and the response is assumed to be the same across sources, it is feasible to estimate the coefficients for every data source using the same $\beta$. Removing this assumption, the difference between the coefficient vector of source $k$ and the coefficient vector shared by all sources can be denoted as $\delta_k$, where $k = 1, \dots, K$. Then, for each source $k$, we have

$$y_k = X_k(\beta + \delta_k) + \varepsilon_k. \tag{3}$$

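Assumption 1. $\varepsilon_k \sim N(0, \sigma_k^2 I_{n_k})$, independently across sources. In the multi-source data case, the weighted least squares estimate is

$$\hat{\beta}_w = \Big(\sum_{k=1}^{K} w_k X_k^{\top} X_k\Big)^{-1} \sum_{k=1}^{K} w_k X_k^{\top} y_k, \tag{4}$$

where $X_k$ and $y_k$ are the design matrix and response vector of source $k$ and $w_k$ is the weighting factor for that source.
To test whether the estimation of model coefficients for a particular data node can be improved with information from the data of other nodes, the mean square error is used as the measure of merit. Without loss of generality, source node 1 is taken as the reference and $\delta_1 = 0$ is set; letting $K = 2$ in equation (4), the estimate becomes

$$\hat{\beta}_w = \big(w_1 X_1^{\top} X_1 + w_2 X_2^{\top} X_2\big)^{-1} \big(w_1 X_1^{\top} y_1 + w_2 X_2^{\top} y_2\big). \tag{5}$$

When $w_k = 1/\sigma_k^2$, where $\sigma_k^2$ is the noise variance of source $k$, the error of $\hat{\beta}_w$ is minimised.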
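A weighted least squares fusion of two sources can be sketched numerically as follows. The estimator form (a weighted sum of per-source normal equations), the inverse-variance weights, the function names, and the simulated data are all assumptions made for illustration; the noise levels are treated as known, which would not hold in practice.

```python
import numpy as np


def fused_wls(sources, weights):
    """beta = (sum_k w_k X_k^T X_k)^{-1} (sum_k w_k X_k^T y_k)."""
    p = sources[0][0].shape[1]
    gram = np.zeros((p, p))
    moment = np.zeros(p)
    for (X, y), w in zip(sources, weights):
        gram += w * (X.T @ X)
        moment += w * (X.T @ y)
    return np.linalg.solve(gram, moment)


rng = np.random.default_rng(1)
beta_true = np.array([1.0, 2.0])


def simulate(n, sigma):
    """Simulate one source that shares beta_true (i.e. delta_k = 0)."""
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    return X, X @ beta_true + rng.normal(0, sigma, n)


src1 = simulate(200, sigma=0.5)  # low-noise reference source
src2 = simulate(200, sigma=3.0)  # high-noise auxiliary source

# Assumed weighting: inverse noise variance, with sigma treated as known.
beta_fused = fused_wls([src1, src2], weights=[1 / 0.5**2, 1 / 3.0**2])
beta_equal = fused_wls([src1, src2], weights=[1.0, 1.0])  # same as pooled OLS
print("inverse-variance weights:", np.round(beta_fused, 3))
print("equal weights:           ", np.round(beta_equal, 3))
```

With equal weights the estimator reduces to ordinary least squares on the stacked data, which makes it a convenient baseline for judging whether weighting helps.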

3.2. Multi-Source Data Regression Methods for Large Sample Scenarios

From equation (2), if $X$ has full column rank, the ordinary least squares estimate of $\beta$ is

$$\hat{\beta}_{\mathrm{ols}} = (X^{\top}X)^{-1}X^{\top}y. \tag{6}$$

The predicted value of the response vector, $\hat{y} = X\hat{\beta}_{\mathrm{ols}}$, can be obtained from equation (6). Let $H = X(X^{\top}X)^{-1}X^{\top}$ denote the hat matrix; its $i$th diagonal element $h_{ii}$ is the leverage score of the $i$th sample point, which reflects the degree of influence of the $i$th sample on the predicted value. By performing a singular value decomposition on $X$, $H$ can also be expressed as $H = UU^{\top}$, where $U$ is an $n \times p$ column-orthogonal matrix whose columns are the left singular vectors of $X$. Thus, the leverage score of the $i$th sample point can also be expressed as

$$h_{ii} = \lVert u_i \rVert_2^2, \tag{7}$$

where $u_i$ is the $i$th row of the matrix $U$. When the sample size $n$ is large, the computational cost of the full-sample estimate becomes very high. A viable alternative is therefore to resort to subsampling and use the subsample to estimate the coefficients. The steps of the subsampling algorithm are as follows.

(1) Construct a sampling probability $\pi_i$ for every original sample point, draw $r$ samples from the full sample according to this probability distribution, denoted as $(X^*, y^*)$, and construct the corresponding sampling probability matrix $\Phi^* = \mathrm{diag}(\pi_1^*, \dots, \pi_r^*)$. Uniform sampling: set $\pi_i = 1/n$; leverage sampling: set $\pi_i = h_{ii} / \sum_{j=1}^{n} h_{jj} = h_{ii}/p$.

(2) Solving the ordinary least squares problem on the subsample yields the unweighted subsample estimate $\tilde{\beta} = (X^{*\top}X^*)^{-1}X^{*\top}y^*$.
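The two steps of the subsampling algorithm can be sketched as follows, with leverage scores computed from the thin SVD. Drawing with replacement, the subsample size of 500, the function names, and the simulated heavy-tailed design are assumptions made for illustration.

```python
import numpy as np


def leverage_scores(X):
    """h_ii = squared norm of the i-th row of U, with U from the thin SVD of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)


def subsample_ols(X, y, r, probs, rng):
    """Step (1): draw r rows according to probs; step (2): fit unweighted OLS."""
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta


rng = np.random.default_rng(2)
n, p = 5000, 3
X = np.column_stack([np.ones(n), rng.standard_t(df=3, size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 1.0, n)

h = leverage_scores(X)
pi_lev = h / h.sum()           # leverage sampling; h.sum() equals p for full-rank X
pi_unif = np.full(n, 1.0 / n)  # uniform sampling

beta_lev = subsample_ols(X, y, r=500, probs=pi_lev, rng=rng)
beta_unif = subsample_ols(X, y, r=500, probs=pi_unif, rng=rng)
print("leverage subsample:", np.round(beta_lev, 3))
print("uniform subsample: ", np.round(beta_unif, 3))
```

Computing the scores through the SVD avoids forming the full $n \times n$ hat matrix, which matters precisely in the large-sample setting the subsampling is meant for.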

3.3. Subsampling Multi-Source Big Data Regression Algorithms

In the analysis above, before the source data are integrated, the weight parameter $w_k$ needs to be set; the variance of $\hat{\beta}_w$ is minimised when $w_k = 1/\sigma_k^2$, where $\sigma_k^2$ is the noise variance of source $k$. Similarly, for the subsampled data the weight parameter is set to $\hat{w}_k = 1/\hat{\sigma}_k^2$, where $\hat{\sigma}_k$ is the sample standard deviation of the subsample drawn from source $k$.

3.4. Shared Partial Coefficient Algorithm

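Step 1. Let $\{\pi_i^{(k)}\}_{i=1}^{n_k}$ be the sampling probability distribution for source $k$ and draw $r_k$ subsamples by rows from the data of each of the $K$ sources, with the subsamples denoted $(X_k^*, y_k^*)$, $k = 1, \dots, K$. Fuse them into a new dataset $(X^*, y^*)$.

Step 2. Use the fused data $(X^*, y^*)$ to calculate the least squares estimate $\tilde{\beta}$.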
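Steps 1 and 2 can be combined into a short sketch. Several choices below are assumptions rather than details taken from the paper: uniform sampling within each source, the same subsample size for every source, weights equal to the inverse of the squared residual standard deviation of each subsample fit, and all function names and simulated data.

```python
import numpy as np


def fuse_subsamples(sources, r, rng):
    """Subsample each source, weight it by 1 / sigma_hat_k^2, and fit the fused estimate."""
    p = sources[0][0].shape[1]
    gram = np.zeros((p, p))
    moment = np.zeros(p)
    sigma_hats = []
    for X, y in sources:
        # Step 1: draw r rows from this source (uniform probabilities assumed).
        idx = rng.choice(X.shape[0], size=r, replace=True)
        Xs, ys = X[idx], y[idx]
        # Estimate the noise level of the source from its subsample residuals.
        beta_k, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        sigma_hat = np.std(ys - Xs @ beta_k, ddof=p)
        weight = 1.0 / sigma_hat**2
        gram += weight * (Xs.T @ Xs)
        moment += weight * (Xs.T @ ys)
        sigma_hats.append(sigma_hat)
    # Step 2: least squares on the fused (weighted) subsamples.
    return np.linalg.solve(gram, moment), sigma_hats


rng = np.random.default_rng(4)
beta_true = np.array([1.0, 2.0, -1.0])
sources = []
for sigma in (0.5, 1.0, 2.0):
    X = np.column_stack([np.ones(20000), rng.normal(size=(20000, 2))])
    sources.append((X, X @ beta_true + rng.normal(0, sigma, 20000)))

beta_fused, sigma_hats = fuse_subsamples(sources, r=400, rng=rng)
print("fused estimate:", np.round(beta_fused, 3))
print("estimated noise levels:", np.round(sigma_hats, 3))
```

Only 1,200 of the 60,000 simulated rows enter the fit, which is the saving in computation and storage that motivates subsampling before fusion.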

4. Forecasting Methodology Research and Analysis

This section focuses on the full process of researching the forecasting and analysis methodology for different taxes, with forecasting and analysis of business tax, corporate income tax, and personal income tax, respectively.

4.1. Data Collection

Collecting data is the first step in predictive analysis, and in order to obtain high-quality data, the data used in this paper are taken mainly from relevant official websites. For analysis purposes, the data are split into a training set and a validation set, with the training set covering 1995 to 2017 and the data from 2018 to 2020 serving as the validation set.

The tax and economic indicators are not in the same units: some are in RMB billion, some are in USD, and some are in million square metres. However, in the data mining process, only the influence of the weight of each independent variable on the dependent variable is discussed, so the influence of the units is not considered here.
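The remark on units can be checked numerically for an ordinary least squares fit: re-expressing a predictor in different units rescales its coefficient but leaves the fitted values unchanged. The sketch below uses synthetic data and hypothetical names, and it covers plain least squares only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 23
X = np.column_stack([np.ones(n), rng.uniform(1, 10, n), rng.uniform(100, 900, n)])
y = X @ np.array([5.0, 2.0, 0.01]) + rng.normal(0, 0.1, n)


def ols_fit_predict(X_train, y_train, X_new):
    """Fit OLS on the training data and return (predictions, coefficients)."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_new @ beta, beta


# Re-express the third column in thousands of its original unit.
scale = np.array([1.0, 1.0, 1e-3])
pred_raw, beta_raw = ols_fit_predict(X, y, X)
pred_scaled, beta_scaled = ols_fit_predict(X * scale, y, X * scale)

print("max prediction difference:", np.max(np.abs(pred_raw - pred_scaled)))
print("coefficients, raw units:   ", np.round(beta_raw, 4))
print("coefficients, scaled units:", np.round(beta_scaled, 4))
```

The coefficient of the rescaled column grows by the inverse of the scale factor, so coefficients should only be compared across indicators after the units have been taken into account.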

In this paper, three training sets are used: the annual dataset of business tax, the annual dataset of corporate income tax, and the annual dataset of personal income tax. A validation set is prepared for each of these tax types as well. The conventional ratio between the training set and the validation set is 7 : 3; however, because the amount of data collected is limited, only three records for each tax type are selected as the validation set. The training set for business tax and related indicators is shown in Table 2.
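The split by year can be written down directly. The sketch below uses placeholder rows and a hypothetical function name; it only shows how the 1995 to 2017 training range and the 2018 to 2020 validation range translate into record counts.

```python
def split_by_year(rows, last_train_year=2017):
    """Put rows up to last_train_year in the training set and later rows in the validation set."""
    train = [row for row in rows if row["year"] <= last_train_year]
    valid = [row for row in rows if row["year"] > last_train_year]
    return train, valid


# Placeholder rows: one record per year, indicator values omitted.
rows = [{"year": year} for year in range(1995, 2021)]
train, valid = split_by_year(rows)
validation_share = len(valid) / len(rows)
print(len(train), len(valid), round(validation_share, 3))
```

With 23 training years and 3 validation years, the validation share is roughly 12%, well below the 30% implied by the conventional 7 : 3 ratio.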

For the years 1995–2017, a detailed comparison of the series in the business tax dataset is shown in Figure 2. As can be seen in Figure 2, GDP far exceeds the other series in magnitude during the period 1995–2017 and grows at an increasing rate as the years progress. Business tax is the smallest of the series and remains at the bottom throughout the period.

The validation set for business tax and related indicators is shown in Table 3.

The annual data for the corporate income tax and related indicator training set are shown in Table 4.

A comparison of the data for each subcategory of the corporate income tax and related indicator training set is shown in Figure 3. Figure 3 compares the sizes of the five series using a radar chart, in which a larger enclosed area corresponds to a larger amount, giving a visual indication of the share of each subcategory in the total.

The annual data for the corporate income tax and related indicator validation set are shown in Table 5.

A comparison of the data for each subcategory of business tax and corporate income tax is given in Figure 4. The graph shows the share of each subcategory of business tax and corporate income tax, respectively. The larger a coloured area in the graph, the greater the share of the corresponding item in the total.

The annual data for the personal income tax and related indicator training set are shown in Table 6.

The three series in the personal income tax training set are compared in Figure 5. Figure 5 shows that GDP and disposable income both increase alongside personal income tax, and the series rise broadly in step as the years progress.

The annual data for the personal income tax and related indicator validation set are shown in Table 7.

A comparison chart of the three subcategories of the annual data for the personal income tax and related indicator validation set is shown in Figure 6.

As can be seen from Figure 6, the disposable income series far exceeds the other two in 2018–2020, with GDP second and personal income tax the smallest; the analysis takes this to show that disposable income has the greatest impact on the overall tax.

Fiscal tax forecasting analysis was carried out based on multi-source big data fusion under the three tax scenarios, using a subsampling multi-source big data regression algorithm to integrate the three sources of data together for forecasting. The forecast results for the three tax categories for 2018–2020 are shown in Tables 8–10.

A comparison of the data in these three tables is shown in Figure 7, which plots the true and predicted values for the three tax categories: the closer the two lines are, the more accurate the prediction, and the further apart they are, the worse the prediction.

As can be seen from the data above, the forecast results for business tax are better than those for corporate income tax and personal income tax, with a lower error rate relative to the actual data. The forecasts for all three tax types stay within acceptable error limits given the large magnitude of the data, demonstrating the effectiveness of a tax forecasting methodology based on multiple sources of big data.

5. Conclusion

This paper has combined statistical data and multi-source big data fusion analysis tools for tax forecasting research and evaluated the proposed subsampling method for multi-source data fusion in a big data scenario, which reduces computational effort and storage costs. Compared to the method without data fusion, the model estimated by sampling followed by fusion is superior. Business tax, corporate income tax, and personal income tax are predicted separately, and the forecast results of corporate income tax and personal income tax are compared. The forecasting approach as a whole follows the principle of multi-source data fusion, and the process includes six steps: business understanding, data understanding, data preparation, data modelling, model evaluation, and model release. The experimental results show that corporate income tax has the highest proportion of the three tax types, while the prediction error rate of business tax is the lowest. The indicators considered in this paper for tax forecasting are limited, whereas in practice many more factors influence tax revenue; future work will explore more of these factors to make tax forecasting more accurate.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.