Abstract

This paper proposes a modified Genetic Programming method for forecasting the mobile telecommunications subscribers’ population. The method constitutes an expansion of the hybrid Genetic Programming (hGP) method improved by the introduction of diffusion models for technological forecasting purposes in the initial population, such as the Logistic, Gompertz, and Bass, as well as the Bi-Logistic and LogInLog. In addition, the aforementioned functions and models expand the function set of hGP. The application of the method in combination with macroeconomic indicators such as Gross Domestic Product per Capita (GDPpC) and Consumer Prices Index (CPI) leads to the creation of forecasting models and scenarios for medium- and long-term level of predictability. The forecasting module of the program has also been improved with the multi-levelled use of the statistical indices as fitness functions and model selection indices. The implementation of the modified-hGP in the datasets of mobile subscribers in the Organisation for Economic Cooperation and Development (OECD) countries shows very satisfactory forecasting performance.

1. Introduction

Forecasting is an endogenous process intertwined with the evolution of science. Forecasting methodology is divided into two categories: qualitative and quantitative. Qualitative methods employ the judgment of experts group to produce forecasts [1]. These procedures are mainly applied without using historical data. Quantitative forecasting methods are used when historical data are available as well as the assumption that some of the past patterns will be repeated in the future [2].

There is a variation of quantitative methods such as the time series forecasting which use past trend to forecast the future values of the variable and causal methods that, besides the past trend assumption, also examine the correlation of the variable with other indicators.

The adoption of innovative technologies by a society such as the mobile telecommunications adoption has been discussed and some widely used forecasting models have been proposed. The diffusion processes as well as the produced models are described in the literature [38].

The most commonly used diffusion models are Gompertz, Logistic, and Bass [6] which are dynamic models and follow a sigmoid curve against time. In order to follow the overall diffusion process of the mobile wireless penetration in time, we also employ the Bi-Logistic and LogInLog models which are described in the next section of this paper. The parameters of the models have been estimated by regression analysis with the Least Squares Method [9].

In addition to time response, we investigate the relationship of the produced models with some macroeconomic indicators such as GDPpC and CPI. The core work is an expansion, modification, and implementation of the hybrid Genetic Programming (hGP) method which was presented in [8] in terms of the insertion of new diffusion models as well as the macroeconomic indicators dependence.

The term Genetic Programming (GP) method is a generalization of the Genetic Algorithm (GA) which represents a heuristic method that employs the Darwinian principle of natural selection in finding an appropriate solution of a well-defined problem and every produced solution corresponds to a new program [10, 11].

The basic structure of the paper follows. Firstly, a brief reference to the GP method and the diffusion models are presented. The hGP technique analysis follows as well as the description of the modifications and expansion on it. The next section analyses the results of the hGP implementation. After that, we discuss the forecasting results, and, finally, the conclusion is presented.

2. Genetic Programming Method

GP was introduced by Koza in [11]. In his work, the solution of a problem corresponds to a chromosome-program. The main difference between GP and GA is the representation of solutions. The tree-based representation is adopted by GP method, while a string of numbers represents the solution in GA methodology. The tree-based representation consists of nodes. The nodes represent functions or leaves which correspond to the terminals of the solution, such as variables or constants [12, 13].

The steps for the GP construction are generally the following. Firstly, GP produces an initial population of random programs-solutions composed of the functions and terminals of the problem. The next step iteratively performs the following substeps until a termination criterion will be satisfied: execution of each program and assignment of fitness value according to the precision of each solution [12]. Then, GP generates a new generation of solutions by applying the operations of reproduction, crossover, and mutation. The selection of the candidate solution is performed by probability-based criteria on the fitness value. Reproduction refers to the copy of a solution to the new population. In crossover operation, the selected chromosomes are randomly combined per two and, recombining its chosen parts, generate new chromosomes (offspring) [12]. The mutation changes a function in a chromosome structure with another function. The chromosomes of the new generation have better overall fitness value. The whole process is repeated until a termination criterion is satisfied [9, 11, 12].

3. Diffusion Process and Forecasting with Models

Rogers [7] considers that the adoption of an innovative product by a society follows the diffusion process and it has the sigmoid curve format. In this paper, besides the well-known Logistic, Gompertz, and Bass models, we investigate the Bi-Logistic [14, 15] and the LogInLog which is inspired by the solution of the Dodd model in [16, 17].

A diffusion process is described by dynamic or nondynamic models according to whether the level of saturation is changing over time (“carrying capacity”) [14] or constant, respectively. The differential equation which describes the fundamental diffusion model follows the following formulation: where is the estimated diffusion saturation level for time and is the diffusion penetration and function is the diffusion coefficient.

3.1. Logistic Diffusion Model

The Logistic model is the solution of the differential equation (1) which describes the diffusion process. The Logistic model is described by where is the diffusion of a new product in a society, at time . Also, is a time dependent function and are constant parameters. The constant is the upper limit of the function , known as the saturation level. When time , then [9].

3.2. Gompertz Diffusion Models

The Gompertz model has been extensively used in forecast processes [3, 8, 9]. The Gompertz I format of the equation that this paper proposes is that of

Also, a variation of (3) format is the following Gompertz II format with constant, in where, in both formats, is a time dependent function and are constant parameters [9].

3.3. Bass Model

Bass proposes that the adoption of a new product by a market consists of two major categories: innovators and imitators. The overall diffusion process starts with the innovators adoption of the new product or the innovative technology and then the imitators follow.

The cumulative adoption of the new technology for time is presented in

In (5), parameter corresponds to initial purchasers of the new technology product. Parameter is the sum of the innovators and imitators coefficients, and , respectively, . Parameter is , where is a constant and parameter is [6, 9].

3.4. Bi-Logistic Model

In some cases, the overall life of a product, like mobile telecommunications, has many phases-generations. For this purpose we employ the Bi-Logistic curve which is the sum of two Logistic curves [14]. So, where and . In the first generation, saturation is constant as well as of the second generation. Parameters , , , and are constants and and are the introduction time of the first and second generation, respectively [15, 17].

3.5. Logistic Growing Saturation Level in Logistic Model (LogInLog)

In this case, the saturation level is time dependent and it follows the Logistic diffusion model until the upper saturation of [1517]. The whole diffusion process follows the Logistic model [17], as where and ; parameters are constants. The saturation level follows in where parameter is a constant.

This model describes the diffusion process when an innovative technology has created generations which are not clearly separated [17]. It should be noted that this model is derived by the generalization of the solution of the Dodd model in [16, 17]. The parameters of the model are optimized by the least square regression.

4. Modified Genetic Programming Method

The hybrid Genetic Programming method in fitting and forecasting was presented in a previous work [9]. In this paper, the modified-hGP is presented extensively. The modified hGP implements a strategy which consists of three parts, the nonlinear regression analysis, the genetic algorithm part, and the final model selection. The flowchart of Figure 1 shows the parts of the modified hGP.

4.1. Initialization of Preparation Steps Parameters

This stage of the method contains the preparation steps for the program execution process [11, 12, 18]. The first step is the function set definition. In the modified hGP, the set has two subsets for the arithmetic and mathematical functions, and , respectively, [18]. So, , where and . It should be noted that division is zero protected for the denominator and is the natural logarithm.

In the second step, the terminal set of the variables and constants sets is defined. The variables set and , where , GDPpC, and PCIn are the variables for time, GDP per Capita, and normalized CPI, respectively, and is the randomly generated constants with domain in .

The next step is to define the fitness function for each solution. Various statistical indicators can be used for the fitness function during the evaluation process. Following the previous implementation of the hGP [9], two different fitness functions are used as follows.

In the fitting process, each chromosome is evaluated with the Sum of Squared Error (SSE), as in

In (9), the sum is over the time period . Also, is the real data for time and is the model’s value [9].

In forecasting, the fitness function refers to the weighted sum of squared error (wSSE) function, as in

In this function, a weight is used, in order to give greater weight at the time interval near the last training data [9].

Finally, the maximum number of generations is defined to end the execution of the GP.

4.2. Initial Population

As mentioned before, the function set of the modified hGP is extended compared to hGP. Apart from the primary arithmetic functions set , a mathematical functions set has been inserted; . So, the modified hGP has simplified the chromosomes structure, Figures 2 and 3, while, at the same time, their mathematical efficiency has been improved.

The expressions of the randomly created solutions combine the following primary block format, whereas each part is randomly chosen.Block: .

The solutions of the initial population are the combination of random chosen functions, variables, constants, and primary blocks. Also, the optimized Logistic, Gompertz I, Gompertz II, Bass, Bi-Logistic, and LogInLog diffusion models are being inserted in the population. The parameters of the diffusion models are optimized by nonlinear regression analysis and the Levenberg-Marquardt algorithm has been used [9].

4.3. Solution Representation

In modified hGP, each chromosome is a string of characters and corresponds to a program that is a possible solution to the problem [12]. The inner representation of a string of characters is considered as a parse tree using the abstract syntax trees of Python Programming Language. For example, the chromosomes and are presented in Figures 2 and 3 as strings and parse trees, respectively [12].

The parse tree consists of nodes. There are two types of nodes, the terminal and nonterminal nodes. The terminal nodes (leaves) of the tree contain the variables or the constants. In contrast, the nonterminal nodes of the tree consist of the modified-hGP functions [9].

4.4. Evaluation of Solutions according to Fitness Function Value

As stated above, the best solution is selected according to (9) for fitting and (10) for forecasting purposes. The evaluated solutions are inserted into a sorted Python’s list. The solutions that are not satisfying a precision limit criterion are removed. The remaining accepted solutions of the list are sorted according to their fitness value and they are candidates to become parents for the crossover operation or to be chosen in mutation. In Figure 4, the structure of the list is depicted. It should be noted that the problem of the solutions trapping into local optimum is solved keeping one of all the individuals having the same fitness value in the list.

In tournament selection, a number of solutions from the sorted solutions’ list are selected at random and, then, the best is chosen for the crossover or mutation operation.

4.5. Crossover

In the crossover operation, two parents are randomly selected, according to the tournament selection process, from the sorted by the best fitness value solutions’ list.

In each parent solution, a crossover point is randomly chosen. The substring of each parent beginning at the crossover point is interchanged between two parents’ solutions and the children (offspring) are generated. The crossover operation is presented in Figures 5 and 6.

4.6. Mutation

In the mutation process, a solution is chosen by tournament selection from the tournament list. Once again, a string’s point, which depicts a function, is randomly chosen. The mutation replaces the chosen function from the  ,  set, with a new random function in the solution.

The mutation operation is presented in Figures 7 and 8.

4.7. Fitness Function and Final Model Selection: Statistic Indicators

The fitness function of each individual in the modified hGP method is the Sum of Squared Error (SSE) for the fitting process, as in (9), and the Weighted Sum of Squared Error (wSSE), as (10) presents.

The statistical indices in the modified hGP [9] are the Mean Absolute Percentage Error (MAPE), the Mean Square Error (MSE), the Mean Absolute Error (MAE), and the Root Mean Square Error (RMSE).

MAPE is presented in (11). The is the raw data values, corresponds to the predicted value, and is the number of the data points:

MSE, MAE, and RMSE are presented in (12), (13), and (14), respectively:

In addition, this study has deployed a Bayesian’s Information Criterion (BIC) inspired format [19] in the final model selection for the forecasting, as the following depicts:

In (15), , , and correspond to Fitness Function Value (wSSE for forecasting), parameters of the model, and the number of data points, respectively.

It should be noted that in the final selection process of the appropriate forecasting model we use the half of dataset before the last observed data point.

5. Macroeconomic Indicators

In this section, the macroeconomic indicators of Gross Domestic Product per Capita (GDPpC) and normalized Consumer Prices Index (CPI) will be presented. The GDPpC is a macroeconomic index for the productivity of a country and it could not be considered as index of personal income.

According to [20], the basic index of the value of the goods and services produced by a country is the Gross Domestic Product (GDP). The GDPpC indicates the living standards of the economy in a country.

In general, CPI indicates a weighted average of basic consumer goods prices. Moreover, in this study, the CPI relies on the individual consumption expenditure of households, less energy, and food consumption [20]. It should be noted that CPI is normalized on the CPI of the year 2005. In Figure 9, the yearly GDPpC and CPI for the time period between the years 1997 and 2009 are presented.

It should be noted that, after the year 2008 (“economic crisis year”), the OECD’s GDPpC is decreased for 2009, but, on the other hand, the CPI has bigger tolerance.

6. Mobile Telecommunications Growth: A Brief History

According to [21], the first mobile telecommunications were introduced with analogue networks in the early 1980s, for voice transmission. The second generation (2G) mobile network (Global System for Mobile Communication, GSM) followed in the early 1990s and digital mobile networks were born with the first SMS service. In the late 1990s, enhanced digital generation (2.5G) was introduced for data services. The data services were changed from circuit switched transport (GSM) to packet data transport with General Packet Radio Services (GPRS) and, later, data rates grew with enhanced digital technologies such as Enhanced Data rates for GSM Evolution (EDGE).

Also, in 2003, the next generation (3G) of mobile networks, Universal Mobile Telecommunications System (UMTS), emerged with the first video-calls and, later, (around 2006) was upgraded to High Speed Packet Access (HSPA) with data rates of 14 Mbps in the downlink and 5.76 Mbps in the uplink. Then, HSPA was upgraded to HSPA+ with theoretically 168 Mbps and 22 Mbps for downlink and uplink, respectively, and data services as videos, mobile email, and music. In 2009, Long-Term Evolution (LTE) was launched for commercial usage, while a new generation (4G) of technology is coming [21]. Figure 10 depicts the evolution of mobile technologies generations in parallel with the overall OECD mobile subscribers, contract, prepay, and 3G subscribers, from the year 1997 to 2009 [22].

It should be noted that the number of the mobile subscribers is growing through the technology generations evolution.

7. Results of the Method

The results will be analysed in order to provide a satisfactory prediction for mobile subscribers which consist of mobile contract subscribers and mobile prepay subscribers in OECD countries, as well as mobile 3G subscribers.

This study investigates the implementation of modified hGP on four different datasets. The datasets present the total yearly number of OECD mobile subscribers, the yearly number of mobile contract subscribers, the yearly number of mobile prepay subscribers, and finally the yearly number of mobile 3G subscribers. The observation period begins from the year 1997 to 2009, which is comprised of 13 data points.

7.1. Fitting Results for the OECD Mobile Telecommunications Subscribers

Table 1 contains the initialization parameters for the execution of the modified-hGP concerning the data sets in OECD countries.

The fitting performance of the first modified-hGP model for the total number of OECD subscribers, according to its fitness value (SSE), is presented in Figure 11. Also, Figure 12 presents the errors of the models in time (residuals). The relative statistical indices SSE, MAPE, MSE, RMSE, and MAE of the modified-hGP models are presented in Table 2.

The corresponding modified-hGP model has the following format: + 

As one can see, this method combines different variables like GDPpC or CPI with the independent variable of time. In Table 2, the modified-hGP method achieves excellent statistical performance, showing an SSE value of .

The fitting performance and the residuals for the remaining data sets are presented in Figures 13 and 14 for contract subscribers, Figures 14 and 15 for prepay subscribers, and Figures 16 and 17 for 3G users, respectively. The relative statistical indices SSE, MAPE, MSE, RMSE, and MAE of the produced modified-hGP models are presented in Table 3 for contract, Table 4 for prepay, and Table 5 for 3G subscribers.

The corresponding modified-hGP model for contract subscribers has the following format: + 

It should be noted that this method combines different variables like GDPpC or CPI with the independent variable of time and a variation of diffusion models’ blocks. The performance of the model corresponds to a good enough behavior in fitting process. The error performance in fitting is depicted in Figure 14.

The corresponding modified-hGP model for prepay subscribers has the following format: +  +  + 

Once again, model yields a good enough behavior in fitting process. The error performance in fitting is depicted in Figure 16

Finally, the corresponding modified-hGP model for 3G subscribers has the following format: + 

Finally, the modified-hGP model yields a satisfactory fitting performance. The error performance in fitting is depicted in Figure 18.

7.2. Forecasting Results for the OECD Mobile Telecommunications Subscribers

The forecasting results of the generated models by the modified-hGP method are presented in this section, as well as the combined diffusion models with the modified-hGP models. As mentioned before, the statistic indicator wSSE has been used for the forecasting process.

The initialization parameters for the execution of hGP are presented in Table 6, for the forecasting process. The forecasting method for a 2-year prediction uses 11 data points as training set of the GP method, except for the 3G training set which consists of 7 points. The forecasting performance of the modified-hGP models concerning total OECD mobile subscribers, contract, prepay, and 3G is depicted in Figures 19, 22, 24, and 25, respectively. In every graph, the forecast period window is presented in the blue rectangle.

The corresponding modified-hGP model has the following format, which has a Bi-Logistic behavior and it is time dependent: +  + 

The forecasting performance of the optimized diffusion models, according to their fitness value (wSSE) for the 11 training points, is presented in Figure 20. Also, the relative statistical indices, concerning the whole dataset, of the produced forecasting modified-hGP and diffusion models are presented in Table 7.

Considering Table 7, it can be concluded that the modified-hGP method achieves good statistical indices combining some optimized diffusion models. We can see that the first hGP model achieves a wSSE value of 0.000226, while the best of diffusion models, Bi-Logistic, has a similar 0.000281. It should be noted that modified-hGP model residuals against time (data points), especially for the 2 last data points (the forecast period), show the error response of the GP model (see Figure 21).

The modified-hGP model’s performance, concerning OECD mobile contract forecasting, is depicted in Figure 22. The corresponding modified-hGP model has the following format, which is time and GDPpC dependent:

The forecasting performance of the diffusion models is presented in Figure 23. Also, the statistical indices of the produced forecasting modified-hGP and diffusion models are presented in Table 8.

From Table 8, one can conclude that the modified-hGP method as well as diffusion models achieves good statistical indices. We can see that the hGP model achieves a wSSE value of and the Bi-Logistic 0.000116. Once again, the modified-hGP model residuals against time (data points), especially for the 2 last data points (the forecast period), show the error response of the GP model (see Figure 24).

The modified-hGP model’s performance concerning OECD mobile prepay forecasting is depicted in Figure 25. The corresponding modified-hGP model has the following format, which is time, GDPpC, and CPI dependent:

The forecasting performance of the diffusion models is presented in Figure 26. Also, the statistical indices of the produced forecasting modified-hGP and diffusion models are presented in Table 9.

Table 9 shows that the modified-hGP method achieves a satisfactory performance. The modified-hGP model achieves a wSSE value of 0.000725 and the Bi-Logistic 0.000938. Once again, the modified-hGP model residuals against time (data points), especially for the 2 last data points (the forecast period), show the error response of the modified-hGP model, as Figure 27 depicts.

Finally, the modified-hGP model’s performance, concerning OECD mobile 3G forecasting, is depicted in Figure 28. The corresponding modified-hGP model has the following format, which is time dependent:

The forecasting performance of the diffusion models is presented in Figure 29. Also, the statistical indices of the produced forecasting modified-hGP and diffusion models are presented in Table 10.

Table 10 shows that the modified-hGP method achieves a good performance. The modified-hGP model achieves a wSSE value of similar to the Logistic and LogInLog. The MAPE indicator has the specific performance cause of the initial error at the first data point. As Figure 30 depicts, the modified-hGP model residuals against time (data points), especially for the 2 last data points (the forecast period), show the error response of the modified-hGP model.

7.3. Comparison of the Results with ARIMA Model

The forecasting results of the generated models by the modified-hGP method are compared with those of the ARIMA method derived. As mentioned before, the statistic indicator wSSE has been used for the forecasting process.

ARIMA is an acronym for Auto-Regressive Integrated Moving Average. The model can be written as

In (16), is the backward shift operator. The backward shift operator for th-order difference shifts the data periods back. In general, a th-order difference can be written as

The operator stands for the order of the autoregressive part and operator for the degree of the derivative of part and the is order of the moving average part of (16). The are the parameters of the autoregressive part of the model, the are the parameters of the moving average part, and are the errors [2].

The ARIMA models that are derived by the implementation of the “Gretl, Gnu Regression, Econometrics and Time-series Library” for the aforementioned datasets are depicted below. The forecasting performance of the same modified-hGP models and ARIMA models concerning total OECD mobile subscribers, contract, prepay, and 3G, is depicted in Figures 31, 32, 33, and 34, respectively. In every graph, the forecast period window is presented in the blue rectangle. In Tables 11, 12, 13, and 14, the comparison results of the statistical indices MAPE, MSE, RMSE, and MAE for the two predicted points are presented.

Considering Table 11, it can be concluded that the modified-hGP method achieves better forecasting performance than ARIMA model.

From Table 12, one realizes that the hGP method presents better performance than ARIMA model.

Also, in Tables 13 and 14, the modified-hGP method achieves better forecasting statistics than ARIMA model.

It could be concluded that the overall performance of the modified-hGP achieves better statistic indices than ARIMA model for the predicted data points.

7.4. Robustness of the Proposed Modified-hGP

The proposed method has been tested for the stability and the robustness. The program was executed 20 times in the same dataset of the mobile subscribers. The mean gap between the best and worst solutions was decreasing as the generation was increasing. Also, the curve of the total average of fitness value per generation was decreasing. It should be noted that in Table 15 the program parameters for the testing process are presented. In Figure 35, the mean value for the fitness value for the best and worst solutions (mean gap) per generation of the testing modified-hGP and the average fitness value per generation for the program executions are presented.

The difference between the worst and the best solutions is decreasing. In particular, after the 25th generation, the indices above are converging. The mean gap of wSSE for the worst-best solution begins from value 0.004146752 and ends up to 0.000165101.

8. Towards a Causal Forecasting Model: A Study of the Mobile Subscribers in OECD Countries

The introduction of GDPpC and CPI outside the time variable leads to the creation of causal forecasting models. This method provides a scenario based approach to forecasting. In order to study the future of mobile subscription in OECD countries, three scenarios are presented, according the GDPpC and CPI growth.

The pessimistic one concerns a continuing crisis scenario, so that the GDPpC and CPI growth rates are not increased. The second is a moderate growth scenario and the last one is the optimistic scenario, with GDPpC and CPI getting increased.

A variation of models is generated by the implementation of the modified-hGP method. According to Bayesian’s criterion as well as the wSSE criterion, two models which combine all the variables, GDPpC, CPI, and time, are chosen. Figures 36, 37, and 38 depict the pessimistic, moderate, and optimistic scenario, respectively.

In Table 16, the selected models and their statistics are presented.

The BIC depends on the number of the parameters. The generated models with one variable, like time dependent models, in many cases, have better BIC performance, but not always better forecasting performance.

In contrary, multivariable models, with good enough BIC, yield a good enough forecasting performance.

In the pessimistic scenario, the first model (most likely) achieves 1.538472 billion of OECD mobile subscribers, in the year 2014. The second has a total number of 1.415613 billion subscribers. It should be noted that the GDPpC and CPI growths are unchanged.

In the moderate scenario (the most likely scenario), the first model achieves 1.8 billion of OECD mobile subscribers, in the year 2014. The second has a total number of 1.58 billion subscribers. It should be noted that the average GDPpC rate is 2.5% and average CPI rate growth 1.7%.

In the optimistic scenario, the first model achieves 1.948 billion of OECD mobile subscribers, in the year 2014. The second has a total number of 1.686 billion subscribers. It should be noted that the average GDPpC rate is about 4.4% and average CPI rate growth 2.2%.

9. Conclusions

This paper is a modification of our previous work [9] where the dataset was bigger, but in different area of interest. In this paper, an improved-hGP method was presented. The improved program achieved interesting forecasting models with more variables than one. This GP method was implemented in dataset concerning the mobile subscribers of the OECD countries. The forecasting performance of the modified-hGP as well as the diffusion and ARIMA models was presented and the method presented satisfactory statistical indices.

The proposed method differs from the hGP in some points. Firstly, the diffusion models’ set is extended with Bi-Logistic and LogInLog except for Logistic, Gompertz, and Bass so that the forecast horizon is improved, for long-term forecasting. Also, the functions’ set of the method is extended by the insertion of new functions and function blocks. According to this technique, chromosomes with complicated syntax expressions can be presented with short length expression stings. The tournament selection is implemented for the crossover and mutation operations in order to maximize the algorithm’s efficiency. Finally, a Bayesian inspired criterion has been implemented which, in combination with wSSE, improves the final selection of the forecasting models.

In general, the method could be considered as a forecasting tool that produces time dependent models and causal models for long-term forecasting with more variables than one. It should be noted that this method is compared with ARIMA model and achieved satisfactory performance. Also, the robustness of the proposed method has been analyzed. The implementation of the method is going to be continued on more datasets and it will be compared with other prediction methods in future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors wish to express their acknowledgments to Professor Imed Kacem, University of Lorraine, France, for his constructive comments and suggestions, which helped to improve the quality of this paper.