Abstract

Forecasting healthcare costs is important for the financial management of both governments and individual citizens. However, large individual differences in health status and the complexity of the factors that influence cost make prediction challenging. With the unprecedented adoption of mobile devices, ordinary individuals can contribute diverse dimensions of data for medical cost prediction, and hospitals and healthcare service providers are setting up their own mobile services to collect user data for analysis. Previous methods usually employed traditional machine learning or simple neural networks, which are difficult to apply to nonlinear medical costs and multidimensional data. To address these issues, this paper proposes a multitask learning-based framework for interpretable medical cost interval prediction. The framework first predicts subcost intervals from the multidimensional data collected from mobile ends, following the multitask learning paradigm. The total cost interval is then predicted from these subcost predictions. Simultaneously, the framework derives a decision tree from the parameters of the multitask learning network and calculates the importance of each feature in predicting the cost intervals. Experiments on real-world data demonstrate the method's effectiveness.

1. Introduction

Managing healthcare costs is one of the largest challenges in health insurance and healthcare; poor management can easily lead to a shortage or waste of healthcare resources [1-4]. Owing to the widespread adoption of mobile devices, patients and ordinary citizens can freely contribute their own data for medical cost prediction. Organizations such as healthcare service providers and hospitals are setting up their own applications in response to this trend. Patients and subscribers can use these mobile apps to contribute multiple types of data, such as demographic attributes, manually entered daily healthcare records, and even sensing data from smart watches [5]. It is therefore of great significance to study how these data can be used for medical cost prediction, which can bring personalized and understandable services to patients.

Currently, DRG (Diagnosis Related Group)-based payment methods are widely used to predict costs through characteristic groupings [6, 7], which has strongly motivated research on reliable medical cost prediction. Various methods have been proposed to accurately predict cost ranges and identify key factors for grouping, allowing for efficient resource deployment and timely identification of potential risks. These methods are expected to reduce pressure on healthcare resources and improve resource utilization [8, 9] without disclosing significant personal information of patients [10-12]. However, because individual patients choose different treatment options, the amount and composition of medical costs are highly personalized and divergent [13]. Moreover, because patients' conditions differ and factors such as healing time and degree of recovery also play a role, it is difficult to fit medical costs with simple linear models. Therefore, medical cost prediction requires both the use of the various dimensions of data available from mobile ends and feedback to users that conveys a deep understanding of how individual characteristics affect healthcare costs [14, 15].

To address these challenges, traditional methods rely heavily on machine learning models such as linear regression [16] and regression trees [17, 18], as well as simple neural network models. However, their overall representation ability is inadequate owing to the sophisticated correlations among factors. In recent research, deep learning methods [19] have outperformed traditional computational methods in various prediction tasks due to their ability to adapt the composition of individual feature factors for better representation [20]. Given the complexity of the components and data dimensions in medical cost prediction [21], deep learning methods can make more accurate and reasonable predictions of overall costs by modeling the correlation between the various costs in addition to predicting individual costs.

Based on the above, this paper proposes a multitask learning-based interpretable medical cost interval prediction framework. The model takes multiple sources of information about the patient into account, including (1) the patient’s natural characteristics, (2) the stage of the patient’s condition, and (3) the patient’s lesion attributes, and outputs the prediction results for each type of cost interval.

The framework is made up of two parts. (1) A multitask learning framework for interval prediction over data collected from mobile ends. The cost intervals are predicted in two steps. First, logistic regression is applied as a preprocessing step to the input data, and its output is fed into the neural network together with the original data to predict the various subcost intervals. The total cost interval is then predicted from the subcost interval predictions. Here, logistic regression is used to improve the network’s convergence and training speed. (2) An explainable and personalized decision tree based on the analysis of factor importance in multitask learning. The Gini coefficient is reconstructed using the weights of the trained multitask learning framework to build a decision tree, and the importance of each feature is calculated from the decision tree.

The proposed framework has two advantages for medical cost prediction. On the one hand, it predicts total costs by coupling subcosts, so that all subcost prediction intervals are obtained while the links between subcosts and the total payment are also captured; on the other hand, it can analyze the importance of different factors in the prediction of cost intervals based on the prediction process. The corresponding observations can serve as a foundation for physician triage.

To the best of our knowledge, this is the first time that a multiclassification approach has been used for cost interval forecasting. The remainder of this paper is organized as follows: Section 2 presents work related to cost interval prediction. Section 3 presents the cost interval prediction model based on multitask learning. Section 4 analyzes the impact of the factors. Section 5 presents the experimental results. Section 6 presents the conclusions.

2. Related Work

Research on cost prediction is becoming more widespread, and regression-based models are among the most widely used methods for healthcare cost prediction [22, 23]. To avoid the requirement of general linear models that data follow a normal distribution, Moran et al. performed prediction using generalized linear models [16]. Panay et al. used the evidence regression method, which groups the elements correlated with a given element into a set of similar patients and computes the overall predicted expectation for optimization [24]. Tkachenko et al. used SGTM-like neural structures for piecewise linear prediction [25]. Takeshima et al. defined explanatory variables on which regression models with the least absolute shrinkage and selection operator (LASSO) were built; the explanatory variables were selected by LASSO, with validation data used to avoid overfitting [26]. Building on regression methods, various machine learning methods have been introduced [27, 28]. Taloba et al. [17] compared the performance of Lasso linear regression, gradient-boosted regression trees, M5 regression trees, random forests, linear regression, and CART regression trees on this task and analyzed the advantages and disadvantages of each method.

Owing to properties such as end-to-end training and a good ability to fit nonlinear data, neural network methods have also been introduced into medical cost prediction alongside machine learning methods. Morid et al. compared various methods and found that the ANN (Artificial Neural Network) performed best [20]. Zeng et al. [29] used multilayer neural networks to construct unsupervised learning models that learn patient representations from medical data. The collection of medical data from mobile devices has also been extensively studied, and issues such as efficiency [5, 30] and data utility have been thoroughly considered. These studies are complementary to our work.

In general, current work on cost prediction is primarily based on patients’ natural attributes and health data, and few methods predict the costs of specific conditions during treatment. Moreover, current methods are based on simple statistical learning and neural networks, and they cannot fully exploit the value of the data contributed by patients from mobile ends.

3. Framework: A Multitask Learning Based Framework for Interpretable Medical Cost Interval Prediction

3.1. Problem Definition

Consider a patient set $P = \{p_1, p_2, \ldots, p_n\}$ containing $n$ patients, where each patient $p_i$ has a feature set $x_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,m}\}$ with one element $x_{i,j}$ for each feature dimension $j$. For example, the feature set may include natural features such as “age,” daily collected data such as heartbeat records, and related events entered by patients. These features are collected and submitted through mobile devices. The features also include the disease stage, such as “TNM-stage,” and focal features, such as “type of comorbidity,” both of which are terms used in the clinical diagnosis of breast cancer. We define a cost interval set $C = \{c_1, c_2, \ldots, c_k, c_{\text{total}}\}$, where $c_1, \ldots, c_k$ are the subcost intervals and $c_{\text{total}}$ is the total cost interval.

In this paper, we take k = 3 as an example: $c_1$, $c_2$, and $c_3$ correspond to the intervals of treatment cost, examination cost, and drug cost, respectively, and $c_{\text{total}}$ is the total cost interval. Each interval takes one of a finite number of discrete levels; the division used in the experiments is shown in Table 2.

For a given set of patient features $x_i$, the model outputs the corresponding set of cost intervals $C_i$.

3.2. A Framework for Predicting Medical Cost Intervals Based on Multitask Learning

The framework proposed in this paper consists of three components: data preprocessing, a hard sharing network for subcost interval prediction, and a total cost prediction network based on the subcost intervals. The predicted subcost intervals and the output of the hard sharing layer are used as inputs for the total cost prediction. An illustration of the framework is shown in Figure 1.

3.2.1. Data Preprocessing: Logistic Regression

Training neural networks on such data suffers from slow convergence and long training times because the association between medical data and medical cost intervals is only weakly linear. Traditional machine learning methods such as logistic regression can extract shallow associations in the data, which benefits the overall training of the framework. Therefore, in this paper, auxiliary information $a$ is computed using logistic regression before the data are fitted with a neural network. When a patient's feature set $x$ is entered, the auxiliary information is obtained as

$$a = \mathrm{LR}(x; \theta),$$

where $\theta$ is the model parameter, obtained by optimizing the loss function with an iterative method; the loss function is the likelihood function used in maximum likelihood estimation.

The auxiliary information $a$ is concatenated with the original data to obtain

$$\tilde{x} = [x; a].$$

Using $\tilde{x}$ as the input to the multitask learning hard sharing layer speeds up the convergence and training of the neural network.
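The preprocessing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a multinomial (softmax) logistic regression, fits it with plain gradient descent rather than the Newton's method mentioned in Section 5.2, and the names `fit_logistic_regression` and `add_auxiliary_features` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_logistic_regression(X, y, n_classes, lr=0.5, n_iter=500):
    """Fit softmax regression by gradient descent on the negative log-likelihood.
    (Assumption: gradient descent stands in for Newton's method here.)"""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(n_iter):
        grad = (softmax(X @ W + b) - Y) / n
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def add_auxiliary_features(X, W, b):
    """Concatenate the original features with the class probabilities."""
    return np.concatenate([X, softmax(X @ W + b)], axis=1)

# Toy data: two one-hot features, two cost-interval classes (hypothetical values).
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
y = np.array([0, 0, 1, 1])
W, b = fit_logistic_regression(X, y, n_classes=2)
X_aug = add_auxiliary_features(X, W, b)  # shape (4, 2 + 2)
```

The augmented matrix `X_aug` is what would be passed to the hard sharing layer; the last columns carry the logistic regression output as auxiliary information.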

3.2.2. Hard Sharing Network for Subcost Interval Forecasting

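The hard sharing layer consists of mini-modules and residual (ResNet) connections.

Each mini-module consists of a fully connected layer, a batch normalization layer, and an activation layer (ReLU is used as an example in this paper). Passing the input through the mini-modules of the hard sharing layer gives a hidden layer representation of the data:

$$h^{(l)} = \mathrm{ReLU}\left(\mathrm{BN}\left(W^{(l)} h^{(l-1)} + b^{(l)}\right)\right),$$

where $l$ is the index of the mini-module layer in the network, $W^{(l)}$ and $b^{(l)}$ are the weights and bias of its fully connected layer, and $h^{(0)}$ is the input to the hard sharing layer.

To prevent the performance of the network from degrading as depth increases, which happens when stacked layers fail to preserve an identity mapping, a residual network is used in this paper. A residual connection [19] is made around every two mini-modules to obtain the hidden layer:

$$h_{\mathrm{res}} = h^{(l)} + \mathcal{F}\left(h^{(l)}\right),$$

where $\mathcal{F}$ denotes the two stacked mini-modules that follow $h^{(l)}$.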

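The hidden representation $h$ produced by the hard sharing layer is put into a separate fully connected layer for each task to obtain the prediction for each subcost interval:

$$\hat{c}_j = W_j h + b_j, \quad j = 1, \ldots, k.$$

From this, the predicted value $\hat{c}_j$ of each subcost interval $c_j$ is obtained. The subcost prediction network is trained with a loss function defined over these subcost predictions.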
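A forward pass through the hard sharing network can be sketched as follows. This is a minimal NumPy illustration, assuming batch normalization without learned scale and shift, a hidden width of 16 instead of the 128 used in the experiments, and four interval classes per subcost; the helper names (`init_linear`, `mini_module`, `hard_sharing_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(d_in, d_out):
    """Random weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, 0.1, size=(d_in, d_out)), np.zeros(d_out)

def mini_module(h, W, b, eps=1e-5):
    """Fully connected layer -> batch normalization -> ReLU."""
    z = h @ W + b
    z = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return np.maximum(z, 0.0)

def hard_sharing_forward(x, w_in, blocks, heads):
    """Shared trunk with a residual connection around every two mini-modules,
    followed by one fully connected head per subcost task."""
    h = mini_module(x, *w_in)
    for (W1, b1), (W2, b2) in blocks:
        h = h + mini_module(mini_module(h, W1, b1), W2, b2)
    return h, [h @ W + b for W, b in heads]

x = rng.normal(size=(8, 10))  # 8 patients, 10 input features
w_in = init_linear(10, 16)
blocks = [(init_linear(16, 16), init_linear(16, 16)) for _ in range(2)]
heads = [init_linear(16, 4) for _ in range(3)]  # treatment, examination, drug
h, sub_logits = hard_sharing_forward(x, w_in, blocks, heads)
```

Because all three heads read the same hidden representation `h`, the trunk parameters are shared across tasks, which is what "hard sharing" refers to.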

3.2.3. Total Cost Interval Forecasting Network Based on Subcost Intervals

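The total cost interval prediction network comprises a fully connected layer and an activation layer. The predicted values of the three subcost intervals, along with the output of the hard sharing layer, are fed into the total cost prediction layer, which produces a prediction of the total cost interval as follows:

$$\hat{c}_{\text{total}} = \phi\left(W_{\text{total}} \left[h; \hat{c}_1; \hat{c}_2; \hat{c}_3\right] + b_{\text{total}}\right),$$

where $h$ is the output of the hard sharing layer, $\hat{c}_1$, $\hat{c}_2$, and $\hat{c}_3$ are the subcost interval predictions, $[\cdot\,;\cdot]$ denotes concatenation, and $\phi$ is the activation function.

From this, the predicted values of the four cost intervals, $\hat{c}_1$, $\hat{c}_2$, $\hat{c}_3$, and $\hat{c}_{\text{total}}$, are obtained.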
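The total cost layer can be illustrated with a short NumPy sketch. It assumes that the subcost predictions are passed on as class probabilities and that the activation is a softmax over five total cost classes; both choices, the array sizes, and the function name `total_cost_head` are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def total_cost_head(h, sub_logits, W, b):
    """Concatenate the shared representation with the subcost predictions,
    then apply one fully connected layer and an activation."""
    sub_probs = [softmax(z) for z in sub_logits]
    features = np.concatenate([h] + sub_probs, axis=1)
    return softmax(features @ W + b)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                             # hard sharing output
sub_logits = [rng.normal(size=(8, 4)) for _ in range(3)]  # three subcost heads
W = rng.normal(0.0, 0.1, size=(16 + 3 * 4, 5))
b = np.zeros(5)
total_probs = total_cost_head(h, sub_logits, W, b)
total_interval = total_probs.argmax(axis=1)  # predicted total cost class per patient
```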

The loss function for the overall cost is defined as follows:

It distinguish the differences between and :where and are hyper parameters. In this paper, (8) is selected as the loss function.
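As a rough illustration of a joint loss with two hyperparameters, the sketch below weights a subcost loss and a total cost loss by `alpha` and `beta`. The use of cross-entropy as the per-task loss and the weighted-sum form are assumptions made for illustration; they are not confirmed as the paper's exact definition.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

def joint_loss(sub_probs, sub_labels, total_probs, total_label, alpha=1.0, beta=1.0):
    """Weighted sum of the subcost losses and the total cost loss (assumed form)."""
    l_sub = sum(cross_entropy(p, y) for p, y in zip(sub_probs, sub_labels))
    l_total = cross_entropy(total_probs, total_label)
    return alpha * l_sub + beta * l_total

# One patient, three subcost tasks with two classes each (hypothetical numbers).
sub_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
sub_labels = [0, 1, 0]
loss = joint_loss(sub_probs, sub_labels, [0.25, 0.75], 1, alpha=0.5, beta=2.0)
```

Raising `beta` relative to `alpha` makes the optimizer prioritize the total cost interval over the subcost intervals, and vice versa.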

4. Feature Importance Analysis

Simply predicting the cost interval may confuse doctors even if high accuracy is guaranteed. Therefore, a feature-importance-based framework for explaining the cost interval prediction is further proposed in this section. The whole framework is based on an improved decision tree that involves the multiple factors considered in the prediction model. An illustration is shown in Figure 2.

A decision tree approach is used in this paper to analyze the importance of the factors, using the multitask neural network obtained in Section 3. In contrast to previous decision tree methods, which simply estimate the information gain for a single task, a method tailored to multiple tasks is designed for the information gain estimation.

Based on the weight parameters of the whole prediction network obtained from training, the weight parameters corresponding to each subcost interval are first calculated as a percentage of the weights of the total cost prediction layer, and this percentage is used as the weight in the Gini coefficient calculation for the decision tree nodes. The original Gini coefficient of a sample set $D$ is calculated as

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{K} p_i^2,$$

where $p_i$ is the proportion of samples in $D$ that belong to class $i$ and $K$ is the number of classes.

For a given matrix of patient features $X$, the total cost prediction layer applies its weight matrix to the elements of $X$, where feature elements whose subscripts share the same first index (for example, $x_{11}$ and $x_{12}$) belong to the same feature. For each feature element, the associated weights in this matrix are then converted into the weight that the element receives in the Gini coefficient calculation.

Finally, the weighted Gini coefficient is obtained by applying these weights to the original Gini coefficient.
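One possible reading of the weighting scheme is sketched below: the absolute weights that the total cost layer assigns to each subcost task are normalized into shares, and the per-task Gini coefficients are combined using those shares. The normalization by absolute value, the per-task combination, and the function names are all assumptions for illustration; only the plain `gini` function is the standard definition.

```python
from collections import Counter

def gini(labels):
    """Standard Gini coefficient: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weight_shares(weights_by_task):
    """Share of the total absolute layer weight attributed to each task (assumed scheme)."""
    mass = {task: sum(abs(w) for w in ws) for task, ws in weights_by_task.items()}
    total = sum(mass.values())
    return {task: m / total for task, m in mass.items()}

def weighted_gini(labels_by_task, shares):
    """Combine per-task Gini coefficients using the weight shares (assumed scheme)."""
    return sum(shares[task] * gini(labels) for task, labels in labels_by_task.items())

# Hypothetical layer weights and interval labels for two subcost tasks.
shares = weight_shares({"treatment": [1.0, -1.0], "drug": [2.0]})
g = weighted_gini({"treatment": [0, 0, 1, 1], "drug": [1, 1, 1, 1]}, shares)
```

With this reading, a task that the trained network relies on more heavily contributes more to the impurity of a node, so splits that purify that task are favored.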

We build the decision tree using the CART classification tree method [20]. The main idea of the method is to iteratively split the patient set so that the patients in each subset share the same value on some features. Specifically, when a feature $F$ takes the value $f$ in a sample $D$ with $|D|$ users, the sample is divided into two parts $D_1$ and $D_2$, where $D_1$ is the set of samples with $F = f$ and $D_2$ is the set of samples with $F \neq f$. The method calculates the Gini coefficient of each feature at each value, chooses the case with the smallest Gini coefficient, and uses it to generate the node, with $D_1$ and $D_2$ as the patient sets of the two child nodes. When a node’s number of samples falls below a predefined threshold, or when no features remain, splitting of the current node is terminated.
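The split selection step of CART can be sketched in plain Python. The example uses the standard, unweighted Gini coefficient and hypothetical feature names and values; it shows only how one node picks its split, not the full recursive tree construction.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(rows, labels, feature, value):
    """Gini coefficient of the binary split 'feature == value' vs 'feature != value'."""
    d1 = [y for r, y in zip(rows, labels) if r[feature] == value]
    d2 = [y for r, y in zip(rows, labels) if r[feature] != value]
    if not d1 or not d2:
        return None  # the split does not separate the sample
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_split(rows, labels):
    """Try every (feature, value) pair and keep the one with the smallest Gini."""
    best = None
    for feature in rows[0]:
        for value in sorted({r[feature] for r in rows}):
            g = split_gini(rows, labels, feature, value)
            if g is not None and (best is None or g < best[0]):
                best = (g, feature, value)
    return best

# Hypothetical encoded patients and total cost interval labels.
rows = [{"n_stage": 0, "age": 1}, {"n_stage": 0, "age": 2},
        {"n_stage": 1, "age": 1}, {"n_stage": 1, "age": 2}]
labels = [0, 0, 1, 1]
g, feature, value = best_split(rows, labels)
```

In this toy sample, splitting on `n_stage` separates the two labels completely, so it is selected over `age`.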

The method described above is used to create a decision tree. In essence, it creates a binary tree by selecting, as the node at each layer, the feature that gives the greatest Gini gain. Based on the generated decision tree, the importance of the feature at each node is calculated using the following formula:

$$I = \frac{N_t}{N}\left(G - \frac{N_{tR}}{N_t} G_R - \frac{N_{tL}}{N_t} G_L\right),$$

where $N$ is the total number of samples, $N_t$ is the number of samples at this node, $N_{tR}$ is the number of samples at the right child node, $N_{tL}$ is the number of samples at the left child node, and $G$, $G_R$, and $G_L$ are the Gini coefficients of this node and of its right and left child nodes, respectively.
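The importance calculation can be sketched as a weighted impurity decrease per node, summed per feature and normalized. This follows the common CART convention and is assumed to match the formula above; the node dictionaries and their field names are hypothetical.

```python
def node_importance(n_total, n_node, n_left, n_right, g_node, g_left, g_right):
    """Weighted decrease in Gini coefficient produced by one split."""
    return (n_node / n_total) * (
        g_node - (n_right / n_node) * g_right - (n_left / n_node) * g_left
    )

def feature_importances(nodes, n_total):
    """Sum node importances per feature and normalize them to add up to 1."""
    raw = {}
    for node in nodes:
        score = node_importance(n_total, node["n"], node["n_left"], node["n_right"],
                                node["gini"], node["gini_left"], node["gini_right"])
        raw[node["feature"]] = raw.get(node["feature"], 0.0) + score
    total = sum(raw.values())
    return {f: s / total for f, s in raw.items()} if total > 0 else raw

# Two hypothetical internal nodes of a tree built on 100 samples.
nodes = [
    {"feature": "n_stage", "n": 100, "n_left": 50, "n_right": 50,
     "gini": 0.5, "gini_left": 0.0, "gini_right": 0.0},
    {"feature": "age", "n": 50, "n_left": 30, "n_right": 20,
     "gini": 0.4, "gini_left": 0.2, "gini_right": 0.1},
]
importances = feature_importances(nodes, n_total=100)
```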

5. Experiments

5.1. Dataset

The experiments in this paper are based on a real breast cancer medical cost dataset. Links to the data demos are given at the end of this article. Patient features include age, T stage, M stage, N stage, histological classification, complication and comorbidity, and HER2 attributes. The representation of the features is shown in Table 1.

In this paper, treatment costs, examination costs, and drug costs are selected as the three subcosts to be predicted and the total costs are predicted by using these three subcosts as auxiliary information. For the different costs, the paper divides the cost intervals as shown in Table 2.
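Turning a raw cost into an interval label amounts to a lookup against the interval boundaries. The sketch below uses the standard-library `bisect` module; the boundary values are hypothetical placeholders and do not reproduce Table 2.

```python
from bisect import bisect_right

def cost_to_interval(cost, boundaries):
    """Return the index of the interval that contains `cost`.
    Interval 0 is [0, boundaries[0]), interval 1 is [boundaries[0], boundaries[1]), etc."""
    return bisect_right(boundaries, cost)

# Hypothetical boundaries for one cost type (not the values of Table 2).
drug_boundaries = [1000, 5000]
intervals = [cost_to_interval(c, drug_boundaries) for c in (500, 1000, 7000)]
```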

The experiments in this paper use one-hot coding for the representation of the data.
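One-hot coding of a patient record can be sketched as follows. The feature names and category lists in `schema` are hypothetical examples, since the actual representation is the one given in Table 1.

```python
def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def encode_patient(patient, schema):
    """Concatenate the one-hot codes of all features in schema order."""
    vector = []
    for feature, categories in schema.items():
        vector.extend(one_hot(patient[feature], categories))
    return vector

# Hypothetical schema with two of the stage features.
schema = {"t_stage": ["T1", "T2", "T3", "T4"], "m_stage": ["M0", "M1"]}
encoded = encode_patient({"t_stage": "T2", "m_stage": "M0"}, schema)
```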

5.2. Parameter Settings

The logistic regression model in this paper uses Newton’s method to optimize the loss function and norm regularization with a regularization strength of 0.5. The neural network uses ASGD as the optimizer, with an initial learning rate of 0.1 that is halved every 100 epochs, and the linear layer has a dimension of 128.
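The learning rate schedule can be written as a small step-decay function. This sketch assumes that "decreasing to 50% every 100 epochs" means multiplying the current rate by 0.5 at each 100-epoch boundary; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.5, step=100):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

schedule = [learning_rate(e) for e in (0, 99, 100, 250)]
```

In a deep learning framework, the same behavior is usually obtained from a built-in step scheduler attached to the optimizer rather than computed by hand.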

The decision tree used to analyze the importance of the influencing factors is a CART decision tree with a maximum depth of 7. The Gini coefficient is used to calculate the information gain; unlike the traditional Gini coefficient, it is modified in this paper as described in Section 4.

5.3. Experimental Results

5.3.1. Cost Interval Forecast Results

First, to verify the effectiveness of the method, SVM, decision tree, naive Bayes, logistic regression, and k-means methods were tested on the same dataset. The results are shown in Table 3. Compared with these traditional machine learning methods, the multitask learning method significantly improves the prediction accuracy for the four types of cost intervals, raising the accuracy by 5% on treatment cost, 9% on examination cost, 10% on drug cost, and 10% on total cost.

The prediction accuracy of our method is also significantly higher than that of the logistic regression-only method, which indicates that the multitask learning model effectively reduces the reliance on the linearity of the data that limits logistic regression alone. According to Figure 3, the model converges faster after logistic regression is included.

To verify the effectiveness of the framework, we test the prediction results when the network layers in the framework are replaced by traditional machine learning methods. Moreover, the accuracy of the subcost predictions is tested under different total cost prediction outcomes. According to the results in Table 4, when the total cost prediction is correct, all subcost interval predictions fail in only 21% of cases, which is much lower than for the traditional machine learning methods. When the total cost prediction is incorrect, the proportion of cases in which all subcost interval predictions fail rises by 24% compared to the correct case, which is the largest difference compared to the other cases, yet it remains lower than for the traditional methods. This indicates that the neural network method used in this paper, which can better capture the nonlinear relationship between subcosts and total costs, outperforms traditional machine learning methods.
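The conditional failure rate discussed above can be computed with a short helper. The sketch assumes that "fails in all cases" means every subcost prediction of a patient is wrong; the record layout (`total_ok`, `sub_ok`) and the toy values are hypothetical and unrelated to Table 4.

```python
def all_sub_fail_rate(records, total_correct):
    """Fraction of patients whose subcost predictions are all wrong, restricted to
    patients whose total cost prediction is correct (or incorrect)."""
    group = [r for r in records if r["total_ok"] == total_correct]
    if not group:
        return None  # no patients in this group
    return sum(1 for r in group if not any(r["sub_ok"])) / len(group)

# Hypothetical evaluation records: one entry per patient.
records = [
    {"total_ok": True,  "sub_ok": [True, True, False]},
    {"total_ok": True,  "sub_ok": [False, False, False]},
    {"total_ok": False, "sub_ok": [False, False, False]},
    {"total_ok": False, "sub_ok": [False, True, False]},
    {"total_ok": True,  "sub_ok": [True, True, True]},
    {"total_ok": True,  "sub_ok": [True, False, False]},
]
rate_when_correct = all_sub_fail_rate(records, total_correct=True)
rate_when_incorrect = all_sub_fail_rate(records, total_correct=False)
```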

Finally, to verify the robustness of the framework, the network is trained at different learning rates, and the results are shown in Figure 4. Convergence is faster at higher learning rates, but the accuracy and the loss eventually converge to the same level. This indicates that the network is stable.

5.3.2. Experimental Results on the Feature Importance

The decision trees built from the trained network are shown in Figure 5. The prediction accuracy of our decision tree for the total cost is 0.71, which is much higher than the 0.45 of the traditional decision tree method. The decision tree generated by our method also has more nodes with a Gini coefficient of 0 and a clearer decision process.

Based on the generated decision tree, the importance of the features is calculated and the results are shown in Table 5 and Figure 6. Among them, N-stage and T-stage are significantly more important than the last five features, and M-stage is significantly less important than the first six features.

6. Conclusion

This paper presents an interpretable and personalized medical cost interval prediction framework based on multitask learning over data from mobile ends. It predicts total cost intervals from the subcost intervals of the medical process, and the importance of each feature for cost interval prediction is obtained using a decision tree approach based on the trained neural network’s weight parameters. First, the framework uses a multitask learning approach to obtain the subcost intervals in the medical process and mine their correlation to exploit the value of the data. Second, the subcosts pass through a fully connected layer to predict the total cost interval. Finally, to determine the importance of patient characteristics in predicting cost intervals, the decision tree’s Gini coefficient calculation is reconstructed using the weights of the fully connected layer that predicts total costs from subcosts. Furthermore, to improve the speed of model training and convergence, the data are preprocessed using logistic regression, and a ResNet structure is used to preserve identity mapping in the network.

Data Availability

The data presented in this study are available on request from the corresponding authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Ministry of Science and Technology of Sichuan Province Program (No. 2021YFG0018, 2022YFG0038).