Abstract

Objective. Processing medical test reports has long been an important task in the biomedical informatics domain, in particular extracting effective information from reports to help doctors choose the correct treatment plan. Usual methods neglect the implicit relationships between features, and using more features is not necessarily better because additional noise arises among feature combinations. We propose a practical feature selection strategy, RMFS, which aims to select the optimal combination of features. Materials and Methods. Based on the above situation, 64 features are extracted from a real medical test report dataset for stroke, and feature selection is formulated as a reinforcement learning problem that optimizes the feature combination by minimizing regret. We select three mainstream feature selection methods and conduct comparative experiments. Results. We processed and completed a dataset derived from real medical test reports of stroke patients. We redefine the feature selection problem as a reinforcement learning problem, propose an optimization strategy based on regret minimization, and train the weight parameters in a DQN network. Experimental results demonstrate that our method can identify feature combinations with higher prediction accuracy. Discussion. RMFS shows strong robustness to the randomness of the environment and has high computational efficiency and accuracy. Compared with previous feature selection methods, our method yields superior results. Conclusion. The experimental results demonstrate that our method obtains higher prediction accuracy at the same feature scale and can match baseline performance with fewer features.

1. Introduction

Cerebral stroke, also known as apoplexy or cerebrovascular accident (CVA), is an acute cerebrovascular disease comprising ischemic stroke and hemorrhagic stroke. It is caused by the sudden rupture of a blood vessel in the brain or by a blockage that prevents blood from flowing to the brain, resulting in brain tissue damage. The incidence of ischemic stroke is higher than that of hemorrhagic stroke, accounting for 60% to 70% of all strokes. According to newly published Global Burden of Disease Study data, the number of stroke patients worldwide is estimated to exceed 100 million. In China, for example, the prevalence of stroke has shown a rapid growth trend from 1.89% since 2012, with an annual growth rate of more than 7%, according to the National Cerebrovascular Disease Data Platform. Data from the Global Burden of Disease Study show that stroke is one of the leading causes of death and disability among adults in China [1]. China is the largest developing country, with about one-fifth of the world's population, and the number of current stroke patients in China ranks first in the world. More than 20 million people around the world are at potential risk of cerebral stroke, so predicting the incidence of stroke has become a daunting task. A medical examination report (MER) includes a patient's personal data in a medical institution as well as examination data, such as identification information, drug allergy history, and medical history. MERs not only raise efficiency for doctors and healthcare professionals but also provide a valuable source of data for researchers. Current prediction of potential stroke risk relies on various clinical risk prediction models [2], such as the Framingham Stroke Risk Profile and the CHA2DS2-VASc Score [3], which take into account risk factors such as age, gender, blood pressure, diabetes, smoking, and previous history of stroke or heart disease [4]. These models are used by healthcare professionals to identify individuals who may be at high risk of experiencing a stroke and to guide preventative interventions such as lifestyle modifications or medication. However, the current methods of predicting stroke risk have several problems. One major issue is that these models may not be accurate enough in certain populations, such as younger individuals or those from different ethnic backgrounds. Additionally, there may be risk factors not yet included in these models, such as genetic factors or lifestyle factors that are difficult to measure. Another issue is that even when high-risk individuals are identified, there may be barriers to accessing preventative interventions, such as lack of resources or inadequate healthcare infrastructure.

To address these problems, further research is needed to develop more accurate and comprehensive risk prediction models and to better understand the underlying mechanisms of stroke risk. A crucial question in our research is how to improve predictive performance by learning the features of patients and diseases so as to achieve better risk control and treatment of the disease [5]. Deep reinforcement learning has been studied for this problem, for example, with attention-based mechanisms [1], but challenges remain in effectively utilizing data and interpreting models:

(1) Neglected edge information. Because of the numerous examinations in the medical domain, the data sources for predicting a single disease are relatively complex. Only key data are selected as the benchmark for model learning, because the sampling probability of edge information is low under the traditional definition, and edge information may even be abandoned during model learning. Approaches that use a graph structure to classify diseases at different levels into different types of graphs have been adopted, but they ignore the value of information such as complications for future diagnosis prediction.

(2) Optimal solution of sequential decision-making. In traditional reinforcement learning, sequential decision-making has always been one of the significant research problems. Regarding how changing the current strategy influences future reward, the paper [6] considers continuous partially observable Markov decision process (POMDP) scenarios and uses approximate solutions to infer the latent state, but it neglects the relationship between the optimal decision sequence and the environmental information.

(3) Lack of model generalization. Due to the scarcity of data, the data sources of different hospitals differ in features and distributions. It may therefore be difficult to learn an accurate model using the data of a single hospital, and feature selection is required to select common, highly important data as reference indexes. Many models do not make full use of the data, which leads to unsatisfactory results caused by the lack of generalization.

In view of the abovementioned points, we can conclude that, in the current medical environment, reinforcement learning still faces problems in disease prediction. How to choose the optimal combination of features as the input for computing the optimal decision is the problem studied in this paper. To this end, we introduce the concept of regret value [7], rank features by minimizing regret values, and learn the optimal combination of features with DQN. This paper makes the following major contributions:

(i) We redefine feature selection as a reinforcement learning problem, propose an optimization strategy based on regret minimization, and train the weight parameters in a DQN network.

(ii) We process and complete a dataset about stroke, which is derived from real medical test reports of stroke patients, and the experimental work in this paper is also completed on this dataset. The process is shown in Figure 1.

(iii) We selected three mainstream feature selection methods for comparison in the experiment, and the experimental results demonstrate that our method can find feature combinations with higher prediction accuracy.

2. Related Work

Recent work [8, 9] suggests that reinforcement learning has a wide range of applications in medical information processing [10]. When selecting features from medical test reports, extracting feature combinations and learning selection strategies are the two important tasks in this process for different prediction scenarios. For brevity, we discuss only the medical reinforcement learning literature relevant to our work [11], which can be roughly divided into three categories.

Feature selection can effectively prepare high-dimensional data for various learning tasks such as classification, clustering, and anomaly detection. In healthcare [12], we need to capture patient heterogeneity for personalized predictive modeling, which can be characterized by a subset of instance-specific features. Reference [13] proposed a novel unsupervised personalized feature selection (UPFS) framework to find features shared by all instances and features unique to individual instances. Feature selection can also be applied to case diagnosis; the authors in [14] explored a nonnegative generalized fusion lasso model for stable feature selection in the diagnosis of Alzheimer's disease. As technology advances, artificial intelligence (AI) models are becoming critical in the medical domain, and the ability to interpret predictions for clinical end users is essential to harness the power of AI models for clinical decision support. Reference [15] extracted more information from the predictor through an information calibration method and used an adversarial-based technique to calibrate the information extracted by the two models.

Feature sources in medical scenes are usually multimodal data of text or images. For medical images [16, 17], feature attribution maps or heat maps are the most common form of interpretation. The Modality-Specific Feature Importance (MSFI) index [18] encodes clinical requirements for prioritizing and localizing specific features within modalities. The study demonstrated that the results produced by MSFI satisfy clinicians' needs for multimodal interpretation. The authors in [19] described the application of deep learning to multimodal medical imaging analysis.

Reinforcement learning can be used to analyze medical imaging reports and improve accuracy [20], where different modalities of image information have their own characteristics and differ in contrast and resolution due to different imaging principles. Integrating reinforcement learning with MR image manipulation can reconstruct damaged images [21]. Reference [8] proposed and optimized the Stochastic Planner-Actor-Critic (SPAC) method for medical image alignment. Non-independent and identically distributed (non-iid) data in medical images remain a prominent challenge in real practice. Reference [22] proposed a framework, HarmoFL, in which perturbations help the global model converge to an optimal solution by aggregating multiple locally flat optima without additional communication cost [23]. Low-resource medical dialogue generation [24] used a general knowledge graph to characterize the relationships between prior symptoms of a disease. Model-based reinforcement learning can be applied to biological sequence settings: DyNA-PPO, a model-based PPO variant proposed in the paper [25], Model-based Reinforcement Learning for Biological Sequence Design, performed well in this setting. Off-policy evaluation in reinforcement learning makes it feasible to use observational data to improve the future medical and educational fields. Gottesman et al. [26] introduced a method, framed as a hybrid artificial intelligence system, that enables human experts to analyze the accuracy of policy evaluation.

In summary, feature selection is an important technique for preparing high-dimensional healthcare data for various learning tasks. Personalized predictive modeling requires capturing patient heterogeneity using a subset of instance-specific features, and interpreting predictions for clinical end users is essential for clinical decision support. Multimodal text and image data are common in medical scenes, and deep learning techniques can be applied to analyze them. Reinforcement learning can be used to improve accuracy in medical imaging analysis, taking into account the different modalities of image information. Non-iid data in medical images remain a challenge, and off-policy evaluation in reinforcement learning [27] makes it feasible to use observational data to improve the future of the medical and educational fields.

3. Preliminaries

In this section, we will introduce some preliminary knowledge for the work in this paper.

3.1. Markov Decision Process (MDP)

Let us assume a standard reinforcement learning scenario, where the goal is to learn a policy that maximizes the expected cumulative discounted reward in a Markov decision process [28], defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$. $\mathcal{S}$ denotes the state space, and $\mathcal{A}$ represents the set of actions. $\mathcal{P}$ governs the state transition, and $r_t = \mathcal{R}(s_t, a_t)$ is the reward gained during the state transition, where $\mathcal{R}$ is the reward function. $\gamma$ is the discount factor. $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ represents the probability of taking action $a_t$ in state $s_t$ and transferring to the next state $s_{t+1}$, where $a_t = \pi(s_t)$ is the action selected by the policy $\pi$ in the current state transition. The experience replay buffer used by the off-policy agent is denoted as $\mathcal{B}$. At each time step $t$, the agent interacts with the environment and stores the transition $\tau_t = (s_t, a_t, r_t, s_{t+1})$ into $\mathcal{B}$ [29], where $\tau_i$ denotes the transition at position $i$ in the buffer. Next, the agent uses mini-batches obtained by sampling from $\mathcal{B}$ to update the policy at each training step. Based on the notation described above, the offline replay policy learning problem is redefined as follows.

Let $\mathcal{T}$ be the task, $\Lambda$ the off-policy agent, and $\mathcal{B}$ the experience replay buffer. The goal is to learn a replay policy $\phi$ that, at each training step, samples a batch of transitions from $\mathcal{B}$ to train the agent $\Lambda$; that is, to learn a mapping $\phi: \mathcal{B} \rightarrow \{\tau_1, \dots, \tau_m\}$ in order to train the agent to obtain better performance on task $\mathcal{T}$.
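To make the buffer interaction described above concrete, the following minimal Python sketch stores transitions and draws uniform mini-batches for off-policy updates; the class and method names are illustrative rather than part of the original formulation.

```python
import random
from collections import deque, namedtuple

# A transition as defined above: (s_t, a_t, r_t, s_{t+1}).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Fixed-capacity experience replay buffer B; the oldest transitions are discarded first."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def store(self, state, action, reward, next_state):
        # Store the transition encountered at the current time step.
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size=16):
        # Uniformly sample a mini-batch used to update the agent.
        return random.sample(list(self.buffer), batch_size)
```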

3.2. Deep Q-Networks (DQN)

We consider the standard reinforcement learning paradigm of an agent interacting with an environment, and for convenience of exposition we assume the environment is fully observable. Deep Q-Network (DQN) [30] is a model-free RL algorithm for discrete action spaces. DQN maintains a neural network $Q_\theta$ that approximates the optimal action-value function $Q^*(s, a)$. The behavior policy is $\epsilon$-greedy with respect to $Q_\theta$: with probability $1 - \epsilon$ it takes the greedy action $\arg\max_{a} Q_\theta(s, a)$, and with probability $\epsilon$ it takes a random action uniformly sampled from $\mathcal{A}$.

During training, we generate episodes using the $\epsilon$-greedy policy induced by the current approximation of the action-value function. The transition tuples $(s_t, a_t, r_t, s_{t+1})$ encountered during training are stored in the replay buffer, and the generation of new episodes is interleaved with neural network training. The network is trained with mini-batch gradient descent on the loss $L(\theta) = \left(Q_\theta(s_t, a_t) - y_t\right)^2$ so that the approximate Q-function satisfies the Bellman equation, where the target is $y_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a')$ and the tuple $(s_t, a_t, r_t, s_{t+1})$ is sampled from the replay buffer.

The targets are usually computed using a separate target network in order to make the optimization process more stable; the target network changes at a slower rate than the main network. It is common either to regularly set the weights of the target network to the weights of the main network (e.g., [30]) or to use a Polyak and Juditsky [31] averaged version of the main network [32].
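A minimal PyTorch sketch of this target computation and of the two common target-network update rules is shown below; the function names and tensor layouts are assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net: nn.Module, target_net: nn.Module, batch, gamma: float = 0.99):
    """Squared Bellman error with a separate, slowly changing target network."""
    states, actions, rewards, next_states = batch          # tensors drawn from the replay buffer
    # Q(s_t, a_t) under the main network.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapping target r_t + gamma * max_a' Q_target(s_{t+1}, a').
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    return nn.functional.mse_loss(q_values, targets)

def sync_target(q_net: nn.Module, target_net: nn.Module, tau: float = 1.0):
    """tau = 1.0 copies the main network's weights; tau < 1 gives a Polyak-averaged update."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```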

3.3. Regret Minimization

The regret value is an important tool for computing approximate Nash equilibria [33]. The most widely used approach in extensive-form games is to minimize the regret value as much as possible so as to approach an approximate Nash equilibrium [34]. Based on the concepts of the MDP, the cumulative regret of player $i$ after $T$ rounds is formally defined as follows:

$$R_i^T = \max_{\sigma_i^{*}} \sum_{t=1}^{T} \left( u_i(\sigma_i^{*}, \sigma_{-i}^{t}) - u_i(\sigma^{t}) \right).$$

Here, we identify the action sequence in the MDP with a history $h$. Suppose that player $i$ replaces the actual policy $\sigma^t$ with policy $\sigma_i^{*}$; the extra payoff generated by the new policy over the original policy is the regret value. In particular, the reward in the regret value can be any mapping from the legal action set to the real numbers $\mathbb{R}$. The regret can be minimized as long as the total cumulative regret grows sublinearly in $T$. When the regret values of all actions are sufficiently small, we can consider our policy close enough to a Nash equilibrium to solve the problem. Here, we present the procedure for updating the policy using regret values. When policy $\sigma$ is adopted, the counterfactual value of the corresponding action sequence $h$ is calculated as follows:

$$v_i(\sigma, h) = \sum_{z \in Z,\, h \sqsubseteq z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, z)\, u_i(z).$$

We first calculate the probability $\pi_{-i}^{\sigma}(h)$ that the other players produce the action sequence $h$, multiply it by the probability $\pi^{\sigma}(h, z)$ of reaching the terminal situation $z$ from $h$ under this policy, and finally multiply by the payoff $u_i(z)$ of player $i$ at the terminal situation. After iterating over all terminal situations, we sum the products. Therefore, when taking action $a$, the counterfactual regret obtained by player $i$ is $r_i^t(I, a) = v_i(\sigma^t_{I \to a}, I) - v_i(\sigma^t, I)$, where $I$ is the information set corresponding to the action sequence $h$. The cumulative regret of player $i$ for taking action $a$ over $T$ rounds is $R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$. Negative regret is not considered, which is denoted as $R_i^{T,+}(I, a) = \max\left(R_i^T(I, a), 0\right)$. In round $T+1$, the probability of player $i$ choosing action $a$ is calculated as follows:

$$\sigma_i^{T+1}(I, a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0, \\[2ex] \dfrac{1}{|A(I)|}, & \text{otherwise;} \end{cases}$$

that is, the player chooses the next action according to the positive regrets, and if no action has positive regret, one action is randomly selected for the game.
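The regret-matching update above can be sketched in a few lines of Python; representing the cumulative regrets of an information set as a flat array with one entry per action is an illustrative assumption.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Derive the next-round policy from cumulative regrets R^T(I, a).

    Negative regrets are clipped to zero; if no action has positive regret,
    an action is chosen uniformly at random, as described above.
    """
    regret = np.asarray(cumulative_regret, dtype=float)
    positive = np.maximum(regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total                     # play proportionally to positive regret
    return np.full(len(regret), 1.0 / len(regret))  # uniform fallback

# Example: the action with the largest positive regret gets the largest probability.
print(regret_matching([2.0, -1.0, 1.0]))  # -> [0.666..., 0.0, 0.333...]
```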

4. Optimal Feature Selection Strategy via Regret Minimization

We propose a deep reinforcement learning feature selection algorithm based on regret minimization, called Regret Minimization Feature Selection (RMFS), to learn the optimal feature combination. RMFS captures the data dependencies between features, enhances features through the reward changes between action sequences, and updates the buffer by minimizing regret to improve policy learning, as shown in Figure 2. Theoretically, the ideal sampling policy is to sample the transitions with higher value, from which methods such as uniform sampling and priority sampling are derived. In general, the default policy is uniform sampling, which neglects the relative significance of experiences. The regret minimization framework proposed in this paper increases the sampling probability of low-reward samples, because we believe that targeted optimization of transitions with small immediate reward is essential to improve the performance of the policy. We use the immediate reward in a transition as a reasonable proxy, so that such states are sampled frequently and the action in the transition is updated to improve the immediate reward. The off-policy algorithm uses deep neural networks as value function approximators and stores past experience in the buffer to compute update gradients. We assume that $\tau = (s_t, a_t, r_t, s_{t+1})$ is a transition in the buffer and define a reward-priority function $p(\tau)$ based on the immediate reward as the sampling policy: the smaller the value of $r_t$, the higher the probability that $\tau$ is replayed.
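One possible realization of such a reward-priority sampling rule is sketched below; the softmax over negated immediate rewards and the temperature parameter are assumptions for illustration, since the paper does not fix a specific functional form for $p(\tau)$.

```python
import numpy as np

def replay_probabilities(rewards, temperature=1.0):
    """Illustrative reward-priority rule: transitions with smaller
    immediate reward receive a higher replay probability."""
    scores = -np.asarray(rewards, dtype=float) / temperature
    scores -= scores.max()            # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# The transition with reward 0.1 is replayed most often.
probs = replay_probabilities([0.1, 0.5, 0.9])
```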

In common supervised learning, the training data are assumed to be independent and identically distributed, and one or several samples are randomly drawn from the training data for gradient descent each time the neural network is trained. As learning continues, each transition is used several times. Based on the original Q-learning, a replay buffer is maintained and data are randomly sampled from it to train the Q-network, which makes the samples better satisfy the independence assumption: the data obtained by interactive sampling in an MDP do not satisfy the independence assumption by themselves, since $s_{t+1}$ is related to $s_t$. Non-independently distributed data have a great influence on the training of a neural network, causing it to overfit the latest data. Experience replay (ER) can break the sample correlation, make the data approximately satisfy the independence assumption, and improve sample efficiency. In the deep Q-network (DQN) algorithm [30], a deep neural network is used to approximate the optimal value function

$$Q^{*}(s, a) = \max_{\pi} \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a,\ \pi\right]$$

after experiencing state $s$ and taking action $a$. The deep Q-network $Q_\theta$ is parameterized by a deep neural network with parameters $\theta$. During training, the DQN agent stores its experience $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer at each time step $t$, which keeps the most recent one million transitions. When performing an update, small batches of experiences are sampled uniformly from the replay buffer to optimize the deep Q-network with stochastic gradient descent by minimizing the loss

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B}}\left[\left(y - Q_\theta(s, a)\right)^2\right], \qquad y = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a'),$$

where $y$ represents the bootstrapping target and $\theta^{-}$ denotes the parameters of the target network $Q_{\theta^{-}}$, which is a periodic copy of the deep Q-network. Owing to the advantages of combining deep RL with experience replay, DQN and its variants [35] demonstrate excellent performance on our dataset. The specific procedure is given in Algorithm 1.

(1) Initialize the feature set as $\Phi = \varnothing$.
(2) Calculate the prediction accuracy of each single feature.
(3) Initialize replay memory $\mathcal{B}$ to capacity $N$.
(4) Initialize the action-value function $Q$ with random weights.
(5) for episode $= 1, M$ do
(6)  Initialize sequence $s_1 = \{\Phi\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$.
(7)  for $t = 1, T$ do
(8)   Select a feature to add to $\Phi$ as action $a_t$.
(9)   Execute action $a_t$ in the prediction task and observe reward $r_t$.
(10)   Set $s_{t+1} = s_t, a_t$ and $\phi_{t+1} = \phi(s_{t+1})$.
(11)   Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $\mathcal{B}$.
(12)   Sample a random minibatch of transitions
(13)    $(\phi_j, a_j, r_j, \phi_{j+1})$ from $\mathcal{B}$ and perform a gradient step on $Q$.
(14)  end for
(15) end for
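The Python sketch below renders one episode of Algorithm 1 under stated assumptions: predict_accuracy, agent.choose, and agent.update are hypothetical interfaces, the reward is taken to be the gain in prediction accuracy, and the buffer is the ReplayBuffer sketched in Section 3.1.

```python
def rmfs_episode(all_features, predict_accuracy, agent, buffer, max_steps=10, batch_size=16):
    """One episode of Algorithm 1: grow a feature combination action by action.

    predict_accuracy(features) is assumed to train/evaluate a classifier on the
    given feature subset and return its accuracy; the reward is the accuracy gain.
    agent.choose and agent.update are assumed DQN-agent interfaces.
    """
    selected = []                                      # line (6): start from an empty combination
    prev_acc = 0.0
    for _ in range(max_steps):
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        action = agent.choose(selected, candidates)    # line (8): pick one feature as a_t
        next_selected = selected + [action]
        acc = predict_accuracy(next_selected)          # line (9): run the prediction task
        reward = acc - prev_acc                        #           assumed reward: accuracy gain
        buffer.store(tuple(selected), action, reward, tuple(next_selected))   # line (11)
        if len(buffer) >= batch_size:
            agent.update(buffer.sample(batch_size))    # lines (12)-(13): minibatch update
        selected, prev_acc = next_selected, acc
    return selected, prev_acc
```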

5. Experiment

5.1. Experimental Setting and Baseline

In this section, our study was based on data from a hospital's Brain Infarction Screening Program for high-risk populations. The data mainly include demographic information, medical history information, personal history, family history information, and blood index information. In order to better analyze the risk factors of stroke, we fully consider three aspects in the data preprocessing stage: (1) how to fill in missing data, (2) how to handle categorical features, and (3) how to handle continuous features. Afterwards, we obtained 64 features for each of the 6527 patient samples. The three aspects of data preprocessing are described in detail as follows (a code sketch of these steps is given after the list):

(i) Filling in missing data: Because our study was based on regular follow-up of community residents, residents could drop out or be lost to follow-up, resulting in data loss. In the original dataset, most attribute values are greater than or equal to 0, so we uniformly fill the missing values with -1, which makes it easier to distinguish missing values from normal values.

(ii) Categorical feature processing: We adopted one-hot encoding for categorical features whose values have no ordinal relationship to their actual meaning (such as PayStyle and Job) to capture the effect of different attributes on stroke; this makes the data distribution sparser and expands the feature space.

(iii) Continuous feature processing: In order to simplify the model and reduce the risk of overfitting, some continuous features such as age and height are discretized. We map values from different intervals to different buckets.
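A rough pandas sketch of these three steps follows; the column names and bucket boundaries are illustrative, and discretization is applied before filling so that missing ages also end up as -1.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # (iii) Continuous features: discretize, e.g., age into buckets (boundaries are illustrative).
    df["Age_bucket"] = pd.cut(df["Age"], bins=[0, 30, 40, 50, 60, 120], labels=False)
    df = df.drop(columns=["Age"])

    # (ii) Categorical features: one-hot encode nominal attributes such as PayStyle and Job.
    df = pd.get_dummies(df, columns=["PayStyle", "Job"])

    # (i) Missing values: fill with -1 so they stay distinguishable from normal values (>= 0).
    return df.fillna(-1)
```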

In order to make a fair comparison and demonstrate the effectiveness of our algorithm, we use the following common feature selection methods for comparison (a code sketch follows the list):

(i) Chi-square test: a hypothesis test commonly used in probability theory and mathematical statistics, which measures the dependence between two variables.

(ii) F-test: a hypothesis test based on the F-distribution, applied here to capture the linear relationship between each feature and the label.

(iii) Mutual information: a measure of the mutual dependence between two random variables.
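All three baselines correspond to scoring functions available in scikit-learn, so a minimal sketch of their use (with hypothetical variables X, y, and k) is:

```python
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

def baseline_selections(X, y, k=10):
    """Return the indices of the top-k features chosen by each baseline.

    Note: chi2 requires non-negative feature values, so any -1 placeholders
    would need to be shifted or re-encoded before using it.
    """
    selectors = {
        "chi-square test": SelectKBest(chi2, k=k),
        "F-test": SelectKBest(f_classif, k=k),
        "mutual information": SelectKBest(mutual_info_classif, k=k),
    }
    return {name: sel.fit(X, y).get_support(indices=True) for name, sel in selectors.items()}
```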

We use DQN to explore feature selection. In DQN, the buffer size of the experience pool is set to ten thousand and the batch size is set to 16. The network structure of DQN is an MLP with a single hidden layer, the target network is updated once every 100 training steps, and training runs for a total of 1000 epochs. In the experiments, the learning rate of the DQN network was set to 0.01 or 0.001, and the reward discount factor was set to 0.9 or 0.99. The parameters of the chi-square test, F-test, and mutual information methods are set the same as for DQN.
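For clarity, the hyperparameters listed above can be summarized as the following configuration sketch (the key names are illustrative):

```python
# Hyperparameters reported above; learning rate and discount factor form a small grid.
DQN_CONFIG = {
    "buffer_size": 10_000,        # experience replay capacity
    "batch_size": 16,
    "hidden_layers": 1,           # single-hidden-layer MLP
    "target_update_every": 100,   # training steps between target-network updates
    "epochs": 1000,
    "learning_rates": [0.01, 0.001],
    "discount_factors": [0.9, 0.99],
}
```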

5.2. Performance Comparison and Effect of Parameters

In Figure 3, the abscissa represents the number of selected features and the ordinate represents the accuracy obtained with the selected features on the test set. From the experimental results, our DQN-based method achieves better results than the other three methods: across the learning rates and reward discount factors we tried, the finally selected features reach the highest accuracy on the test set, and in the cases where the accuracy is the same, our approach requires fewer features. In Figure 4, we also provide a performance comparison of these algorithms in terms of F1 score, precision, and recall. It can be observed that, due to the complexity of the F1 method, it exhibits significantly higher computational resource consumption than the other methods, which may result in certain performance advantages; however, this does not align with practical requirements. In contrast, our method achieves a more balanced trade-off between performance and resource consumption.

It can be observed from Table 1 that our method achieves higher accuracy with fewer features than the other three methods; in particular, the DQN configuration with learning rate 0.01 and reward discount factor 0.9 selects nearly half as many features, which shows that DQN-based feature selection achieves very good results. At the same time, it can also be found that in this experiment, the smaller the learning rate, the higher the final accuracy, indicating that the DQN network learned better experience fully and stably. However, the smaller the reward discount factor, the more the DQN network focuses on the immediate reward and the fewer features it selects, but the highest achievable accuracy is lower.

The specific computation time of different models is influenced by multiple factors such as the size of the dataset, the number of features, and their complexity. Taking the computation time of the chi-square test as 1, the computation time of the F-test is approximately 1.2 to 1.5 times that of the chi-square test, while that of mutual information is in the range of 1.5 to 2 times.

We fixed the number of selected features and observed the highest accuracy achieved by the different methods for a given number of features. As can be seen from the table, the accuracy of our method is higher than that of the other three methods in almost all cases, especially when the number of features is 30. This indicates that our method can not only select the optimal number and combination of features but also obtain higher accuracy when the number of features to be selected is fixed.

In our work, we listed the feature combinations selected by each method when the number of features to be selected was 1, 2, 3, and 5, respectively, and counted the frequency of each feature to analyze its importance in Table 2. The top five features, from most to least important, were the following:

(i) AcPayStyle. This is the most important feature in this experiment, showing that a large proportion of stroke patients are reimbursed through rural cooperative medical insurance, indicating that the prevalence, incidence, and mortality of stroke in rural residents are significantly higher than those in urban residents.

(ii) DfHypertension. This is the second most important feature affecting the experimental results, which is also in line with the prior knowledge of modern medicine. According to statistics, 70% to 80% of stroke patients have high blood pressure, and hypertension increases the risk of stroke.

(iii) Age_4. This is the third most important feature in the experiment, representing people between the ages of 40 and 50. Summaries of stroke data in China also find that the stroke population has tended to become younger over the past 40 years, and the experimental results reflect this fact to a certain extent.

(iv) AcJob. This is another important feature affecting the experimental results: it has been shown that people who engage in high-intensity mental work for a long time have a significantly higher incidence of high blood pressure, an important risk factor for stroke, than the average manual worker.

(v) DfSportsLack. This feature also plays a role in the results and is related to the patient's lifestyle. A chronic lack of exercise causes fat and cholesterol to stick to vessel walls, which narrows the vessels and slows blood flow; over time, such blockages can increase the risk of stroke.

6. Conclusion and Future Work

In this paper, we first review existing feature selection methods and point out that they may ignore the relationships between features. Motivated by this issue, we analyze the feature selection strategy from the perspective of regret minimization, model feature selection as a reinforcement learning problem, and train the optimal feature combinations with DQN. Based on the theoretical analysis, we propose a practical feature selection strategy, RMFS, which aims to select the optimal combination of features. RMFS shows strong robustness to the randomness of the environment and has high computational efficiency and accuracy. Compared with previous feature selection methods, our method yields superior results. In future work, we will extend our framework and attempt to adjust the buffer size in different training phases, since our framework is general. In addition, we will further investigate the importance and validity of features such as proxy signals.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper was supported by the National Natural Science Foundation of China (Grant nos. 62192783 and 62376117) and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.