Abstract

Diseases can have a huge impact on the quality of life of the human population, and humans have always sought strategies to avoid illnesses that are life-threatening or degrade quality of life. Effective use of the resources available to control different diseases has therefore always been critical. With the overwhelming popularity of deep learning, researchers have recently become more interested in AI-based solutions for protecting the human population from disease. Many supervised techniques have been applied to disease diagnosis. However, the main problem with supervised solutions is the availability of data, which is not always possible or complete. For instance, we do not have enough data describing the different states of humans and their environments, and how the different actions taken by humans or pathogens ultimately result in a disease that eventually takes human lives. Therefore, there is a need for unsupervised solutions, or for techniques that do not depend on an underlying dataset. In this paper, we explore the reinforcement learning approach. We apply different reinforcement learning algorithms to investigate solutions for the prevention of disease in a simulated human population, and we explore different techniques for controlling the transmission of disease and its effects on health. Our algorithms find the policies that best allow the simulated human population to protect itself from the transmission and infection of malaria. The paper concludes that deep learning-based algorithms such as Deep Deterministic Policy Gradient (DDPG) outperform traditional algorithms such as Q-Learning and SARSA.

1. Introduction

Different types of diseases such as malaria, flu, dengue, and HIV have a huge impact on the quality of life of the human population [1–3]. Considering malaria alone, according to the World Health Organization approximately 3.2 billion people are at risk of malaria, and 217 and 219 million malaria cases were reported in 2016 and 2017, respectively, which shows an increase in malaria cases in recent years [4]. Therefore, effective use of resources to bring malaria under control has been critical. Insecticide-Treated Nets (ITNs) are the primary method of malaria prevention [5] because the Anopheles mosquito, which transmits malaria, typically bites after 9 p.m.; when a mosquito lands on the net, it dies due to the insecticide, which helps break the transmission cycle. In addition to ITNs, other malaria preventive policies include Indoor Residual Spraying (IRS) [6], larvicide in bodies of water [7], and malaria vaccination [8–11].

Machine learning algorithms have been applied in many domains and have made tremendous progress [12], and the healthcare sector in particular has been influenced by machine learning [13–15] in the past few years. These machine learning algorithms focus on the diagnosis of diseases [16] or on forecasting future outcomes [17], whereas the treatment of diseases remains comparatively unexplored [18]. Diagnosing a disease is an important step toward treating it, and machine learning techniques can support healthcare professionals in treatment to some extent, but finding the best policy to treat patients has remained a challenging problem for medical professionals [19]. Recently, reinforcement learning (RL) [20] has gained much popularity in video games [21–23], where the agent learns good and bad actions through interactions with the environment and the environment's responses. RL has performed very well in the context of games, but limited progress has been made in real-world domains such as health care. In games such as Go (AlphaGo) and StarCraft, the agent performs a large number of actions in the environment and learns the optimal policy. In health care, however, it is considered unethical to use humans to train RL algorithms, and such a process would also be costly and take years to complete. Moreover, we are not able to observe everything happening in the body of a person: we can measure blood pressure, temperature, and some other quantities at different intervals of time, but these measurements do not represent the complete state of a patient. Similarly, the data collected in health care about patients may exist at one time and be missing at others. For example, chest X-rays used in the treatment of pneumonia [24] are collected before a person is infected and after the person is cured, yet an RL model needs estimates of all the states the patient goes through. This is very challenging in health care, where many facts about patients are unknown at many time steps.

The reward function is one of the most important elements of RL, and in many real-world applications it is challenging to find a good reward function. In health care, it is even more challenging to find a reward function that balances short-term success against overall long-term improvement. For example, in the case of sepsis [25], improvements in blood pressure over short durations may not translate into overall success. Similarly, having only a single high reward at the end of an episode (i.e., survived or died) means that a long trajectory must be followed without intermediate rewards [26, 27], and it is difficult to know which actions led to the reward and which led to the penalty. The major breakthroughs in deep RL were made possible by using simulated data equivalent to many years of actual experience [28]. Generating data through simulators is not a problem in games, but in health care it is generally not possible to simulate the treatment of different diseases. The available data are usually too scarce to train supervised learning, and the data that do exist require considerable effort to annotate. Furthermore, hospitals are often unwilling to share patient data, mainly for privacy reasons. All these facts make the application of deep RL to health care even more challenging.

Health care data are nonstationary and dynamic by nature [29]. For example, patients' symptoms may be recorded at different intervals of time, and different records may be stored for different patients. Over time, the objectives of treatment may also change. In the literature, different studies [30–32] focus on reducing overall mortality; when the condition of a patient improves, the focus shifts to a different objective, such as the duration for which the virus stays in the body. Similarly, viruses or infections may change rapidly and evolve with different dynamics [33–35] that are most probably not observed in the training data used for supervised or semisupervised learning algorithms. Decision-making in medical diagnosis is inherently sequential [36, 37]: a patient visits a health care centre for the treatment of a disease, and the doctor, based on previous experience, decides on a treatment to follow. When the patient later returns to the same doctor, the previously suggested treatment determines the current state of the patient and also informs the doctor's next decision. Existing state-of-the-art AI strategies for disease treatment [38, 39] ignore this sequential nature of the decisions [40] and make decisions on the basis of the present state of the patient alone. The sequential nature of medical treatment can be effectively modelled as a Markov Decision Process (MDP) [41–44] and is better solved through RL, since RL algorithms consider not only the instantaneous outcomes of treatment but also the long-term benefit to the patient [45].

This paper systematically explores interventions to avoid malaria. It demonstrates a real-world example of reinforcement learning, in which simulated humans are trained to learn an effective technique for avoiding malaria. In the literature, AI techniques are used for prediction, diagnosis, and healthcare planning, but this paper takes a different approach by simulating an environment in which simulated humans use different reinforcement learning techniques to avoid malaria. A combination of interventions is explored to control the transmission of malaria and to learn techniques for malaria avoidance.

The paper is organized as follows: the related works are explained in Section 2. The problem of malaria avoidance and the methodology of reinforcement learning are given in Section 3. Experiments are performed, and their results are analysed in Section 4. Concluding remarks of the paper are given in Section 5.

2. Related Works

Recent advancements in machine learning and big data have motivated researchers in different domains to apply these algorithms to their problems. Biomedical and health care researchers benefit from these algorithms in early disease recognition, community services, and patient care. In [46], machine learning and MapReduce algorithms are used to effectively predict different diseases in disease-frequent societies; the paper reports 94.8% accuracy and faster convergence than CNN (Convolutional Neural Network) based algorithms. Similarly, deep learning and big data techniques have been used in [47] to predict infectious diseases. The authors combine a Deep Neural Network (DNN) and Long Short-Term Memory (LSTM) and compare the performance with an Autoregressive Integrated Moving Average (ARIMA) model in predicting different diseases one week into the future, achieving better results than ARIMA. Automatic diagnosis of malaria enables reliable health care services in areas where resources are limited, and machine learning techniques have been investigated for automating malaria detection. In [48], malaria classification is performed using a CNN, and in [49], a CNN is likewise used for malaria detection and demonstrates promising accuracy. Deep reinforcement learning (DRL) has recently attained remarkable success, notably in complex games like Atari, Go, and Chess; these achievements are mainly possible because of powerful function approximation with DNNs. DRL has also proved to be an effective method in the medical context, and several applications of RL have been found in medicine. For instance, RL methods have been used to develop treatment strategies for epilepsy [50] and lung cancer [51]. In [25], the authors use a sepsis cohort that is a subset of the MIMIC-III dataset. An action space consisting of vasopressors and IV fluids is selected, with the varying amounts of each drug grouped into four bins, and a Double Deep Q-Network is used for learning. The SOFA score, which measures organ failure, is used in the reward function. For evaluation, the mortality rate is examined as a function of the difference between the dosage prescribed by the policy and the dosage actually administered, producing a U-shaped curve.

In [19], DRL is used to develop a framework that predicts an optimal strategy for Dynamic Treatment Regimes using medical data. The authors claim that their model is more flexible and adaptive in high dimensional action and state spaces than other RL based approaches. The framework models real-world complexity to help doctors and patients make personalized decisions about treatment choices and disease progression, combining supervised learning and DRL using DNNs. The dataset is taken from the registry of the Centre for International Bone Marrow Transplant Research (CIBMTR). The framework achieves promising accuracy in predicting human doctors' decisions while at the same time obtaining a high reward. In [52], an RL system is developed that helps diabetes patients engage in physical activity. Messages sent to patients were personalized, and the results demonstrated that participants receiving messages generated by the RL algorithm increased their number of physical activities and their walking speed. Supervised RL with a recurrent neural network (SRL-RNN) is combined in a framework to make treatment recommendations by Wang et al. in [53]; experiments conducted on the MIMIC-III dataset demonstrate that the RL based framework can reduce estimated mortality while providing promising accuracy in matching doctors' prescriptions. In [54], the authors describe a novel technique that uses RL to find an optimal policy for treating patients with chemotherapy. The authors use Q-Learning, and for the action space, the doses an agent can choose from in a given time period are quantized. Dosing cycles are initiated at a frequency determined by an expert, and transition states are compared at the end of each cycle. The mean reduction in tumour diameter determines the reward function, and simulated clinical trials are used to evaluate the algorithm.

In [55], the authors take a different approach that uses RL techniques to encourage healthy habits instead of direct treatment. In [56], the authors also focus on sepsis and RL but take a different approach, using RL techniques for glycemic control. In [57], the authors focus on counterfactual inference and domain adversarial neural networks. Decision-making under uncertainty is a complicated problem, and health care practitioners work under challenging constraints with limited tools for making data-driven decisions. In [58], the authors formulate the search for an optimal malaria policy as a stochastic multiarmed bandit problem and develop three agent-based strategies to explore the space of policies. Gaussian Process regression is applied to the findings of each agent to compress the stochastic results of simulating the spread of malaria in a fixed population, and the policies generated by the simulation are compared with those of human experts in the field for direct reference. In [59], the authors expose subtleties associated with evaluating RL algorithms in health care, focusing on the observational setting in which an RL algorithm proposes a treatment policy that is then evaluated on historical data. A survey in [60] discusses different applications of reinforcement learning in health care and provides a systematic understanding of theoretical foundations, methods and techniques, challenges, and new insights into emerging directions. A context-aware hierarchical RL scheme [61] has been shown to significantly improve the accuracy of symptom checking over traditional systems while reducing the number of inquiries. Another study that introduces the basic concepts of RL and how RL could be used effectively in health care is given in [62].

Policies for malaria control using reinforcement learning are explained in [63, 64]. The authors apply Genetic Algorithms [65], Bayesian Optimization [66], and Q-Learning with sequence breaking to search for an optimal policy over a multiyear period, and their experiments show the best performance from the Q-Learning algorithm. A systematic review of agent-based models for malaria transmission is given in [67]; the paper covers an extensive array of topics spanning the spectrum of malaria transmission and intervention. Machine learning algorithms for the prediction of different diseases are studied in [68], where the authors use Decision Tree and MapReduce algorithms and claim to achieve 94.8% accuracy. Machine learning has also been used to automatically diagnose malaria in [69], with deep Convolutional Neural Networks used for classification. The authors in [70] discuss AI safety in domains where deep reinforcement learning is applied to the control of autonomous mobile robots. An investigation of the risks associated with malaria infection, aimed at identifying bottlenecks in different malaria elimination techniques, is discussed in [71]. Other relevant studies can be found in [72–74].

3. Methodology

Reinforcement learning (RL) [75] is a machine learning method that falls between supervised and unsupervised learning, in which an agent learns by interacting with the environment. The agent performs actions and receives feedback from the environment in the form of a negative or positive reward, which determines the sequence of good or bad actions to be adopted in a particular situation. As a result, the agent can perform its task efficiently without any intervention from a human. In other words, RL is a learning method in which an agent learns a sequence of actions that eventually increases the reward; the agent decides which action is the most appropriate and yields the maximum reward. An action may not give a positive immediate reward, but the long-term reward is also considered. In RL, we have two components, the agent and the environment, as shown in Figure 1. The agent represents the RL algorithm, and the environment determines which action returns which reward. The interaction begins with the environment sending a state at time t, St ∈ S, where S is the set of possible states, to the agent. The action taken by the agent at time t is represented by At ∈ A(St), where A(St) is the set of actions that can be taken at state St. The reward received for performing that action is represented as Rt+1 ∈ R, where R is the set of rewards. After one time step, the environment sends the next state St+1 to the agent along with the reward Rt+1; this reward helps the agent increase its knowledge and evaluate its last action. This process of sending a state and receiving a reward continues until the environment sends the last, or terminal, state to the agent.
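
The interaction loop described above can be summarized in a few lines of code. The following is a minimal Python sketch, assuming a generic agent with act() and learn() methods and an environment with reset() and step() methods; these interfaces are illustrative placeholders rather than the exact API of the simulator used later in the paper.

def run_episode(agent, env):
    state = env.reset()                                 # environment sends the initial state St
    total_reward, done = 0.0, False
    while not done:                                     # continue until the terminal state
        action = agent.act(state)                       # agent chooses At from A(St)
        next_state, reward, done = env.step(action)     # environment returns Rt+1 and St+1
        agent.learn(state, action, reward, next_state)  # agent updates its knowledge
        state = next_state
        total_reward += reward
    return total_reward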

In addition to the agent and environment, there are four components in an RL system: (i) policy, (ii) reward signal, (iii) value function, and (iv) model of the environment.

(1) Policy. A policy defines the behaviour of an agent at a particular instance of time. Sometimes a policy can be described as a simple function or a lookup table, whereas in other cases it may involve considerable computation, for example, a search process. The policy is considered the central part of the RL agent because it alone can describe the behaviour of the agent. The policy may be stochastic, assigning probabilities to every action. The policy is represented by πt, where πt(a | s) denotes the probability of At = a if St = s.

(2) Reward. A reward signal indicates the target of an RL problem. As a result of an action taken by the agent, the environment returns a number, called a reward, at every time step. The objective of the agent is to maximize the total reward over time. Thus, the reward signal identifies whether an action is good or bad and determines the action to be taken: if an action returns a low reward, the policy is changed to select another action in a similar situation. In general, a reward signal is a stochastic function of the state and action.

(3) Value Function. A reward signal identifies what is good at the current time, while a value function describes what is good in the long run. In almost all RL algorithms, the most important component is the method used to efficiently estimate values. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state by moving the earlier state's value a fraction toward the value of the later state. Let s denote the state before the move and s′ the state after the move; then the update to the estimated value of s, denoted V(s), can be written as shown in equation (1), where α is a small positive fraction called the step-size parameter, which influences the rate of learning, r represents the reward, and γ represents the discounting factor:

V(s) ⟵ V(s) + α[r + γ·V(s′) − V(s)].  (1)

The quantity r + γ·V(s′) is called the Temporal Difference target and serves as an estimate of V(s). This update rule is an example of a Temporal Difference learning method, called so because its changes are based on a difference, r + γ·V(s′) − V(s), that is, the difference between estimates at two different times (a small code sketch of this update is given after this list).

(4) Model. A model allows inferences about how the environment will behave. Given a state and an action, the model can determine the resultant next state and reward. Methods that use models and planning to solve RL problems are known as model-based methods, whereas techniques that are explicitly trial-and-error learners are called model-free methods.
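
As referenced in item (3) above, the following is a small Python sketch of the tabular Temporal Difference update in equation (1); V is assumed to be a dictionary (or array) of state-value estimates, and the function name and default parameter values are illustrative only.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    td_target = r + gamma * V[s_next]    # Temporal Difference target, an estimate of V(s)
    td_error = td_target - V[s]          # the difference between the two estimates
    V[s] = V[s] + alpha * td_error       # move V(s) a fraction alpha toward the target
    return V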

Let us assume that there are finitely many states and rewards, and consider an environment that may respond at time t + 1 to the action taken at time t. In general, this response depends on everything that has happened earlier, so the complete probability distribution of the dynamics of the system is defined in equation (2), for all r, s′, and all possible values of the past actions, states, and rewards S0, A0, R1, …, St−1, At−1, Rt, St, At. However, due to the Markovian property, the response of the environment at t + 1 can be represented as depending only on the state and action at time t, and the dynamics of the environment can then be defined as given in equation (3), for all r, s′, St, and At. A state or an environment has the Markovian property if and only if equations (2) and (3) are equal. The Markovian property is very important in RL, as decisions and values are functions of the current state; these decisions and values can be effective and carry more information when the state representation itself carries enough information:

Pr{St+1 = s′, Rt+1 = r | S0, A0, R1, …, St−1, At−1, Rt, St, At},  (2)

p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}.  (3)

The task of RL that satisfies the Markovian property is known as a Markov Decision Process (MDP). Given state s and action a, the probability of the next state s′ along with reward r is computed as given in equation (4). The expected reward for a state-action pair can be computed as given in equation (5), and the expected reward for a state-action-next-state triple is given in equation (6):

p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a},  (4)

r(s, a) = E[Rt+1 | St = s, At = a] = Σr∈R r Σs′∈S p(s′, r | s, a),  (5)

r(s, a, s′) = E[Rt+1 | St = s, At = a, St+1 = s′] = Σr∈R r p(s′, r | s, a)/p(s′ | s, a).  (6)

Value functions, which are functions of states or state-action pairs, are used to estimate the performance of an agent in a given state. This performance is computed in terms of the future rewards to be collected. The state value is denoted by Vπ(s) for a policy π and state s and is computed as shown in equation (7), where Eπ[·] represents the expectation of a variable when the agent follows policy π from time step t. Similarly, the action value of a state s under a policy π, represented by qπ(s, a), is given in equation (8), where qπ is the action-value function when policy π is used:

Vπ(s) = Eπ[Σk≥0 γ^k Rt+k+1 | St = s],  (7)

qπ(s, a) = Eπ[Σk≥0 γ^k Rt+k+1 | St = s, At = a].  (8)

The RL problem is solved by searching for a policy that helps the agent collect the maximum possible reward over the execution of the simulation. A given policy π is treated as better than or equal to another policy π′ if the expected return of π is greater than or equal to that of π′ for all states; in other words, π ≥ π′ if and only if Vπ(s) ≥ Vπ′(s) ∀ s ∈ S. An optimal policy is a policy that is better than or equal to all other policies. Optimal policies are represented by π*. All optimal policies share the same optimal state-value function V*, defined as V*(s) = maxπ Vπ(s) ∀ s ∈ S. They also share the same optimal action-value function q*, defined as q*(s, a) = maxπ qπ(s, a) ∀ s ∈ S and a ∈ A(s).

Model-based RL means simulating the dynamics of a given environment: the model learns the probability of moving from the current state s, taking an action a, and ending in the next state s′. Given the learned transition probabilities, the agent can determine the probability of entering a state given the current state and action. However, model-based algorithms become impractical as the state space and action space grow. On the other side, model-free algorithms depend on trial-and-error to update their knowledge, so space is not required to store all combinations of states and actions. In this paper, we use model-free algorithms. RL algorithms are also classified as on-policy or off-policy. When the value update is based on the action a derived from the current policy, the algorithm is on-policy; when the action a used in the update is obtained from a different policy (e.g., the greedy policy), it is off-policy.

3.1. Q-Learning

A well-known algorithm in RL is Q-Learning, developed by Watkins [76]; its proof of convergence is given by Jaakkola [77]. Q-Learning is a simple technique that can compute the optimal action value without an intermediary evaluation of cost and without the use of a model [78]. The algorithm is model-free and is considered an off-policy algorithm, derived from the Bellman optimality equation shown in equation (9), where the expectation is given by E and the discounting factor is represented by γ. The corresponding update rule appears in Algorithm 1, where the learning rate is represented by α; the bootstrap target uses the maximizing action a′ in the next state s′ rather than the action given by the current policy. The overall objective of the algorithm is to maximize the Q-value:

Q(s, a) = E[Rt+1 + γ·maxa′ Q(St+1, a′) | St = s, At = a].  (9)

Input:
States: S = 1, …, n
Actions: A = 1, …, n
Rewards: R: S × A ⟶ R
Transitions: T: S × A ⟶ S
α ∈ [0, 1] and γ ∈ [0, 1]
Randomly Initialize Q (s, a) ∀ s ∈ S, a ∈ A (s)
while For every episode do
    Initialize s ∈ S
    while For every step in the episode do
       //Repeat until s is terminal
      Compute π on the basis of Q and the strategy of exploration (e.g. π (s) = argmaxa Q (s, a) with ε-greedy exploration)
      a ⟵ π (s)
      r ⟵ R (s, a)
      s′ ⟵ T (s, a)
      Q (s, a) ⟵ (1 − α)·Q (s, a) + α [r + γ·maxa′ Q (s′, a′)]
      s ⟵ s′
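
As a complement to the pseudocode in Algorithm 1, the following is a minimal tabular Q-learning sketch in Python. The environment interface (reset() and step() returning the next state, reward, and a termination flag) is an assumption made for illustration and not the exact API of the simulator used in Section 4.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=100,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration strategy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy update: bootstrap from the greedy action in s_next
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
            s = s_next
    return Q
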
3.2. SARSA

A similar algorithm to Q-Learning is SARSA [79, 80]. Q-Learning follows a greedy, off-policy update, whereas SARSA is on-policy: it learns the Q-value by performing actions using the current policy, and the current policy is also used to select the next action. Algorithm 2 shows the SARSA algorithm (with eligibility traces, i.e., SARSA(λ)).

Input:
States: S = 1, …, n
Actions: A = 1, …, n
Rewards: R: S × A ⟶ R
Transitions: T: S × A ⟶ S
α ∈ [0, 1] and γ ∈ [0, 1]
λ ∈ [0, 1], the trace-decay parameter, which controls the trade-off between Temporal Difference and Monte Carlo methods
Randomly Initialize Q (s, a) ∀ s ∈ S, a ∈ A (s)
while For every episode do
    Randomly initialize s ∈ S
    Initialize e (s, a) ⟵ 0 ∀ s ∈ S, a ∈ A
    Select a from s on the basis of the exploration strategy (e.g. ε-greedy)
    while For every step in the episode do
      //Repeat until s is terminal
      r ⟵ R (s, a)
      s′ ⟵ T (s, a)
      Compute π based on Q using exploration strategy (e.g. ε-greedy)
      a′ ⟵ π (s′)
      e (s, a) ⟵ e(s, a) + 1
      δ ⟵ r + γ.Q (s′, a′) − Q (s, a)
      for every (ŝ, â) ∈ S × A do
         Q (ŝ, â) ⟵ Q (ŝ, â) + α·δ·e (ŝ, â)
         e (ŝ, â) ⟵ γ·λ·e (ŝ, â)
      s ⟵ s′
      a ⟵ a′
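
To contrast with the Q-learning sketch above, the following is a minimal tabular SARSA sketch in Python (without the eligibility traces of Algorithm 2, which are omitted for brevity). The bootstrap uses the action actually selected by the current ε-greedy policy, which is what makes the update on-policy; the environment interface is the same assumed one as before.

import numpy as np

def sarsa(env, n_states, n_actions, episodes=100,
          alpha=0.1, gamma=0.95, epsilon=0.1):
    def select(Q, s):
        # epsilon-greedy action selection from the current policy
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = select(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(Q, s_next)
            # on-policy update: bootstrap from the action actually chosen next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q
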
3.3. Deep Deterministic Policy Gradient

Deep Deterministic Policy Gradient (DDPG) [81, 82] is an actor-critic architecture. The actor tunes the parameters θμ of the policy as given in equation (10), and, using the Temporal Difference error, the critic evaluates the policy computed by the actor as demonstrated in equation (11). The deterministic policy decided by the actor is denoted by μ(s | θμ), and the critic is denoted by Q(s, a | θQ). DDPG uses the ideas of experience replay and a separate target network as utilized by the Deep Q Network (DQN) [83]. Algorithm 3 shows the DDPG algorithm:

∇θμ J ≈ E[∇a Q(s, a | θQ)|a=μ(s) ∇θμ μ(s | θμ)],  (10)

δ = r + γ·Q(s′, μ(s′ | θμ) | θQ) − Q(s, a | θQ).  (11)

(1) Randomly initialize critic network Q (s, a | θQ) with weights θQ
(2) Randomly initialize actor network μ (s | θμ) with weights θμ
(3) Initialize target network Q′ with weights θQ′ ⟵ θQ
(4) Initialize target network μ′ with weights θμ′ ⟵ θμ
(5) Initialize replay buffer B
(6) while For every episode do
(7)   Initialize a random process N for exploration
(8)   Get initial observation state s1
(9)   while For every step t in the episode do
     //Repeat until s is terminal
(10)    Select action at = μ (st | θμ) + Nt as per the current policy and exploration noise
(11)    Perform action at and observe reward rt and new state st+1
(12)    Store transition (st, at, rt, st+1) in B
(13)    Sample a randomly selected minibatch of M transitions (si, ai, ri, si+1) from B
(14)    yi ⟵ ri + γ·Q′ (si+1, μ′ (si+1 | θμ′) | θQ′)
(15)    L ⟵ (1/M) Σi (yi − Q (si, ai | θQ))²
  //Update rule for critic to minimize the loss
(16)    Update the critic by minimizing the loss L with respect to θQ
  //Update rule for actor policy using the sampled policy gradient
(17)    Update the actor using ∇θμ J ≈ (1/M) Σi ∇a Q (s, a | θQ)|s=si, a=μ(si) ∇θμ μ (s | θμ)|s=si
  //Update rule for target network
(18)    θQ′ ⟵ τθQ + (1 − τ)θQ′ and θμ′ ⟵ τθμ + (1 − τ)θμ′
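
To make the update rules in Algorithm 3 (lines 14-18) concrete, the following is a minimal PyTorch sketch of one DDPG update step. The network classes, optimizers, minibatch format, and hyperparameter values are illustrative assumptions, not the exact implementation used in our experiments.

import torch
import torch.nn as nn

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer

    # critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # update the critic by minimizing the mean squared TD error
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # update the actor with the sampled policy gradient (maximize Q, so minimize -Q)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft-update the target networks: theta' <- tau*theta + (1 - tau)*theta'
    for p, p_t in zip(critic.parameters(), target_critic.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
    for p, p_t in zip(actor.parameters(), target_actor.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)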

4. Simulation and Discussion

In this section, we present the results obtained by the algorithms explained in Section 3 in a simulated human population and examine which algorithm performs best at protecting humans from disease. For the evaluation, we need an environment with states, actions, and agents (representing the human population) looking for the best policy to avoid diseases such as malaria, flu, and HIV. Results are shown here for malaria avoidance only, but a similar environment with sufficient information could be used for the avoidance of other diseases such as flu, HIV, and dengue. Figure 2 shows an environment containing humans, mosquitoes, and the other factors that influence the transmission of the malaria parasite to humans. The box on the left contains factors relevant to humans, and the box on the right contains factors pertaining to mosquitoes. Factors that can influence the disease are shown inside the arrows linking the two boxes, and environmental factors and interventions are shown above and below the boxes.

The IBM Research Africa team has taken steps to control malaria by developing a world-class simulation environment for the distribution of bed nets and repellents. Their goal is to develop a custom agent that helps identify the policies with the best rewards in the simulation environment. Our work leverages the environment developed by IBM Research Africa for the reinforcement learning competition on hexagon-ml (https://compete.hexagon-ml.com/practice/rl_competition/38/), in which an agent learns the best policy for the control of a disease, in this case malaria. The environment provides stochastic transmission models for malaria, allowing researchers to evaluate the impact of different malaria control interventions, and an agent may explore optimal policies to control the spread of the malaria parasite. A diagram of the competition environment for finding the best policy for avoiding malaria is given in Figure 3. The environment spans five years; every year is a state, and at every state, actions are taken in the form of ITN and IRS coverage.

States are represented as S ∈ {1, 2, 3, 4, 5}, where each number denotes the year. We are trying to solve the problem of making one-shot policy recommendations for a simulated intervention period of 5 years. The main control methods used in different regions are the mass distribution of long-lasting ITNs, IRS with pyrethroids, and the prompt and effective treatment of malaria. Actions, represented by A(s), are performed in the form of ITN and IRS coverage, where the values of ITN and IRS are continuous real numbers between 0 and 1.

The agent trained with a reinforcement learning algorithm explores a policy space made up of the first two components, ITNs and IRS, which are the direct intervention strategies; prompt and effective treatment is given by the environment parameters and affects the rewards. The first component, ITN, is the deployment of nets and defines the population coverage (aITN ∈ (0, 1]). The second component is seasonal spraying, and it defines the proportion of the population covered by this intervention (aIRS ∈ (0, 1]); the seasonal spraying is performed by alternating the intervention between April and June every year in different regions. The policy decision is framed as the proportion of the simulated population to be covered by each intervention, and the policy space A is defined through ai ∈ A = [aITN, aIRS].
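
The following Python sketch illustrates how an agent interacts with this policy space, using a random policy as the baseline reported in Section 4. The environment interface shown here (reset() and step() taking a two-dimensional [aITN, aIRS] action) is a simplified stand-in for illustration; the actual hexagon-ml competition environment exposes its own API, which is not reproduced here.

import numpy as np

def random_policy_baseline(env, episodes=100):
    rewards = []
    for _ in range(episodes):
        state = env.reset()                          # year 1 of the 5-year horizon
        episode_reward = 0.0
        for _ in range(5):                           # one action per simulated year
            a_itn = np.random.uniform(0.0, 1.0)      # ITN population coverage
            a_irs = np.random.uniform(0.0, 1.0)      # IRS population coverage
            state, reward, done = env.step([a_itn, a_irs])
            episode_reward += reward
            if done:
                break
        rewards.append(episode_reward)
    return rewards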

Health care organizations should be able to explore all possible sets of actions for appropriate malaria interventions within their populations. These policies include a mix of actions, such as the distribution of ITNs, IRS, larvicide in water, and vaccination for malaria control. The space of possible policies for the control of malaria cannot be explored completely and efficiently by health care experts without an adequate decision support system. The simulation environment handles the distribution of the interventions within the simulated population, while the agent is in charge of the complex, targeted intervention actions, which have not been reported previously. Although the action space is finite (i.e., there is a finite number of people in the simulation environment), its size grows exponentially as more interventions are added, and the computation time of the simulation grows linearly with the population size. Therefore, exhaustive exploration of the entire action space becomes impossible as the simulation approaches real-world complexity.

The agent learns from different rewards during the learning process, and the goal of learning is to collect as much reward as possible during the execution of the experiment. These rewards are real numbers, Rπ ∈ (−∞, +∞), where π represents the policy. Every policy is associated with a reward Rθ(ai), where θ is a stochastic parameterization of the simulation that produces a random distribution of parameters for the simulated environment.

The environment is executed for 100 episodes, and rewards are collected; an episode consists of five consecutive years. The rewards collected by the different algorithms are shown in Figure 4. The rewards collected by random selection, where there is no learning, over 100 episodes are given in Figure 4(a). In random policy learning, every time one episode finishes, the environment is initiated with a different random state and a different policy is tried at random to go from one state to another and collect rewards; no learning is involved, and this experiment is performed only to provide a baseline for comparison with the other algorithms. The results of the Q-learning algorithm are shown in Figure 4(b). Compared to the random search baseline, this algorithm shows improvement, as the agent learns through the Q-learning mechanism how to collect rewards. The rewards collected by the SARSA algorithm are shown in Figure 4(c); the SARSA-trained agent searches for a policy to avoid malaria in the simulated human environment and shows improvement over the simple Q-learning algorithm. Finally, the more sophisticated DDPG algorithm is used in the environment, and its results are shown in Figure 4(d). DDPG improves upon the other three algorithms, demonstrating that deep learning based methods can achieve better results in reinforcement learning settings such as this one.

We combine the results of the algorithms trained in this paper in Figure 5. In the random search process, there is no learning, so the reward is not maximized, whereas Q-learning, SARSA, and DDPG all involve learning and therefore increase the collected reward. The overall rewards collected by the different algorithms are combined in Figure 5(b). The maximum reward is collected by DDPG, which uses the most sophisticated learning mechanism. The comparison of the algorithms is shown in Table 1, which lists the best policy obtained by each algorithm for avoiding malaria in the environment and the reward collected by executing that policy. The table demonstrates that DDPG has outperformed the traditional learning algorithms.

5. Conclusion

Since the development of human civilizations, humans have been in a constant quest to improve the quality of life from different perspectives: we look for comfortable accommodation, fast and secure transport, clean and healthy food, comfortable clothes, and many other things. However, because of environmental changes and the actions taken by humans, different pathogens can enter the human body and affect the quality of life. For instance, malaria, flu, HIV, and dengue are diseases that affect not only a single individual but potentially the whole population, as the infection spreads through it. Over time, humans have learned different methods to treat these diseases; doctors prescribe medicine to treat diseases, and hence diseases are kept under control. The problem is that a doctor's decision requires a huge amount of knowledge and experience to cure a disease effectively. We think it is possible to reduce this human effort by exploring AI-based solutions. Researchers have explored different AI-based solutions in the form of supervised learning, such as ANN, KNN, and SVM. However, the problem with these supervised learning methods is that the model is trained on existing data and can only make similar decisions when similar data are presented at test time, which leaves a large gap in generalization. Therefore, unsupervised learning algorithms and reinforcement learning are becoming popular. In this paper, we have explored reinforcement learning-based algorithms, in which an agent interacts with the environment, receives feedback, and improves its state of knowledge. We have experimented with three different reinforcement learning algorithms: Q-Learning, SARSA, and DDPG. All of these algorithms perform better than random search, as learning is involved. Q-Learning and SARSA are based on traditional reinforcement learning methods, whereas DDPG is a deep learning-based algorithm, reflecting the recent interest in introducing deep learning into reinforcement learning. Our experiments have demonstrated that deep learning-based algorithms are the most suitable for this type of complex environment, where humans, their actions, the environment, and its feedback all play a very important role.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under Grant no. DF-458-156-1441. The authors, therefore, gratefully acknowledge DSR technical and financial support.