Inventory management is a sequential decision problem that can be solved with reinforcement learning (RL). Although RL in its conventional form does not require domain knowledge, exploiting knowledge of the problem structure, which is usually available in inventory management, can improve both the learning quality and the learning speed of RL. Ruminative reinforcement learning (RRL) was recently introduced based on this approach. RRL is motivated by how humans contemplate the consequences of their actions when trying to learn to make better decisions. This study further investigates the issues of RRL and proposes new RRL methods applied to inventory management. Our investigation provides insight into different RRL characteristics, and our experimental results show the viability of the new methods.

1. Introduction

Inventory management is a crucial business activity and can be modeled as a sequential decision problem. Bertsimas and Thiele [1], among others, addressed the need for an efficient and flexible inventory solution that is also simple to implement in practice. This may be among the reasons for the extensive studies of reinforcement learning (RL) applied to inventory management.

RL [2, 3] is an approach to solving sequential decision problems based on learning the underlying state value or state-action value. Because it relies on a learning mechanism, RL in its typical form does not require knowledge of the structure of the problem. Therefore, RL has been studied in a wide range of sequential decision problems, for example, virtual machine configuration [4], robotics [5], helicopter control [6], ventilation, heating, and air conditioning control [7], electricity trade [8], financial management [9], water resource management [10], and inventory management [11]. Acceptance of RL is credited to its effectiveness, potential possibilities [12], link to mammal learning processes [13, 14], and model-free property [15].

Despite the fascination with RL’s model-free property, most inventory management problems can naturally be formulated as a well-structured part interacting with another part that is less understood. That is, replenishment cost, holding cost, and penalty cost can be determined precisely in advance. On the other hand, customer demand or, in some cases, delivery time or availability of supplies is usually less predictable. However, once the value of a less predictable variable is known, the period cost can be determined precisely. Specifically, a warehouse would know its period inventory cost after its replenishment has arrived and all demand orders in the period have been observed. Calculating a period cost follows a well-defined formula, whereas another part, for example, demand, is less predictable. Knowledge about the well-structured part can be exploited, while a learning mechanism can be used to handle the less understood part.

Utilizing this knowledge, Kim et al. [16] proposed asynchronous action-reward learning, which used simulation to evaluate the consequences of actions not taken in order to accelerate the learning process in a stateless system. Extending the idea to state-based systems, Katanyukul [17] developed ruminative reinforcement learning (RRL) methods, that is, ruminative SARSA (RSarsa) and policy-weighted RSarsa (PRS). The RRL approach is motivated by how humans contemplate the consequences of their actions to improve their learning, hoping to make a better decision. His study of RRL reveals good potential of the approach. However, the existing individual methods show strengths in different scenarios: RSarsa is shown to learn fast but to yield inferior learning quality in the long run, whereas PRS is shown to yield superior learning quality in the long run, but at a slower rate.

Our proposed method here is developed to exploit the fast learning characteristic of RSarsa and good learning quality in a long-term run of PRS. Our experimental results show effectiveness of the proposed method and support our assumption underlying development of RRL.

2. Background

An objective of sequential inventory management is to minimize a long-term cost,

$$\min_{a_0, a_1, \ldots} \sum_{t=0}^{\infty} \gamma^{t}\, E[r_t \mid s_0, a_0, \ldots, a_t],$$

subject to $a_t \in \mathcal{A}(s_t)$ for $t = 0, 1, \ldots$, where $E[r_t \mid s_0, a_0, \ldots, a_t]$ is the expected period cost of period $t$ given an initial state $s_0$ and actions $a_0, \ldots, a_t$ over periods $0$ to $t$, respectively; $\gamma$ is a discount factor; and $\mathcal{A}(s_t)$ is a feasible action set at state $s_t$. Under certain assumptions, the problem can be posed as a Markov decision problem (MDP) (see [15] for details). In this case, what we seek is an optimal policy, which maps each state to an optimal action. Given an arbitrary policy $\pi$, the long-term state cost for that policy can be written as

$$V^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} p^{\pi}(s' \mid s)\, V^{\pi}(s'), \tag{1}$$

where $r^{\pi}(s)$ is an expected period state cost and $p^{\pi}(s' \mid s)$ is a transition probability, that is, the probability of the next state being $s'$ when the current state is $s$. The superscript notation indicates dependence on the policy $\pi$. In practice, an exact solution to (1) is difficult to find. Reinforcement learning (RL) [2] provides a framework to find an approximate solution. An approximate long-term cost of state $s$ is obtained by summation of the period cost and a long-term cost of the next state $s'$.

The RL approach is based on temporal difference (TD) learning, which uses the temporal difference error $\psi$ (2) to estimate the long-term cost (3):

$$\psi = r + \gamma\, Q(s', a') - Q(s, a), \tag{2}$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha\, \psi, \tag{3}$$

where $r$ is the period cost, which corresponds to taking action $a$ in state $s$, $\alpha$ is a learning rate, and $s'$ and $a'$ are the state and action taken in the next period, respectively.

Once the values of $Q(s, a)$ are thoroughly learned, they are good approximations of long-term costs. We often refer to $Q(s, a)$ as the “Q-value.” Most RL methods determine the actions to take based on Q-values. These methods include SARSA [2], a widely used RL algorithm. We use SARSA as a benchmark, representing a conventional RL method, to compare with other methods under investigation. In each period, given observed state $s$, action taken $a$, observed period cost $r$, observed next state $s'$, and anticipated next action $a'$, the SARSA algorithm updates the Q-value based on TD learning (2) and (3).
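As an illustration, the TD update used by SARSA can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiments: it assumes a tabular Q-value stored in a dictionary (the experiments use tile coding instead), and the function names `td_error` and `sarsa_update` are hypothetical.

```python
from collections import defaultdict


def td_error(Q, s, a, r, s_next, a_next, gamma):
    """TD error: period cost plus discounted next Q-value, minus current Q-value."""
    return r + gamma * Q[(s_next, a_next)] - Q[(s, a)]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Move Q(s, a) toward the one-step target by a fraction alpha of the TD error."""
    psi = td_error(Q, s, a, r, s_next, a_next, gamma)
    Q[(s, a)] += alpha * psi
    return psi


# Tabular Q-values, all initialized to zero (an assumption for this sketch).
Q = defaultdict(float)
psi = sarsa_update(Q, s=5, a=2, r=10.0, s_next=4, a_next=3, alpha=0.5, gamma=0.9)
```

With all Q-values at zero, the TD error of the first update equals the observed period cost, and the updated entry moves halfway toward it because the learning rate is 0.5.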

Based on the Q-value, we can define a policy $\pi$ to determine an action to take at each state. The policy is usually stochastic, defined by a probability $\pi(a \mid s)$ to take an action $a$ given a state $s$. The policy has to balance between taking the best action based on the currently learned Q-value and trying another alternative. Trying another alternative gives the learning agent a chance to explore thoroughly the consequences of its state-action space. This helps to create a constructive cycle of improving the quality of learned Q-values, which in turn will help the agent to choose better actions and reduce the chance of getting stuck in a local optimum. This is an issue of balancing between exploitation and exploration, as discussed in Sutton and Barto [2]. (Since the RL algorithm is autonomous and interacts with its environment, we sometimes use the term “learning agent.”)

An $\epsilon$-greedy policy is a general RL policy, which also is easy to implement. With probability $\epsilon$, the policy chooses an action randomly from $\mathcal{A}(s)$, where $\mathcal{A}(s)$ is a set of allowable actions given state $s$. Otherwise, it takes an action corresponding to the minimal current Q-value, $a^{*} = \arg\min_{a \in \mathcal{A}(s)} Q(s, a)$.
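The policy described above can be sketched as follows. This is a minimal sketch under the same tabular assumption (Q-values in a dictionary, unseen entries treated as zero); the function name `epsilon_greedy` and the `rng` parameter are hypothetical. Note that the greedy action minimizes the Q-value because Q-values represent costs here.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick a random allowable action with probability epsilon, else the min-cost one."""
    if rng.random() < epsilon:
        return rng.choice(actions)          # explore
    # exploit: the action with the minimal current Q-value (costs, not rewards)
    return min(actions, key=lambda a: Q.get((s, a), 0.0))


Q = {(0, 0): 3.0, (0, 1): 1.0, (0, 2): 2.0}
greedy_action = epsilon_greedy(Q, 0, [0, 1, 2], epsilon=0.0)
random_action = epsilon_greedy(Q, 0, [0, 1, 2], epsilon=1.0)
```

Setting `epsilon=0.0` makes the policy purely greedy, and `epsilon=1.0` makes it purely random; values in between trade exploitation against exploration.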

3. Ruminative Reinforcement Learning

The conventional RL approach, SARSA, assumes that the agent knows only the current state $s$, the action $a$ it takes, the period cost $r$, the next state $s'$, and the action $a'$ it will take in the next state. Each period, the SARSA agent updates the Q-value based on the TD error calculated with these five variables. Figure 1 illustrates the SARSA agent, the five variables it needs to update the Q-value, and its interaction with its environment.

However, in inventory management problems, we usually have extra knowledge about the environment. That is, the problem structure can naturally be formulated such that the period cost $r$ and next state $s'$ are determined by a function $k : (s, a, \xi) \mapsto (r, s')$, where $\xi$ is an extra information variable. This variable captures the stochastic aspect of the problem. The process generating $\xi$ may be unknown, but the value of $\xi$ is fully observable after the period is over. Given a value of $\xi$, along with $s$ and $a$, the deterministic function $k$ can precisely determine $r$ and $s'$.

Without this extra knowledge, each period, the SARSA agent updates only one value of $Q(s, a)$ corresponding to current state $s$ and action taken $a$. However, with the function $k$ and an observed value of $\xi$, we can do “rumination”: evaluating the consequences of other actions $\tilde{a} \in \mathcal{A}(s)$, even those that were not taken. Figure 2 illustrates rumination and its associated variables. Given the rumination mechanism, we can provide information required by SARSA’s TD calculation for any underlying action. Katanyukul [17] introduced this rumination idea and incorporated it into the SARSA algorithm, resulting in the ruminative SARSA (RSarsa) algorithm. Algorithm 1 shows the RSarsa algorithm. It should be noted that RSarsa is similar to SARSA, but with inclusion of rumination from line 8 to line 13.

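(L00) Initialize $Q(s, a)$.
(L01) Observe $s$.
(L02) Determine $a$ by policy $\pi$.
(L03) For each period,
(L04)  observe $r$, $s'$, and $\xi$;
(L05)  determine $a'$ by policy $\pi$;
(L06)  calculate $\psi = r + \gamma Q(s', a') - Q(s, a)$;
(L07)  update $Q(s, a) \leftarrow Q(s, a) + \alpha \psi$;
(L08)  for each $\tilde{a} \in \mathcal{A}(s)$,
(L09)   calculate $\tilde{r}$, $\tilde{s}'$ with $k(s, \tilde{a}, \xi)$,
(L10)   determine $\tilde{a}'$ by policy $\pi$,
(L11)   calculate $\tilde{\psi} = \tilde{r} + \gamma Q(\tilde{s}', \tilde{a}') - Q(s, \tilde{a})$,
(L12)   update $Q(s, \tilde{a}) \leftarrow Q(s, \tilde{a}) + \alpha \tilde{\psi}$
(L13)  until ruminated all $\tilde{a} \in \mathcal{A}(s)$;
(L14)  set $s \leftarrow s'$ and $a \leftarrow a'$
(L15) until termination.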
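One period of the RSarsa loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: Q-values are tabular, the rumination loop visits every allowable action including the one actually taken (skipping it is a possible variant), and the names `rsarsa_step`, `toy_model`, `toy_actions`, and `greedy` are hypothetical. The toy model is only a stand-in for the deterministic cost and transition function and is not the inventory model used in the experiments.

```python
def td_error(Q, s, a, r, s_next, a_next, gamma):
    """TD error with unseen Q-values treated as zero."""
    return r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)


def rsarsa_step(Q, s, a, r, s_next, a_next, xi, actions, model, policy, alpha, gamma):
    """One period of RSarsa: the ordinary SARSA update, then rumination."""
    # (L06)-(L07): SARSA update for the action actually taken
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error(Q, s, a, r, s_next, a_next, gamma)
    # (L08)-(L13): ruminate over every allowable action using the observed xi
    for a_r in actions(s):
        r_r, s_r = model(s, a_r, xi)          # what would have happened under a_r
        a_r_next = policy(Q, s_r)             # action the policy would pick next
        psi_r = td_error(Q, s, a_r, r_r, s_r, a_r_next, gamma)
        Q[(s, a_r)] = Q.get((s, a_r), 0.0) + alpha * psi_r
    # (L14): advance to the next period
    return s_next, a_next


# Toy stand-ins (assumptions for illustration only)
def toy_actions(s):
    return [0, 1, 2]


def toy_model(s, a, xi):
    level = s + a - xi                         # stock after ordering a, facing demand xi
    return float(abs(level)), max(level, 0)    # toy cost and next state


def greedy(Q, s):
    return min(toy_actions(s), key=lambda a: Q.get((s, a), 0.0))


Q = {}
s, a, xi = 2, 1, 3
r, s_next = toy_model(s, a, xi)
a_next = greedy(Q, s_next)
s, a = rsarsa_step(Q, s, a, r, s_next, a_next, xi,
                   toy_actions, toy_model, greedy, alpha=0.5, gamma=0.9)
```

After this single period, all three actions at the starting state have been updated, although only one of them was actually taken; this is the source of both the speedup and the extra computation per period.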

The experiments in [17] showed that RSarsa performed significantly better than SARSA in early periods (indicating faster learning), but its performance was inferior to SARSA’s in later periods (indicating poor convergence to the appropriate long-term state cost approximation). Katanyukul [17] attributed RSarsa’s poor long-term learning quality to its lack of natural action visitation frequency.

TD learning (2) and (3) update the Q-value as an approximation of the long-term state cost. The transition probability in (1) does not appear explicitly in the TD learning calculation. Conventional RL relies on sampling trajectories to reflect the natural frequency of visits to state-action pairs corresponding to the transition probability. It updates only the state-action pairs as they are actually visited; therefore, it does not require explicit calculation of the transition probability and still eventually converges to a good approximation.

However, because RSarsa does rumination for all actions ignoring their sampling frequency, this is equivalent to disregarding the transition probability, which leads to RSarsa’s poor long-term learning quality.

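To address this issue, Katanyukul [17] proposed policy-weighted RSarsa (PRS). PRS explicitly calculates probabilities of actions to be ruminated and adjusts the weights of their updates. PRS is similar to RSarsa, but the rumination update (line 12 in Algorithm 1) is replaced by

$$Q(s, \tilde{a}) \leftarrow Q(s, \tilde{a}) + \beta\, \tilde{\psi}, \tag{4}$$

where $\beta = \alpha \cdot \pi(\tilde{a} \mid s)$ and $\pi(\tilde{a} \mid s)$ is the probability of taking action $\tilde{a}$ in state $s$ with policy $\pi$. Given an $\epsilon$-greedy policy, we have $\pi(\tilde{a} \mid s) = \epsilon / |\mathcal{A}(s)| + (1 - \epsilon)$ for $\tilde{a} = \arg\min_{a \in \mathcal{A}(s)} Q(s, a)$ and $\pi(\tilde{a} \mid s) = \epsilon / |\mathcal{A}(s)|$ otherwise, where $|\mathcal{A}(s)|$ is the number of allowable actions. PRS has been shown to perform well in early and later periods, compared to SARSA. However, RSarsa is reported to significantly outperform PRS in early periods.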
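The policy weighting used by PRS can be illustrated with a short sketch. It assumes an epsilon-greedy policy whose random branch draws uniformly from all allowable actions (so the greedy action can also be drawn at random), and it assumes that the rumination learning rate is the base learning rate scaled by the policy probability of the ruminated action. The function names are hypothetical.

```python
def epsilon_greedy_prob(Q, s, a, actions, epsilon):
    """Probability that an epsilon-greedy policy picks action a in state s."""
    greedy = min(actions, key=lambda b: Q.get((s, b), 0.0))
    p = epsilon / len(actions)       # share received through random exploration
    if a == greedy:
        p += 1.0 - epsilon           # extra mass on the min-cost action
    return p


def prs_learning_rate(Q, s, a, actions, alpha, epsilon):
    """Rumination learning rate: base rate weighted by the policy probability."""
    return alpha * epsilon_greedy_prob(Q, s, a, actions, epsilon)


Q = {(0, 0): 1.0, (0, 1): 2.0}
p_greedy = epsilon_greedy_prob(Q, 0, 0, [0, 1], epsilon=0.2)
p_other = epsilon_greedy_prob(Q, 0, 1, [0, 1], epsilon=0.2)
beta_other = prs_learning_rate(Q, 0, 1, [0, 1], alpha=0.5, epsilon=0.2)
```

Because rarely chosen actions receive a small weight, their ruminated updates move the Q-value only slightly, which mimics the natural visitation frequency that plain RSarsa ignores.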

4. New Methods

According to the results of [17], although RSarsa may converge to a wrong approximation, RSarsa was shown to perform impressively in the very early periods. This suggests that if we jump-start the learning agent with RSarsa and then later switch to PRS, before the Q-values settle into bad spots, we may be able to achieve both faster learning and good approximation for a long-term run.

PRS.Beta. We first introduce a straightforward idea, called PRS.Beta, where we will use a varying ruminative learning rate as a mechanism to shift from full rumination (RSarsa) to policy-weighted rumination (PRS). Similar to PRS, the rumination update is determined by (4). However, the value of the rumination learning rate is determined by where is a function having a value between and . When , and the algorithm will behave like RSarsa. When , and the algorithm will behave like PRS. We want to start out close to and grow to at a proper rate. By examining our preliminary experiments, the TD error will get smaller as the learning converges. This is actually a property of TD learning. Given this property, we can use the magnitude of the TD error to control the shifting, such that where is a scaling factor. Figure 3 illustrates the effects of different values of . Since the magnitude of should be relative to , we set , so that the magnitude of will be in a proper scale relative to and automatically adjusted.

RSarsa.TD. Building on the PRS.Beta method above, we next propose another method, called RSarsa.TD. The underlying idea is that, since SARSA performs well in the long run (see [2] for theoretical discussion of SARSA’s optimality and convergence properties), we can simply switch back to SARSA after speeding up the early learning process with rumination. This approach utilizes the fast learning characteristic of full rumination in early periods and avoids its poor long-term performance. In addition, as the computational cost of rumination is proportional to the size of the ruminative action space $|\mathcal{A}(s)|$, this also helps to reduce the computational cost incurred by rumination. It is also intuitively appealing in the sense that we do rumination only when we need it.

The intuition to selectively do rumination was introduced in [17] in an attempt to reduce the extra computational cost from rumination. There, the probability to do rumination was a function of the magnitude of the TD error: However, Katanyukul [17] investigated this selective rumination only with the policy-weighted method and called it PRS.TD. Although PRS.TD was able to improve the computational cost of the rumination approach, the inventory management performance of PRS.TD was reported to have mixed results, implying that incorporation of selective rumination may deteriorate performance of PRS.

This performance deterioration may be due to using with policy weighted correction. Both schemes use to control their effect of rumination; therefore, they might have an effect equivalent to overcorrecting the state-transition probability. Unlike PRS, RSarsa does not correct the state-transition probability. Incorporating selective rumination (7) will be the only scheme controlling rumination with . Therefore, we expect that this approach may allow the advantage of RSarsa’s fast learning, while maintaining the long-term learning quality of SARSA.

5. Experiments and Results

Our study uses computer simulations to conduct numerical experiments on three inventory management problem settings (P1, P2, and P3). All problems are periodic review single-echelon with nonzero setup cost. P1 and P2 have one-period lead time. P3 has two-period lead time. The same Markov model is used to govern all problem environments, but with different settings. For P1 and P2, the problem state space is , for on-hand and in-transit inventories: and , respectively. P3’s state space is , for and in-transit inventories and . The action space is , for replenishment order .

The state transition is specified by where is a number of lead time periods.

The inventory period cost is calculated from the equation where , , , and are setup, unit, holding, and penalty costs, respectively, and is a step function. Five RL agents are studied: SARSA, RSarsa, PRS, RSarsa.TD, and PRS.Beta.
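To make the roles of the four cost parameters concrete, the following sketch implements one period of a textbook single-echelon inventory model. It is an assumption-laden illustration rather than the exact equations of this study: it assumes that unmet demand is backlogged (negative on-hand inventory), that holding and penalty costs are charged on the end-of-period inventory level, and that the setup cost is charged whenever the order is positive. The function name and argument names are hypothetical.

```python
def inventory_step(x, in_transit, order, demand, K, c, h, b):
    """One period of an assumed backlogging inventory model.

    x          : on-hand inventory (negative values mean backlog)
    in_transit : list of outstanding orders, oldest first (length = lead time)
    order      : replenishment order placed this period
    K, c, h, b : setup, unit, holding, and penalty costs
    """
    arriving = in_transit[0]                       # oldest order arrives
    x_next = x + arriving - demand                 # inventory after demand
    pipeline = list(in_transit[1:]) + [order]      # shift pipeline, append new order
    setup = K if order > 0 else 0.0                # step function on the order size
    cost = setup + c * order + h * max(x_next, 0) + b * max(-x_next, 0)
    return cost, x_next, pipeline


cost_a, x_a, pipe_a = inventory_step(5, [3], 4, 6, K=10, c=1, h=2, b=5)
cost_b, x_b, pipe_b = inventory_step(0, [1], 0, 3, K=10, c=1, h=2, b=5)
```

In the first call, the order triggers the setup cost and two leftover units incur holding cost; in the second call, no order is placed and two units of unmet demand incur the penalty cost. A function of this kind is what rumination needs: given the observed demand, it returns the cost and next state for any candidate order.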

Each experiment is repeated 10 times. In each repetition, an agent is initialized with all zero Q-values. Then, the experiment is run consecutively for episodes. Each episode starts with initial state and action as follows: for all problems, and are initialized with values randomly drawn between and . In P1, is initialized to ; in P2, is initialized from randomly drawn values between and ; in P3, is initialized to and randomly drawn values of between and . Each episode ends when periods are reached or an agent has visited a termination state, which is a state lying outside a valid range of Q-value implementation. The maximum number of periods in each episode, , defines the length of the problem horizon, while the number of episodes specifies a variety of problem scenarios, that is, different initial states and actions.

Three problem settings are used in our experiments. Problem 1 (P1) has , , , , , and . Demand is normally distributed, with mean and standard deviation , denoted as . The environment state is set as the RL agent state . Problem 2 (P2) has , , , , , and , with demand . The RL agent state is set as the inventory level . Therefore, the RL agent state is one-dimensional. Problem 3 (P3) has , , , , , and . The demand is AR1/GARCH(1,1): ; and , where and are AR1 model parameters; , , and are GARCH(1,1) parameters; and is white noise distributed according to . The values of AR1/GARCH(1,1) in our experiments are , , , , and , with initial values , , and . The RL agent state in P3 is three-dimensional . In all three problem settings, the RL agent period cost and action are the inventory period cost and replenishment order, respectively. For RSarsa, PRS, RSarsa.TD, and PRS.Beta, the extra information required by rumination is the inventory demand variable .

The Q-value is implemented using grid tile coding [2] without hashing. Tile coding is a function approximation method based on a linear combination of weights of activated features, called “tiles.” The approximation function $f(x)$ with argument $x$ is given by

$$f(x) = \sum_{i} w_i\, \phi_i(x),$$

where $w_i$ are tile weights and $\phi_i(x)$ are tile activation functions: $\phi_i(x) = 1$ only when $x$ lies inside the hypercube of the $i$th tile, and $\phi_i(x) = 0$ otherwise.

The tile configuration, that is, $\phi_i$, is predefined. Each Q-value is stored using tile coding through the weights. Given a value $v$ to store at any entry $x$, the weights are updated according to

$$w_i = w_i^{\text{old}} + \frac{v - f^{\text{old}}(x)}{N}\, \phi_i(x),$$

where $w_i^{\text{old}}$ and $f^{\text{old}}(x)$ are the weight (of the $i$th tile) and approximation before the new update. Variable $N$ is the number of tiling layers.
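A compact sketch of grid tile coding is given below. It is illustrative only: the class name `TileCoder`, the random offset scheme, and the dictionary used for sparse weight storage are assumptions, and the sketch does not check that the argument lies inside the covered range. The store operation spreads the approximation error equally over the active tiles, one per tiling layer, so that reading the same point back returns the stored value.

```python
import random


class TileCoder:
    """Grid tile coding with randomly offset layers (illustrative sketch)."""

    def __init__(self, lows, highs, partitions, n_layers, seed=0):
        rng = random.Random(seed)
        self.lows = lows
        self.n_layers = n_layers
        # tile width along each dimension
        self.widths = [(hi - lo) / p for lo, hi, p in zip(lows, highs, partitions)]
        # each layer is shifted by a random fraction of one tile width
        self.offsets = [[rng.random() * w for w in self.widths] for _ in range(n_layers)]
        self.weights = {}  # (layer, tile index) -> weight, stored sparsely

    def active_tiles(self, x):
        """Exactly one tile per layer is active for a given point x."""
        tiles = []
        for layer, off in enumerate(self.offsets):
            idx = tuple(int((xi - lo + o) // w)
                        for xi, lo, o, w in zip(x, self.lows, off, self.widths))
            tiles.append((layer, idx))
        return tiles

    def value(self, x):
        """Approximation: sum of the weights of the active tiles."""
        return sum(self.weights.get(t, 0.0) for t in self.active_tiles(x))

    def store(self, x, target):
        """Spread the error equally over the active tiles (one per layer)."""
        delta = (target - self.value(x)) / self.n_layers
        for t in self.active_tiles(x):
            self.weights[t] = self.weights.get(t, 0.0) + delta


tc = TileCoder(lows=[0, 0], highs=[10, 5], partitions=[4, 2], n_layers=5)
tc.store((3.0, 2.0), 7.0)
near = tc.value((3.0, 2.0))
far = tc.value((9.5, 4.9))
```

Points that share some but not all tiles with a stored point receive part of its value, which is how tile coding generalizes across neighboring states and actions.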

For P1, we use a tile coding with 10 tiling layers. Each layer has three-dimensional tiles, covering multidimensional state-action space of corresponding to and . This means that this tile coding allows only a state lying in and a value of action between and . The dimensions, along , , and , are partitioned into 8, 3, and 4 partitions, creating three-dimensional hypercubes for each tiling layer. All layers are overlapping to constitute an entire tile coding set. Layer overlapping is arranged randomly. For P2, we use a tile coding with 5 tiling layers. Each tiling has two-dimensional tiles, covering the space of corresponding to and . For P3, we use a tile coding with 10 tiling layers. Each tiling has four-dimensional tiles, covering the space of corresponding to and .

All RL agents use the -greedy policy with . The learning update uses the learning rate and discount factor .

Figures 4, 5, and 6 show moving averages (of degree 1000) of period costs, in P1, P2, and P3, obtained with different learning agents, as indicated in the legends (“R.TD” is short for RSarsa.TD). Figures 7 and 8 show box plots of average costs obtained with the different methods in early and later periods, respectively.

The results are summarized in Table 1. The computation costs of the methods are measured by relative average computation time per epoch, shown in lines 1–3. Average costs are used as the measure of inventory management performance, and they are shown in lines 4–6 for early periods (periods 1–2000 in P1 and P2 and periods 1–4000 in P3) and lines 7–9 for later periods (periods after the early periods). The numbers in each entry indicate average costs obtained from the corresponding methods. Parentheses reveal results from one-sided Wilcoxon rank sum tests: “W” indicates that the average cost is significantly lower than the average cost obtained from SARSA (); otherwise, the $p$-value is shown instead.

The computation costs of RSarsa, PRS, and PRS.Beta (full rumination) are about 20–30 times that of SARSA (RL without rumination). RSarsa.TD (selective rumination) dramatically reduces the computation cost of rumination at scales of 5–7 times. An evaluation of the effectiveness of each method (compared to SARSA) shows that RSarsa and PRS.Beta significantly outperform SARSA in early periods for all 3 problems. Average costs obtained from RSarsa.TD are lower than those from SARSA, but significance tests can confirm the results only in P1 and P2. It should be noted that PRS results do not show significant improvement over SARSA. This agrees with results in a previous study [17]. With respect to performance in later periods, average costs of PRS and PRS.Beta are lower than SARSA’s in all 3 problems. However, significance tests can confirm only a few results (P1 for PRS and P1 and P2 for PRS.Beta).

Table 2 shows a summary of results from significance tests comparing the previous study’s RRL methods (RSarsa and PRS) to our proposed methods (RSarsa.TD and PRS.Beta). The entries with “W” indicate that our proposed method in the corresponding column significantly outperforms a previous method in the corresponding row (). Otherwise, the $p$-value is indicated.

6. Conclusions and Discussion

Our results have shown that PRS.Beta achieves our goal, which is to address the slow learning rate of PRS, as it significantly outperforms PRS in early periods in all 3 problems, and to address the long-term learning quality of RSarsa, as it significantly outperforms RSarsa in later periods in P1 and P2 and its average cost is lower than RSarsa’s in P3. It should be noted that although the performance of RSarsa.TD may not seem impressive when compared to PRS.Beta’s, RSarsa.TD requires less computational cost. Therefore, as RSarsa.TD shows some improvement over SARSA, this reveals that selective rumination is still worth further study.

It should be noted that PRS.Beta employs TD error to control its behavior (6). The notion to extend TD error to determine learning factors is not limited only to rumination. It may be beneficial to use the TD error signal to determine other learning factors, such as the learning rate, for an adaptive-learning-rate agent. A high TD error indicates that the agent has a lot to learn, that what it has learned is wrong, or that things are changing. For each of these cases, the goal is to make the agent learn more quickly. So, a high TD error should be a clue to increase the learning rate, increase the degree of rumination, or increase the chance to do more exploration.

Among the issues in RL worth investigating, more efficient Q-value representations should be among the priorities. Regardless of the action policy, every RL policy relies on Q-values to determine the action to take. Function approximations suitable to represent Q-values should facilitate efficient realization of an action policy. For example, an $\epsilon$-greedy policy has to search for an optimal action. A Q-value representation suitable for an $\epsilon$-greedy policy should allow efficient search for an optimal action given a state. Another general RL action policy is the softmax policy [2]. Given a state, the softmax policy has to evaluate the probabilities of candidate actions based on their associated Q-values. A representation that facilitates efficient mapping from Q-values to the probabilities would have great practical importance in this case. Due to the interaction between the Q-value representation and the action policy, there are considerable efforts to combine these two concepts. This is an active research direction under the rubric of policy gradient RL [18].
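For reference, the mapping from Q-values to action probabilities under a softmax policy can be sketched as follows. This sketch assumes the common Boltzmann form with a temperature parameter, adapted to cost minimization so that lower Q-values receive higher probability; the function name `softmax_probs` and the `temperature` parameter are hypothetical and are not taken from this study.

```python
import math


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann probabilities over actions, favoring lower costs."""
    m = min(q_values)  # subtract the minimum for numerical stability
    weights = [math.exp(-(q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


probs = softmax_probs([1.0, 2.0, 3.0])
tie = softmax_probs([2.0, 2.0])
```

Every candidate action must be evaluated to normalize the probabilities, which is why a Q-value representation that makes this mapping cheap matters in practice.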

Many issues in RL still need to be explored, both theoretically and in application. The findings reported in this paper provide another step in understanding and applying RL to practical inventory management problems. Even though we investigated only inventory management problems here, our methods can be applied beyond this specific domain. This early step in the study of using TD error to control learning factors, along with investigation of other issues in RL, would yield a more robust learning agent that is useful in a wide range of practical applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgment

The authors would like to thank the Colorado State University Libraries Open Access Research and Scholarship Fund for supporting the publication of this paper.