#### Abstract

We study an online multisource multisink queueing network control problem characterized by a self-organizing network structure and self-organizing job routing. We decompose the self-organizing queueing network control problem into a series of interrelated Markov Decision Processes and construct a control decision model for them based on a coupled reinforcement learning (RL) architecture. To maximize the mean time averaged weighted throughput of the jobs through the network, we propose a reinforcement learning algorithm with time averaged reward to solve the control decision model and obtain a control policy that integrates the job routing strategy and the job sequencing strategy. Computational experiments verify the learning ability and the effectiveness of the proposed reinforcement learning algorithm when applied to the investigated self-organizing network control problem.

#### 1. Introduction

Queueing network optimization problems arise widely in manufacturing, transportation, logistics, computer science, communication, healthcare [1], and other fields. With the rapid development of the Internet of Things, large-scale logistics distribution networks, wireless sensor networks [2–4], new-generation wireless communication networks, and other network technologies, more and more new network structures and network optimization problems are emerging. Optimization of network control is an important factor affecting the efficiency of network operation.

Self-organizing networks are a new kind of queueing network system. In a self-organizing network, each station or node can establish a link with its adjacent stations or nodes, receive jobs from other stations or nodes, and transfer them to other stations or nodes. Because of the complex link relationships among stations or nodes, the paths and the sequence in which jobs go through the network are very complicated. Consequently, the control problem for this kind of network is very complicated. In the literature, researchers concentrate on the control of multihop networks, which are a kind of network with self-organizing characteristics. The research methods for multihop network control fall mainly into two categories. The first decomposes the problem into a series of single-station queueing problems or tandem queueing network problems [5]. The second simplifies the multihop network control problem into a link scheduling problem [6] or a queue management problem [7]. The main task of link scheduling is to establish links between stations and select appropriate paths for transferring jobs. He et al. [8] proposed a load-based scheduling algorithm to optimize the link scheduling between stations so as to balance the load of each station and reduce path congestion. Pinheiro et al. [9] studied link scheduling and path selection by fuzzy control. Augusto et al. [10] simultaneously optimized link scheduling and routing planning. Nandiraju et al. [11] studied the problem of restricting the length of the transmission path and improved the efficiency of long-path transmission. In order to enlarge the network capacity, Gupta and Shroff [12] optimized link scheduling and path selection by solving the maximum weighted matching problem subject to the -hop interference constraints. The main task of queue management is to classify the jobs into job groups and to determine the transmission order of the job groups. Fu and Agrawal [7] focused on the problem of job classification in queue management and improved efficiency by batch processing of the jobs. Nieminen et al. [13] and Wang et al. [14] studied the optimization of energy management and queue management in multihop networks. Liu et al. [15] reduced the transmission delay and shortened the queue length through modeling and analysis based on Markov chains. Kim et al. [16] considered the fairness of customer service and improved the efficiency of the network while reducing the differences in customers’ waiting times. Vučević et al. [17] and Zhou et al. [18] used reinforcement learning (RL) algorithms to optimize queue management, which allocates data packets to the queues.

In this paper, we study an online multisource multisink queueing network control problem with limited queue lengths. We consider the inherent self-organization characteristics of the queueing network problem, transform the problem into Markov Decision Processes (MDPs), and then construct an RL system to solve them. The proposed RL system yields an optimized control strategy and a globally optimized solution. The rest of this paper is organized as follows: we introduce the self-organizing queueing network control problem in Section 2, formulate the problem as an RL model in Section 3, present the detailed RL algorithm in Section 4, conduct computational experiments in Section 5, and draw conclusions in Section 6.

#### 2. Problem Statement

The online self-organizing network control problem concerned in this paper is described as follows. There are stations in the network and types of jobs arrive at the network. Let denote the set of stations in the network and let denote the th job of type . Take the network in Figure 1 as an example. As shown in Figure 1, the self-organizing queueing network is composed of three types of stations. The first type of stations is arrival stations. Each type of jobs has a specific arrival station. The specified arrival station for type jobs is , where denotes the set of arrival stations. The second type of stations is transfer stations, which receive jobs and send them to other transfer stations or destination stations. The set of transfer stations is denoted by , where denotes the th transfer station, denotes the number of transfer stations, and . The third type of stations is destination stations. Each type of jobs has a specific destination station and the jobs of the same type aim to arrive at the same destination station. The specified destination station for type jobs is , where denotes the set of destination stations. Once a job is processed by its specified destination station, it passes through the entire network.

Jobs of type arrive at their arrival station following Poisson process with rate parameter . The arriving jobs wait in the queue of station for transferring. Let denote the set of stations visible to station . Specifically, each arrival station corresponds to a set of visible transfer stations. denotes the set of transfer stations visible for the arrival station of type jobs. Each transfer station also corresponds to a set of visible stations. denotes the set of stations visible for transfer station . contains one or more transfer stations or destination stations. Transfer station is qualified to transfer the jobs of the types in set . One station transfers only one job at a time. From the arrival station to the destination station, each job needs to pass through at least one transfer station. Under certain conditions, the arrival station can establish a link with transfer station if and then send a type job to station . The job waits in the queue to be transferred to another station. The arrival station cannot send a job to transfer station if . Similarly, transfer station can establish a link with station if , select a job from its queue, and send this job to station . If station , then transfer station sends a type job to station only if is the destination station for type jobs, that is, station .

There exist a lot of feasible paths for a job from its arrival station to its destination station. Take the network in Figure 1 as an example. Assume that the arrival station for type jobs is station and the destination station for type jobs is station . The sets of visible stations of stations ~ are , , , , and , respectively. Thus, many paths are feasible for type jobs from the arrival station to the destination station, such as *, **, **, *and .

The queue capacity of each transfer station is limited, which is denoted by . The maximum number of simultaneous transferring jobs for each station is also limited; that is, the number of jobs being transferred to this station cannot exceed a predetermined quantity . Though a station can be linked by more than one upstream station, it is allowed to link to at most one downstream station. In order to establish a link between two stations to send jobs, the following conditions must be met: (1) the downstream station is visible for the upstream station; (2) the number of stations transferring jobs to the downstream station is less than the predetermined maximum number; (3) the queue length of the downstream station has not reached the upper limit.

A station is not allowed to send a job to another station if a link is not established between the two stations. After establishing a link with a downstream station, an upstream station may send one or more jobs until it establishes a new link with another downstream station. A station can send only one job to another station at a time. The time required for establishing a link between two stations is a random variable. Let denote the time required for establishing a link between stations and , which is an exponentially distributed random variable with parameter . The time consumed in transferring a job depends on the station and the job type. The transferring time of a type job by station is denoted by , which is an exponentially distributed random variable with parameter .
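The stochastic primitives described above (Poisson arrivals, exponentially distributed link-establishment and transfer times) can be sampled with the standard library. The following is a minimal sketch for building a simulator; the function names and the rate values are illustrative assumptions, not parameters taken from the paper.

```python
import random

def sample_exponential(rate, rng):
    """Draw one exponentially distributed duration with the given rate (mean 1/rate)."""
    return rng.expovariate(rate)

def poisson_arrival_times(rate, horizon, rng):
    """Arrival epochs of a Poisson process on [0, horizon).

    A Poisson process with rate `rate` has i.i.d. exponential gaps
    between consecutive arrivals, so we accumulate those gaps.
    """
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)

# Hypothetical rates, chosen only for illustration.
rng = random.Random(0)
arrivals = poisson_arrival_times(rate=0.5, horizon=1000.0, rng=rng)
link_time = sample_exponential(rate=2.0, rng=rng)      # link establishment
transfer_time = sample_exponential(rate=1.0, rng=rng)  # job transfer
```

With a rate of 0.5 over a horizon of 1000 time units, roughly 500 arrivals are expected on average.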

The task of network control is to control the routes and the transferring sequence of the jobs. Based on the dynamic status of the queueing network, each station selects an appropriate job from its queue and sends it to an appropriate transfer station or to the job's destination station. The control objective is to maximize the time averaged weighted throughput (i.e., the weighted throughput rate) of the jobs across the network, which is defined as

$$\max \; \frac{1}{t} \sum_{i=1}^{N(t)} w_i, \tag{1}$$

where $t$ is the running time, $N(t)$ is the total number of jobs of all types that have passed across the network by time $t$, and $w_i$ is the weight of the $i$th job passing through the network.
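As a small illustration of the objective, the weighted throughput rate can be computed from a log of completed jobs. This is a sketch under the assumption that each completion is recorded as a `(completion_time, weight)` pair; the function name and the sample data are hypothetical.

```python
def weighted_throughput_rate(completions, t):
    """Sum of the weights of jobs that left the network by time t, divided by t.

    `completions` is an iterable of (completion_time, weight) pairs.
    """
    if t <= 0:
        raise ValueError("running time t must be positive")
    return sum(weight for done_at, weight in completions if done_at <= t) / t

# Three jobs; only the first two have passed through the network by t = 5.
log = [(1.0, 2.0), (3.0, 1.0), (7.0, 4.0)]
rate_at_5 = weighted_throughput_rate(log, 5.0)  # (2.0 + 1.0) / 5.0
```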

The problem addressed above is a new queueing network problem with the following characteristics. (1) It is an online dynamic control problem for multisource multisink networks with limited queue lengths. (2) The jobs’ transferring paths are self-organizing. There are multiple kinds of jobs with different destination stations. For any job, many alternative paths exist from the arrival station to its destination station. The most suitable path is not necessarily the shortest one or the one with the fewest transfer stations. Moreover, the more complex the network structure is, the more flexible the path selection is. Network control must take into account factors such as the global situation, the transferring time of each job at each station, the efficiency of each station, and the length of each station’s queue. (3) The network structure is self-organizing. The topological structure of the network is complex and may be dynamic; that is, the locations and the number of stations and the relationships among the stations may vary over time. The control approach for the queueing network should be able to adapt to changes in the network topology.

In the following sections, an RL model is constructed to depict the above network control problem and an RL algorithm is proposed to deal with it.

#### 3. The Reinforcement Learning Model

To depict the size of the feasible solution space and the difficulty of the self-organizing network control problem, we use a tandem queueing network control problem as an extremely simple example. This tandem queueing network is composed of tandem stations and jobs of different types that need to be processed on each station in the order of . Suppose that each station processes only one job at a time and the processing sequences of the jobs on different stations are independent. For each station, there are possible permutations of the jobs. Thus the number of feasible solutions to this -station network control problem is . If and , then is an enormous figure much larger than 10^{9}. Moreover, the general online self-organizing network control problem is much more complicated than the above tandem queueing network control problem with the same number of stations and job types. Due to the large scale of the self-organizing network, it is difficult to formulate the whole system as a unified model and solve it. We formulate the RL model of the self-organizing queueing network problem described in the previous section. According to the characteristics of the self-organizing queueing network control problem and following the decomposition-association strategy, the whole queueing network is decomposed into a number of closely connected small-scale subnetworks and a Markov Decision Process (MDP) model is constructed for each subnetwork. That is, the whole queueing network control problem is converted into a plurality of interconnected MDP problems. The subnetworks are connected by the coupling mechanism. By using this method, we can enhance the adaptability and robustness of the model and make it more adaptive to the changes of the topology structure of self-organizing networks so as to reduce the size of the problem and keep the essential structure of the original problem.
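The counting argument in the tandem example can be checked directly: if each station independently orders the same set of jobs, the number of joint sequencings is the number of permutations per station raised to the number of stations. The sketch below uses hypothetical sizes, since the values used in the text are not reproduced here.

```python
import math

def tandem_solution_count(num_jobs, num_stations):
    """Number of joint job sequencings when every station orders the jobs independently."""
    return math.factorial(num_jobs) ** num_stations

# Hypothetical sizes: 10 jobs on 2 tandem stations already exceed 10**9.
count = tandem_solution_count(10, 2)
```

Even this very small case gives about 1.3 × 10^13 candidate solutions, which illustrates why enumerating the solution space is impractical.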

We construct a subnetwork for each station in the self-organizing queueing network. For each station, its corresponding subnetwork is centered on this station and contains the adjacent stations linked with it. Each subnetwork corresponds to an RL subsystem, which is used to solve the MDP model formulating the control problem for this subnetwork. The state transition of an RL subsystem directly causes the state variation of its adjacent RL subsystems; thus, adjacent RL subsystems are coupled in state transition.

Reinforcement learning (RL) is a machine learning method proposed to solve large-scale multistage decision problems or Markov Decision Processes with incomplete probability information. In the following we convert the self-organizing queueing network control problems of the subnetworks into RL problems, mainly including representation of states, construction of actions, and definition of the reward function. In this section we also introduce the coupling mechanism of the RL subsystems.

##### 3.1. States Representation

State variables describe the primary characteristics of the RL subsystem. The state of the RL subsystem is represented by vector , which is composed of state variables and defined as (2), where denotes the type of the job being transferred from the central station of the th RL subsystem at the th state ( equals zero if the central station is idle), denotes the number of type jobs waiting in the queue of the central station of the th RL subsystem at the th state, denotes the downstream station to which the central station of the th RL subsystem is linked at the th state, and denotes the number of upstream stations linked to the central station of the th RL subsystem at the th state. The trigger events for state transition are the arrival of a job and the completion of a job transfer at a station.

##### 3.2. Actions

When the central station of an RL subsystem is idle and a trigger event occurs, the RL subsystem selects an action. Since the task of the queueing network control is to control the routes and the transferring sequence of the jobs, an action of an RL subsystem contains two decisions: one is to determine which station the central station is going to connect to and the other is to select a job to transfer from the jobs waiting in its queue. For the th subsystem, the number of available actions is , where denotes the set of stations visible to station , denotes the set of qualified job types that station may select and transfer to station , and denotes the cardinality of set . For an RL subsystem (e.g., the th subsystem) where a trigger event occurs, a feasible action is to select station to link with and send a type job if all the following conditions are satisfied: (1) ; that is, station is a visible station for station ; (2) the number of upstream stations currently transferring jobs to station is less than the predetermined number ; (3) the queue length of station is less than ; (4) job type is a qualified job type for station to select and transfer to station . Trivially, a null action (i.e., selecting no job) is selected if no job is waiting in the queue of station or one of the above conditions is not satisfied.
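The four feasibility conditions can be sketched as a filter over (downstream station, job type) pairs. The data layout below is an assumption made for illustration: `Station`, `qualified`, `queue_cap`, and `max_inbound` are hypothetical names, and the same queue limit is applied to every downstream station for simplicity.

```python
from dataclasses import dataclass, field

@dataclass
class Station:
    name: str
    visible: set = field(default_factory=set)  # stations this one may link to
    queue: dict = field(default_factory=dict)  # job type -> number of waiting jobs
    inbound_links: int = 0                     # upstream stations currently linked

def feasible_actions(station, stations, qualified, queue_cap, max_inbound):
    """Return feasible (downstream, job_type) pairs, or [None] for the null action.

    `qualified` maps (upstream, downstream) to the set of job types that may
    be sent over that link (hypothetical representation).
    """
    actions = []
    for down_name in sorted(station.visible):                 # condition (1)
        down = stations[down_name]
        if down.inbound_links >= max_inbound:                 # condition (2)
            continue
        if sum(down.queue.values()) >= queue_cap:             # condition (3)
            continue
        allowed = qualified.get((station.name, down_name), set())
        for job_type, waiting in sorted(station.queue.items()):
            if waiting > 0 and job_type in allowed:           # condition (4)
                actions.append((down_name, job_type))
    return actions or [None]

stations = {
    "A": Station("A", visible={"B", "C"}, queue={1: 2, 2: 0}),
    "B": Station("B", queue={1: 5}),  # queue already full
    "C": Station("C"),
}
qualified = {("A", "B"): {1}, ("A", "C"): {1, 2}}
actions = feasible_actions(stations["A"], stations, qualified, queue_cap=5, max_inbound=2)
```

In this example station B is excluded because its queue has reached the limit, and job type 2 is excluded because none is waiting, so the only feasible action is to link to C and send a type 1 job.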

##### 3.3. The Reward Function

The reward function indicates the instant and long-term impact of an action; that is, the immediate reward indicates the instant impact of an action and the average reward indicates the optimization of the objective function value. Thus, the whole RL system receives a larger time averaged reward for a larger time averaged weighted throughput. Let denote the time at the th decision-making epoch of the th RL subsystem, that is, the time when the state of the th RL subsystem transfers from into . Let denote the reward that the th RL subsystem, having selected action at state , receives at time . Without loss of generality, assume that the central station of the th RL subsystem transfers a type job during time interval and completes transferring the job at time ; then is defined as (3), where is the weight of the th job type, is the label of the central station for the th job type, is the label of the downstream station to which the type job has just been transferred from station , and is the length of the shortest path from the arrival station to the destination station of a type job (i.e., the least amount of flow time for a type job across the whole network from its arrival station to its destination station). The label of a station for the th job type is defined as the length of the shortest path from the th station to the destination station of the th job type, that is, the least amount of flow time for a type job from the current station to the destination station. The immediate reward represents the progress of a job’s passing through the network during the time between two state transitions. In the following we show the property of the reward function.
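Station labels as defined above are shortest-path distances to the destination, so they can be computed with Dijkstra's algorithm on the reversed graph. The sketch below also uses an assumed reward form, weight × (label of sender − label of receiver) / shortest total length, chosen because it is consistent with the definitions above and with the path-independence stated in Lemmas 1 and 2; it is not a verbatim copy of equation (3). The edge costs and station names are hypothetical.

```python
import heapq

def labels_to_destination(edges, destination):
    """Shortest flow time from every station to `destination`.

    `edges` maps (u, v) to the time needed to move a job from u to v.
    Dijkstra's algorithm is run on the reversed graph.
    """
    reverse = {}
    for (u, v), cost in edges.items():
        reverse.setdefault(v, []).append((u, cost))
    dist = {destination: 0.0}
    heap = [(0.0, destination)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, cost in reverse.get(v, []):
            nd = d + cost
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def hop_reward(weight, label_from, label_to, shortest_total):
    """Assumed reward form: progress toward the destination, scaled by the job weight."""
    return weight * (label_from - label_to) / shortest_total

def path_reward(path, labels, weight, shortest_total):
    """Accumulated reward of a job that follows `path` station by station."""
    return sum(hop_reward(weight, labels[u], labels[v], shortest_total)
               for u, v in zip(path, path[1:]))

# Hypothetical network: arrival station "a", transfer stations "x", "y", destination "d".
edges = {("a", "x"): 1.0, ("a", "y"): 2.0, ("x", "y"): 0.5, ("x", "d"): 3.0, ("y", "d"): 1.0}
labels = labels_to_destination(edges, "d")
total = labels["a"]
r1 = path_reward(["a", "x", "d"], labels, weight=2.0, shortest_total=total)
r2 = path_reward(["a", "y", "d"], labels, weight=2.0, shortest_total=total)
```

Because the per-hop rewards telescope, both paths accumulate a reward equal to the job weight once the destination (label zero) is reached, regardless of the route taken.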

Lemma 1. *For a type job , assume that this job attains a transfer station with label . Whichever path this job attains transfer station through, the accumulated reward of all RL subsystems caused by this job is where is the weight of the th job type and is the length of the shortest path from the arriving station to the destination station of the th job type.*

*Proof.* Without loss of generality, assume that the job starts from its arrival station and before it attains station , it is transferred successively by transfer stations . Let denote the label of station for the th job type. Then the reward caused by the type job during the process of being transferred from station to station is . Similarly, the reward caused by the job during the process of being transferred from station to station is . Consequently, the accumulated reward caused by the job during the process of being transferred from station to station is .

By Lemma 1 we obtain the following lemma.

Lemma 2. *For a type job , assume that this job attains its destination station . Whichever path this job passes through the network from its arrival station to its destination station, the accumulated reward of all RL subsystems caused by this job during the whole process is .*

*Proof.* By Lemma 1, the accumulated reward caused by this job during the whole process of passing through the network is , where denotes the label of station . Since is the destination station of type jobs, by the definition of a station’s label we get . Hence, the claim follows.

A state transition takes place when a new job arrives at the network or a job is completely transferred by any station. Without loss of generality, we assume that the sojourn time of any arriving job in the network is finite. Hence, the total number of jobs staying in the network at any time is finite. According to Lemmas 1 and 2 we prove the following theorem.

Theorem 3. *If there exists a positive integer such that the total number of jobs staying in the network is less than or equal to , then maximizing the time averaged weighted throughput of the network (i.e., the control objective function (1)) is equivalent to maximizing the time averaged reward of all RL subsystems over infinite time.*

*Proof.* Assume that the jobs arriving at the network are divided into two sets and . is the set of jobs having passed through the network by time and is the set of jobs still staying in the network at time . Let denote the th arriving job. Thus, according to Lemmas 1 and 2, the accumulated reward of all RL subsystems by time is given by (10), where is the weight of job , is the length of the shortest path from the arrival station of job to the destination station of job , and denotes the label of the station at which job is at time . By definition, the time averaged weighted throughput by time is given by (11). It follows from (10) and (11) that (12) holds. Because the total number of jobs staying in the network is less than or equal to , that is, , we have (13). It follows from (12) and (13) that (14) holds. Since is a constant, we have (15). It follows from (14) and (15) that (16) holds. Consequently, maximizing the time averaged weighted throughput is equivalent to maximizing the time averaged reward over infinite time. This links the long-term average reward of the RL system to the optimization of the objective function value for the network control problem.
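The structure of this argument can be sketched in a few lines. The notation here is assumed for illustration only: $J_1(t)$ and $J_2(t)$ are the sets of finished and in-network jobs at time $t$, $w_i$ is the weight of job $i$, $L_i$ its shortest total path length, $l_i(t)$ the label of its current station, $R(t)$ the accumulated reward, and $M$ the bound on the number of jobs in the network. The per-job partial reward $w_i (L_i - l_i(t))/L_i$ is the form suggested by Lemmas 1 and 2, not a verbatim copy of the paper's equations, and $C$ is an assumed finite bound on its absolute value (finite when there are finitely many job types and stations).

```latex
\begin{align}
R(t) &= \sum_{i \in J_1(t)} w_i
      + \sum_{i \in J_2(t)} w_i \,\frac{L_i - l_i(t)}{L_i}, \\
\frac{R(t)}{t} - \frac{1}{t}\sum_{i \in J_1(t)} w_i
     &= \frac{1}{t}\sum_{i \in J_2(t)} w_i \,\frac{L_i - l_i(t)}{L_i}, \\
\left|\frac{R(t)}{t} - \frac{1}{t}\sum_{i \in J_1(t)} w_i\right|
     &\le \frac{M\,C}{t} \;\xrightarrow[\,t \to \infty\,]{}\; 0 .
\end{align}
```

Under these assumptions the time averaged reward and the time averaged weighted throughput differ by a term that vanishes as $t$ grows, so they share the same limit and the same maximizers.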

##### 3.4. The State Transition Mechanism and the Coupling Mechanism

The trigger events for state transition in an RL subsystem are the completion of transferring a job to the central station of this subsystem and the completion of transferring a job from the central station of this subsystem. Take the th RL subsystem as an example to illustrate the state transition mechanism. Currently the th RL subsystem is at the th decision-making state . This subsystem takes an action and transfers to an interim state immediately. When a trigger event for the th RL subsystem takes place, the subsystem transfers to a new decision-making state and receives a reward , which is computed from and . The above procedure is repeated until a terminal state is attained; that is, all the jobs reach their destination stations. An episode is a trajectory from the initial state to a terminal state of a scheduling horizon. With the state representation defined in (2), the decision process is a Markov Decision Process.

An RL subsystem is coupled with another RL subsystem if a link is established between the central stations of these two subsystems. For example, for any two stations and , if there is a link from to or from to , then the two RL subsystems with central stations and are coupled with each other. The coupling mechanism of the RL subsystems (as shown in Figure 2) has two main aspects. (1) The first is the coupling of the state transitions of coupled RL subsystems. One action in a subsystem can change the state of a coupled subsystem; that is, the state transition of a subsystem directly changes the state variables of its coupled subsystems. (2) The second is the broadcast mechanism of reward signals among the coupled RL subsystems and the coupled iteration of the state values of the RL subsystems.

To describe the coupling mechanism more precisely and explain the overlap among subsystems, we give an illustrative example. Suppose that is a station visible to station and they are the central stations of the th RL subsystem and the th RL subsystem. Currently the th RL subsystem is at the th decision epoch and station is idle. The th RL subsystem is at the th decision-making state and the th RL subsystem is at the th state , where and . At this decision epoch, the th subsystem takes an action which selects a job of type waiting in its queue, establishes a link with station , and transfers the job to station . The th subsystem transfers to an interim state immediately. The difference between states and is that the job type being transferred on station is , the number of type jobs waiting in the queue of station decreases by one, and the downstream station to which station is linked is station . The th subsystem also transfers to an interim state immediately. The difference between states and is that the number of upstream stations linked to station increases by one. When station completes transferring the type job, the th subsystem receives a reward and transfers to the next decision-making state . The th subsystem transfers to the next decision-making state simultaneously. For each RL subsystem, the state transition process continues as above until all the jobs reach their destination stations.

For the th RL subsystem, when its central station completes transferring a type job at the decision epoch, the state value of the th RL subsystem is updated with its immediate reward following the proposed RL algorithm. The state values of all the subsystems coupled with the th subsystem are also updated. Assume that the th subsystem is coupled with the th subsystem; then the virtual reward, denoted by , for updating the state value of the th subsystem is defined as (17), where is the length of the shortest path from the arrival station to the destination station of the th job type and is the transfer time of a type job from station . The detailed computation procedure of the RL algorithm is shown in Section 4.

#### 4. A Reinforcement Learning Algorithm with Time Averaged Reward

The online self-organizing queueing network control problem was converted into an RL problem in the previous section. We apply reinforcement learning to solve this problem and use an $\varepsilon$-greedy policy to balance exploration and exploitation. Under the $\varepsilon$-greedy policy, the algorithm selects the greedy action with probability $1 - \varepsilon$ and selects an available action at random with probability $\varepsilon$, where $\varepsilon$ is usually a small positive number.
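The action-selection rule just described can be sketched as follows. The function name and the `value_of` callback (which scores an action, for example by the estimated value of the state it leads to) are illustrative assumptions.

```python
import random

def epsilon_greedy(actions, value_of, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if not actions:
        raise ValueError("no available actions")
    if rng.random() < epsilon:
        return rng.choice(actions)        # explore
    return max(actions, key=value_of)     # exploit

# Hypothetical action values, for illustration only.
scores = {"link_B": 0.2, "link_C": 0.7, "null": 0.0}
greedy_choice = epsilon_greedy(list(scores), scores.get, epsilon=0.0)
random_choice = epsilon_greedy(list(scores), scores.get, epsilon=1.0, rng=random.Random(1))
```

Setting `epsilon` to 0 gives a purely greedy rule and setting it to 1 gives purely random selection, which are the two extremes of this policy family.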

We propose the following reinforcement learning algorithm (Algorithm 4), where is the number of RL subsystems, denotes the state space of the th RL subsystem, denotes the value of state , is the learning rate for state values, denotes the estimated reward rate of the th RL subsystem, is the learning rate for the estimated reward rates, is the total number of jobs required to pass through the network, and is the number of jobs currently having passed the network.

*Algorithm 4 (a reinforcement learning algorithm with time averaged reward for the online self-organizing queueing network control problem).*

*Step 1*. For each job type , create a network composed of the stations qualified to transfer type jobs. For each station in network , compute its label .

*Step 2*. Set parameters , , , and to the predetermined values and initialize to be zero. For each , set and current time and initialize to be one. For each , set current state as the initial state and initialize for all ; that is, initialize the state value function for all states of all RL subsystems.

*Step 3*. Determine the station where the trigger event occurs (e.g., station ). Determine the current state for station and the set of available actions for the th RL subsystem at state .

*Step 4*. Select action based on the current state value of the th RL subsystem following the $\varepsilon$-greedy control policy.

*Step 5*. Implement action and determine the next time , the time at the decision-making epoch of the th RL subsystem, following the state transition mechanism. Determine the state at time and calculate reward by (3).

*Step 6*. Update the state value as . Update as , where denotes an available action which may be taken at state , denotes the state to which the th subsystem transfers if action is taken at state , denotes the time when the next state transition takes place if action is taken at state , denotes an available action which may be taken at state , denotes the state to which the th subsystem transfers if action is taken at state , and denotes the time when the next state transition takes place if action is taken at state .

*Step 7*. Update the RL subsystems coupled with the th RL subsystem as follows. For each RL subsystem coupled with the th RL subsystem (e.g., the th subsystem at state ), compute by (17) and then update and as .

*Step 8*. Set and adjust the current time by for the th subsystem. Update the current states of all subsystems following the state transition mechanism.

*Step 9*. If the trigger event is that a job finishes passing through the network, then . If the number of jobs across the network is , then the algorithm terminates; otherwise go to Step 3.
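To make the roles of the state values, the estimated reward rate, and the two learning rates more concrete, the sketch below shows one generic average-reward temporal-difference step for a semi-Markov subsystem. This is a textbook-style update swapped in for illustration, not the update equations of Algorithm 4: the TD error subtracts the reward expected over the elapsed time, and the reward rate is tracked by exponentially smoothing the observed reward per unit time (a simplification). All names are hypothetical.

```python
def average_reward_td_update(values, rho, state, next_state, reward, elapsed, alpha, beta):
    """Apply one generic average-reward TD(0) step and return the new reward-rate estimate.

    values  : dict mapping state -> estimated relative value (updated in place)
    rho     : current estimate of the reward rate (reward per unit time)
    elapsed : time between the two decision epochs (must be positive)
    alpha   : learning rate for state values
    beta    : learning rate for the reward-rate estimate
    """
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    v_now = values.get(state, 0.0)
    v_next = values.get(next_state, 0.0)
    # Reward received minus the reward expected at rate rho over the elapsed time.
    td_error = reward - rho * elapsed + v_next - v_now
    values[state] = v_now + alpha * td_error
    # Simplified tracker: smooth the observed reward per unit time.
    return (1.0 - beta) * rho + beta * (reward / elapsed)

values = {}
rho = average_reward_td_update(values, rho=0.0, state="s", next_state="s2",
                               reward=1.0, elapsed=2.0, alpha=0.5, beta=0.1)
```

In a coupled setting, the same kind of step could be applied to a neighboring subsystem with a virtual reward in place of `reward`; how that reward is computed is specified by (17) and is not reproduced here.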

#### 5. Computational Experiments

In this section, we conduct computational experiments to examine the learning ability and the performance of the proposed reinforcement learning algorithm (Algorithm 4). We first use a queueing network with four types of jobs as the test bed. A test problem specifies the number of jobs to be scheduled, and 50 instances are generated for each problem. To verify the convergence of the state value function during the learning process, a test problem with 10000 jobs is used. An instance is the whole process of generating a schedule for the 10000 jobs, from the initial state to the state in which all the jobs have passed through the network. The weights of all types of jobs are in the range and the parameters of transferring times and link establishing times are in the range . The transferring times and link establishing times are exponentially distributed. The parameters of the proposed algorithm are set to , , , and . The maximum size of the queues of transfer stations is set to 5.

To investigate the convergence of the state value function, we examine the variation of the state values during the learning process. The experimental results in this section are averaged over 50 instances. Let denote the number of states in the state space of the th subsystem and denote the mean value of all states. is defined as (21), where denotes the state space of the th subsystem and denotes the value of state (). Figure 3 shows the variation of with respect to the number of jobs that have crossed the network. When the number of finished jobs is larger than 3000, decreases slowly and gradually converges to −20.53.

Let denote the value of state at the time when the th job completely passes through the network. Let denote the average absolute difference in the values of all states between the two time points at which two successive jobs completely cross the network. For example, at the time when the th job has completely passed through the network, is defined as

Figure 4 shows the variation of with respect to the number of finished jobs. Although the curves in Figures 3 and 4 both converge asymptotically, the curve in Figure 4 is not as smooth as that in Figure 3. When the number of finished jobs is larger than 3000, is less than 0.2. Figures 3 and 4 show that as the number of finished jobs increases, asymptotically converges to zero, which indicates that the state values gradually stabilize.
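The two convergence indicators discussed above (the mean value over all states of a subsystem, and the mean absolute change of state values between two successive job completions) can be computed from snapshots of the value table. This is a minimal sketch assuming the table is a dictionary from state to value and that both snapshots contain the same states; the sample numbers are made up for illustration.

```python
def mean_state_value(values):
    """Mean of the values of all states in one subsystem's state space."""
    return sum(values.values()) / len(values)

def mean_abs_change(previous, current):
    """Average absolute difference of state values between two snapshots."""
    return sum(abs(current[s] - previous[s]) for s in current) / len(current)

# Hypothetical snapshots taken at two successive job completions.
before = {"s1": -20.0, "s2": -21.0}
after = {"s1": -20.1, "s2": -20.7}
mean_after = mean_state_value(after)
change = mean_abs_change(before, after)
```

A mean absolute change that stays close to zero over many completions is the kind of evidence of stabilization that Figure 4 is described as showing.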

For a given problem, we can draw a “Learning Curve” to examine the learning ability of the proposed reinforcement learning algorithm. In a Learning Curve, the objective function values are averaged over 50 instances. As shown in Figure 5, the horizontal coordinate represents the number of finished jobs and the vertical coordinate represents the time averaged weighted throughput from the initial time to the current time. For example, the point (2000, 0.196) on the Learning Curve indicates that the weighted throughput rate from the initial time to the time when the 2000th job is finished is 0.196. The Learning Curve rises rapidly over the first 3000 jobs and then fluctuates over the later jobs. This curve shows that the RL system learns quickly through interaction and finds a good policy within the first 3000 jobs. Thereafter, the improvement of the control policy gradually slows down.

To validate the adaptability and examine the effectiveness of Algorithm 4, we also randomly generate a more extensive set of test problems. We consider networks with different topology structures, corresponding to test problems with 10, 15, 20, 25, 30, 35, and 40 stations. For each number of stations, the relative arrival intensity index (RAII) takes the values 0.50, 0.75, 1.00, 1.50, and 2.00. The RAII indicates the number of jobs arriving at the arrival stations in these test problems relative to the test problem above: the larger the RAII, the more jobs arrive at the arrival stations. For each test problem, 50 instances are generated. We solve the test problems with three approaches: the ε-greedy approach (Algorithm 4), the completely random routing (CRR) approach, and the purely greedy routing (PGR) approach. The CRR approach and the PGR approach correspond to ε = 1 and ε = 0, respectively. Table 1 shows the time averaged weighted throughput (AWT) of the jobs across the network obtained by the three approaches when 10000 jobs have passed through the network. The results in Table 1 are also averaged over 50 instances. As shown in Table 1, the AWT index increases as the RAII increases. When the RAII is larger than one, the AWT index grows more slowly than when the RAII is less than one, since a high intensity of arriving jobs leads to congestion of the networks. For each test problem, the ε-greedy approach obtains a larger AWT index than the CRR approach and the PGR approach. Table 2 lists the relative AWT values of the ε-greedy approach and the PGR approach. The relative AWT value of an approach on a problem is defined as the AWT index obtained by this approach divided by the AWT index obtained by the CRR approach.
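The relationship between the three approaches can be illustrated with a generic ε-greedy selection rule. This is a textbook sketch, not a reproduction of Algorithm 4: the function name, the action labels, and the action values are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng):
    """Pick an action: explore with probability epsilon, otherwise exploit.

    `action_values` maps each candidate action to its estimated value.
    epsilon = 1 always explores (completely random selection) and
    epsilon = 0 always exploits (purely greedy selection).
    """
    actions = list(action_values)
    if rng.random() < epsilon:  # rng.random() lies in [0, 1)
        return rng.choice(actions)
    return max(actions, key=lambda a: action_values[a])


# Hypothetical estimated values of three candidate next stations.
candidate_values = {"station_a": 0.12, "station_b": 0.31, "station_c": 0.05}
rng = random.Random(0)
greedy_choice = epsilon_greedy(candidate_values, epsilon=0.0, rng=rng)
random_choices = {epsilon_greedy(candidate_values, 1.0, rng) for _ in range(200)}
```

With an intermediate ε, the rule mostly follows the current value estimates while still occasionally trying other actions, which is the usual motivation for preferring it over either extreme.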

As shown in Table 2, the relative AWT values obtained by the PGR approach range from 1.178 to 1.303 and the relative AWT values obtained by the ε-greedy approach range from 1.212 to 1.351. For the test problems with the RAII taking 0.50, 0.75, 1.00, 1.50, and 2.00, respectively, the average relative AWT values obtained by the PGR approach are 1.223, 1.226, 1.249, 1.256, and 1.265, respectively, and the average relative AWT values obtained by the ε-greedy approach are 1.262, 1.269, 1.291, 1.305, and 1.316, respectively. Compared with the CRR approach, the PGR approach and the ε-greedy approach improve the weighted throughput rate by 24.4% and 28.8% on average, respectively. The experimental results show that the ε-greedy approach is superior to both the CRR approach and the PGR approach. They validate the adaptability and robustness of the proposed algorithm for test problems of various scales and different topology structures. They also indicate that the reinforcement learning system learns to select an appropriate action on different occasions, links stations and schedules jobs flexibly in an online environment, and obtains optimized results.
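The average improvements can be checked against the per-RAII averages quoted above. The sketch below uses only those five rounded values per approach; the function names are illustrative.

```python
def relative_awt(awt, awt_crr):
    """Relative AWT: the AWT of an approach divided by the AWT of CRR."""
    return awt / awt_crr


def average_improvement_percent(relative_values):
    """Mean relative AWT expressed as a percentage gain over CRR."""
    mean_relative = sum(relative_values) / len(relative_values)
    return (mean_relative - 1.0) * 100.0


# Average relative AWT values for RAII = 0.50, 0.75, 1.00, 1.50, 2.00.
pgr_relative = [1.223, 1.226, 1.249, 1.256, 1.265]
eps_greedy_relative = [1.262, 1.269, 1.291, 1.305, 1.316]

pgr_gain = average_improvement_percent(pgr_relative)
eps_greedy_gain = average_improvement_percent(eps_greedy_relative)
```

From the rounded averages, the PGR gain comes out at about 24.4% and the ε-greedy gain at about 28.9%. The latter differs slightly from the reported 28.8%, plausibly because the reported figure is computed from unrounded per-problem values.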

#### 6. Conclusions

We decompose the investigated online self-organizing queueing network control problem with time averaged weighted throughput objective into a series of cross-correlated Markov Decision Processes and convert them into a coupled reinforcement learning model. In the reinforcement learning system, maximizing the time averaged weighted throughput of all jobs across the network is equivalent to maximizing the time averaged reward of all subsystems. An online reinforcement learning algorithm with time averaged reward is adopted to solve the reinforcement learning model and obtains a control policy integrating the job routing selection strategy and the job sequencing strategy. Computational experiments verify the convergence of the state value function during the learning process and the effectiveness of the proposed algorithm applied to the self-organizing queueing network control problem. In the test problems, the proposed algorithm improves the weighted throughput rate of the networks by 28.8% on average relative to completely random routing through the online learning process. The experimental results show that the reinforcement learning system adapts to different network topology structures and learns an optimized policy through interaction with the control process.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper. The funding mentioned in the “Acknowledgments” did not lead to any conflicts of interest regarding the publication of this manuscript.

#### Acknowledgments

This work is supported by the Natural Science Foundation of Guangdong Province (no. 2015A030313649, no. 2015A030310274), the Science and Technology Planning Project of Guangdong Province, China (no. 2015A010103021), the Key Platforms and Scientific Research Program of Guangdong Province (Featured Innovative Projects of Natural Science, no. 2015KTSCX137), and the National Natural Science Foundation of China (Grant no. 61703102).