#### Abstract

Chip attach is the bottleneck operation in semiconductor assembly. Chip attach scheduling is, in nature, unrelated parallel machine scheduling subject to practical issues such as machine-job qualification, sequence-dependent setup times, initial machine status, and engineering time. The major scheduling objective is to minimize the total weighted unsatisfied Target Production Volume in the schedule horizon. To apply the Q-learning algorithm, the scheduling problem is converted into a reinforcement learning problem by constructing an elaborate system state representation, actions, and reward function. We select five heuristics as actions and prove the equivalence of the reward function and the scheduling objective function. We also conduct experiments with industrial datasets to compare the Q-learning algorithm, the five action heuristics, and the Largest Weight First (LWF) heuristic used in industry. Experiment results show that Q-learning is remarkably superior to the six heuristics. Compared with LWF, Q-learning reduces three performance measures, objective function value, unsatisfied Target Production Volume index, and unsatisfied job type index, by 80.92%, 52.20%, and 31.81%, respectively.

#### 1. Introduction

Semiconductor manufacturing consists of four basic steps: wafer fabrication, wafer sort, assembly, and test. Assembly and test are back-end steps. Semiconductor assembly contains many operations, such as reflow, wafer mount, saw, chip attach, deflux, EPOXY, cure, and PEVI. The IS factory is a site for back-end semiconductor manufacturing where chip attach is the bottleneck operation in the assembly line. In terms of the Theory of Constraints (TOC), the capacity of a shop floor depends on the capacity of the bottleneck, and a bottleneck operation has a tremendous impact on the performance of the whole shop floor. Consequently, scheduling of the chip attach station has a significant effect on the performance of the assembly line. Chip attach is performed in a station which consists of ten parallel machines; thus, chip attach scheduling is in nature a form of unrelated parallel machine scheduling under certain realistic restrictions.

Research on unrelated parallel machine scheduling focuses on two sorts of criteria: completion time or flow time related criteria and due date related criteria. Weng et al. [1] proposed a heuristic algorithm called “Algorithm 9” to minimize the total weighted completion time with setup consideration. Algorithm 9 was demonstrated to be superior to six heuristic algorithms. Gairing et al. [2] presented an effective combinatorial approximate algorithm for makespan objective. Mosheiov [3] and Mosheiov and Sidney [4] converted an unrelated parallel machine scheduling problem with total flow time objective into polynomial number of assignment problems. The scheduling problem was tackled by solving the derived assignment problems. Yu et al. [5] formulated unrelated parallel machine scheduling problems as mixed integer programming and dealt with them using Lagrangian Relaxation. They examined six measures such as makespan and mean flow time. Promising results were achieved compared with a modified FIFO method.

Besides completion time or flow time related criteria, tardiness objectives are also employed frequently. Dispatching rules are widely applied to production scheduling with a tardiness objective, such as Earliest Due Date (EDD), Shortest Processing Time (SPT), Critical Ratio (CR), Minimal Slack (MS), Modified Due Date (MDD) [6, 7], Apparent Tardiness Cost (ATC) [8, 9], and COVERT [10–12]. More complicated heuristic algorithms and local search methods are also developed. Bank and Werner [13] addressed the problem of minimizing the weighted sum of linear earliness and tardiness penalties in unrelated parallel machine scheduling. They derived some structural properties useful to searching for an approximate solution and proposed various constructive and iterative heuristic algorithms. Liaw et al. [14] found the efficient lower and upper bounds of minimizing the total weighted tardiness by a two-phase heuristics based on the solution to an assignment problem. They also presented a branch-and-bound algorithm incorporating various dominance rules. Kim et al. [15] studied batch scheduling of unrelated parallel machines with a total weighted tardiness objective and setup times consideration. They examined four search heuristics for this problem: the earliest weighted due date, the shortest weighted processing time, the two-level batch scheduling heuristic, and the simulated annealing method.

This paper is concerned with a particular Target Production Volume (TPV) oriented optimization objective. In real production in the IS factory, the planning department determines the TPV of each job type on the chip attach operation in a schedule horizon. Thus, the major objective of chip attach scheduling is to meet TPVs to the fullest extent (see Section 2.1 for details). We apply reinforcement learning (RL), an artificial intelligence method, for this study. We first present a brief introduction to reinforcement learning.

##### 1.1. Learning

Reinforcement learning is a machine learning method proposed to approximately solve large-scale Markov Decision Process (MDP) or Semi-Markov Decision Process (SMDP) problems. Reinforcement learning problem is a model in which an agent learns to select optimal or near-optimal actions for achieving its long-term goals (to maximize the total or average reward) through trial-and-error interactions with dynamic environment. In this paper, we address RL problems of episodic task, that is, problems with a terminal state. Sutton and Barto [16] defined four key elements of RL algorithms: policy, reward function, value function, and model of the environment. A policy determines the agent’s action at each state. A reward function determines the payment on transition from one state to another. A value function specifies the value of a state or a state-action pair in the long run, the expected total reward for an episode. By learning from interaction between the agent and its environment, value-based RL algorithms aim to approximate the optimal state or action value function through iteration and thus find a near-optimal policy. Compared with dynamic programming, RL algorithms do not need to know the transition probability and reduce the computational effort.

Q-learning is one of the most widely applied RL algorithms based on value iteration. Q-learning was first proposed by Watkins [17]. Convergence results of tabular Q-learning were obtained by Watkins and Dayan [18], Jaakkola et al. [19], and Tsitsiklis [20]. Bertsekas and Tsitsiklis [21] demonstrated that Q-learning produces the optimal policy in discounted reward problems under certain conditions. Q-learning uses $Q(s,a)$, called the Q-value, to represent the value of a state-action pair. The optimal Q-value, $Q^*(s,a)$, is defined as follows:

$$Q^*(s,a) = \sum_{s' \in S} p(s,a,s') \left[ r(s,a,s') + \gamma V^*(s') \right], \tag{1}$$

where $S$ denotes the state space, $p(s,a,s')$ denotes the transition probability from $s$ to $s'$ taking action $a$, $r(s,a,s')$ denotes the reward on transition from $s$ to $s'$ taking action $a$, $\gamma$ is a discount factor, and $V^*$ is the optimal state value function.

In terms of the Bellman optimality equation, the following holds for arbitrary $s \in S$, where $A(s)$ denotes the set of actions available for state $s$:

$$V^*(s) = \max_{a \in A(s)} Q^*(s,a). \tag{2}$$

From (1) and (2), the following equation holds:

$$Q^*(s,a) = \sum_{s' \in S} p(s,a,s') \left[ r(s,a,s') + \gamma \max_{a' \in A(s')} Q^*(s',a') \right]. \tag{3}$$

Equation (3) is the basic transformation of the Q-learning algorithm. The step-size version of Q-learning is

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[ r(s,a,s') + \gamma \max_{a' \in A(s')} Q(s',a') \right], \tag{4}$$

where $\alpha$ is the learning rate. Using historical samples or simulation experiments, Q-learning obtains a near-optimal policy by driving the action-value function, $Q(s,a)$, towards the optimal action-value function, $Q^*(s,a)$, through iteration based on formula (4).
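To make the step-size update concrete, the following is a minimal Python sketch of tabular Q-learning on a toy episodic task. The chain environment, the function names `q_update` and `run_chain`, and the parameter values are illustrative assumptions and are not part of the scheduling model in this paper.

```python
import random
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions_next, alpha=0.1, gamma=0.9):
    """Apply the step-size Q-learning update to Q[(s, a)] and return the new value."""
    # Value of the best action in the next state; 0 if the next state is terminal.
    best_next = max((Q[(s_next, b)] for b in actions_next), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


def run_chain(episodes=500, seed=0):
    """Toy task (assumption): states 0..3, state 3 is terminal with reward 1 on entry."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    actions = ["stay", "right"]
    for _ in range(episodes):
        s = 0
        for _ in range(50):  # cap on episode length
            a = rng.choice(actions)  # purely exploratory behaviour policy
            s_next = s + 1 if a == "right" else s
            r = 1.0 if s_next == 3 else 0.0
            next_actions = [] if s_next == 3 else actions
            q_update(Q, s, a, r, s_next, next_actions)
            s = s_next
            if s == 3:
                break
    return Q
```

Because the update only needs sampled transitions, no transition probabilities appear in the code; this is the practical advantage over dynamic programming mentioned earlier.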

Recently, RL has drawn attention in production scheduling. S. Riedmiller and M. Riedmiller [22] used Q-learning to solve a stochastic and dynamic job shop scheduling problem with the overall tardiness objective. Some typical heuristic dispatching rules, SPT, LPT, EDD, and MS, were chosen as actions and compared with the Q-learning method. Aydin and Öztemel [23] applied a Q-learning algorithm to minimize the mean tardiness of dynamic job shop scheduling. Their results showed that the RL-scheduling system outperformed the use of each of the three rules (SPT, COVERT, and CR) individually with the mean tardiness objective in most of the testing cases. Hong and Prabhu [24] formulated the setup minimization problem (minimizing the sum of due date deviation and setup cost) in JIT manufacturing systems as an SMDP and solved it by a tabular Q-learning method. Experiment results showed that Q-learning algorithms achieved significant performance improvement over usual dispatching rules such as EDD in complex real-time shop floor control problems for JIT production. Wang and Usher [25] applied Q-learning to select dispatching rules for the single machine scheduling problem. Csáji et al. [26] proposed an adaptive iterative distributed scheduling algorithm operated in a market-based production control system, where every machine and job is associated with its own software agent. Singh et al. [27] proposed an online reinforcement learning algorithm for call admission control. The approach optimized the SMDP performance criterion with respect to a family of parameterized policies. Multi-agent reinforcement learning systems have also been applied to scheduling or control problems, for example, by Kaya and Alhajj [28], Paternina-Arboleda and Das [29], Mariano-Romero et al. [30], Vengerov [31], and Iwamura et al. [32].

Applications of RL algorithms to scheduling problems have not been thoroughly explored in prior studies. In this study, we employ the Q-learning algorithm to solve the chip attach scheduling problem and achieve substantially better experimental results than six heuristic algorithms. The remainder of this paper is organized as follows. We describe the problem and convert it into an RL problem explicitly in Section 2, present the RL algorithm in Section 3, conduct the computational experiments and analysis in Section 4, and draw conclusions in Section 5.

#### 2. RL Formulation

##### 2.1. Problem Statement

The scheduling problem considered in this paper is described as follows. The work station for chip attach operation consists of parallel machines and processes types of jobs. The bigger the weight of a job type is, the more important it is. Each job needs to be processed on one machine only and one machine processes at most one job at a time. Any job type (say, ) is only allowed to be processed on a subset of the parallel machines. The jobs of the same type have a deterministic processing time (; ) if they are processed on machine . The machines are unrelated; that is, is independent of for all jobs and all machines . The production is lot based. Normally, one lot contains more than 1000 units. Thus, the processing time is the time for processing one lot and processing is nonpreemptive (i.e., once a machine starts processing one lot, it cannot process another one until it completely processes this lot). Setup time between job type and is (, ). The setup times are deterministic and sequence dependent. Trivially, holds for arbitrary () and holds for arbitrary ().

The usage of a machine is considered to be in one of two categories: engineering time (e.g., maintenance time) and production time. We only need to schedule the production in production time, the total available time in a schedule horizon deducting the engineering time. Production time is divided into initial production time and normal production time. We consider the initial machine status in the schedule horizon. If a machine is processing a lot, called “initial lot,” at the beginning of a schedule horizon, it is not allowed to process any other lot until it completely processes the remaining units in the initial lot (called initial volume). The time for processing the unprocessed initial volume in the initial lot is called “initial production time.” Since the production of nonbottleneck operations is determined by the bottleneck operation, we assume that the jobs are always available for processing on chip attach operation when they are needed.

The primary objective of chip attach scheduling is to minimize the total weighted unsatisfied TPV of a schedule horizon. Since semiconductor manufacturing equipment is very expensive, machine utilization should be kept at a high level. Hence, on the premise that TPVs of all job types are entirely satisfied, the secondary objective of chip attach scheduling is to process as much weighted excess volume as possible to relieve the burden of the next schedule horizon. The objective function is formulated as follows: where () is the weight per unit of job type , () is the predetermined TPV of job type (including the initial volume in the initial lots), and () is the processed volume of job type . can be represented as follows: where denotes the initial volume in the initial lot processed by machine at the beginning of the schedule horizon, is lot size, and

Calculation of is rate based, interpreted as follows. Suppose machine processes lot (belonging to job type ), proceeding lot (belonging to job type ). Let denote the start time of setup for ; then, the completion time of is . Let denote the increase of processed volume of job type because of processing on machine from time to , defined as follows: is a positive number which is large enough. is set following the next inequality:

For an optimal schedule minimizing objective function (5), if (9) holds and there exists () such that , then where denotes the unprocessed volume in the last lot processed by machine at the end of this schedule horizon (i.e., the initial volume of the next schedule horizon) and

According to inequality (9), in any schedule minimizing objective function (5), no machine will process a lot belonging to a job type whose TPV has been satisfied until the TPVs of all other job types are also fully satisfied. In other words, inequality (9) guarantees that the objective function takes minimization of the total weighted unsatisfied TPV (the first term of objective function (5)) as the first priority. The fundamental problem in applying reinforcement learning to scheduling is to convert scheduling problems into RL problems, including representation of state, construction of actions, and definition of the reward function.

##### 2.2. State Representation and Transition Probability

We first define the state variables. State variables describe the major characteristics of the system and are capable of tracking the change of the system status. The system state can be represented by the vector where denotes the job type of which the latest lot completely processed on machine , denotes the job type of which the lot being processed on machine ( equals zero if machine is idle), , denotes the time starting from the beginning of the latest setup on machine (for convenience, we assume that there is a zero-time setup if ), is unsatisfied TPV (i.e., ), and represents the unscheduled normal production time of machine .

Considering the initial status of machines, the initial system state of the schedule horizon is where denotes the overall available time in the schedule horizon, denotes the initial production time of machine , and denotes the engineering time of machine .

There are two kinds of events triggering state transitions: (1) completion of processing a lot on one or more machines; (2) any machine’s normal production time is entirely scheduled. If the triggering event is completion of processing, the state at the decision-making epoch is represented as where . If (machine is idle), then . If the triggering event is using up a machine’s normal production time, then .

Assume that after taking action , the system state immediately transfers from to an interim state, , as follows: where for all ; that is, all machines are busy.

Let denote the sojourn time at state ; then, , . Let ; then, the state at the next decision-making epoch is represented as where Apparently we have , where denotes the one-step transition probability from state to state under action . Let and denote the system state and time, respectively, at the th decision-making epoch. It is easy to show that where is the sojourn time at state . That is, the decision process associated with is a Semi-Markov Decision Process with particular transition probability and sojourn times. The terminal state of an episode is

##### 2.3. Action

Prior domain knowledge can be utilized to fully exploit the agent’s learning ability. Clearly, an optimal schedule must be nonidle (i.e., no machine has idle time during the whole schedule). More than one machine may be free at the same decision-making epoch. An action determines which lot is to be processed on which machine. In the following, we define seven actions using heuristic algorithms.

*Action 1. *Select jobs by WSPT heuristics as follows.

*Algorithm 1. *WSPT heuristics.

*Step 1. *Let SM denote the set of free machines at a decision-making epoch.

*Step 2. *Choose machine to process job type , with , and .

*Step 3. *Remove from SM. If , go to Step 2; otherwise, the algorithm halts.
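As an illustration of the loop structure of Algorithm 1, the Python sketch below assigns one job type to each free machine. The selection criterion used here, the largest ratio of weight to processing time over qualified machine-job pairs, is the textbook WSPT rule and is an assumption about the exact expression in Step 2; the function name `wspt_assign` and the dictionary layout are hypothetical.

```python
def wspt_assign(free_machines, weights, proc_time):
    """Greedy WSPT-style assignment (sketch).

    weights:   {job_type: weight}
    proc_time: {(machine, job_type): lot processing time}; only qualified pairs are present
    Returns {machine: job_type} for the free machines that could be assigned.
    """
    remaining = set(free_machines)  # plays the role of the set SM
    assignment = {}
    while remaining:
        # Qualified pairs whose machine is still free.
        pairs = [(i, j) for (i, j) in proc_time if i in remaining]
        if not pairs:
            break
        # Assumed criterion: largest weight per unit of processing time.
        i, j = max(pairs, key=lambda ij: weights[ij[1]] / proc_time[ij])
        assignment[i] = j
        remaining.remove(i)  # Step 3: remove the machine from SM
    return assignment
```

For example, with weights `{"A": 3.0, "B": 1.0}` and processing times `{(1, "A"): 2.0, (1, "B"): 1.0, (2, "B"): 0.5}`, machine 2 is assigned job type B first and machine 1 then takes job type A.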

*Action 2. *Select jobs by MWSPT (modified WSPT) heuristics as follows.

*Algorithm 2. *MWSPT heuristics.

*Step 1. *Define SM as Step 1 in Algorithm 1, and let SJ denote the set of job types whose TPVs have not been satisfied at a decision-making epoch; that is, . If , go to Step 4.

*Step 2. *Choose job type to process on machine , with , and .

*Step 3. *Remove from SM. Set and update SJ. If and , go to Step 2; if and , go to Step 4; otherwise, the algorithm halts.

*Step 4. *Choose machine to process job type , with , and .

*Step 5. *Remove from SM. If , go to Step 4; otherwise, the algorithm halts.
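The two-phase structure of Algorithm 2 can be sketched as follows: job types with unsatisfied TPV are preferred, and the plain WSPT choice is used once none remain. The ratio criterion, the decrement of unsatisfied TPV by one lot per assignment, and the fallback for machines that are not qualified for any unsatisfied job type are simplifying assumptions; `mwspt_assign` and its arguments are hypothetical names.

```python
def mwspt_assign(free_machines, weights, proc_time, unsatisfied, lot_size):
    """MWSPT-style assignment (sketch).

    unsatisfied: {job_type: unsatisfied TPV in units}
    Other arguments as in a plain WSPT assignment; returns {machine: job_type}.
    """
    remaining = set(free_machines)  # the set SM
    unsat = dict(unsatisfied)       # local copy so the caller's data is not modified
    assignment = {}
    while remaining:
        open_jobs = {j for j, v in unsat.items() if v > 0}  # the set SJ
        pairs = [(i, j) for (i, j) in proc_time if i in remaining]
        preferred = [(i, j) for (i, j) in pairs if j in open_jobs]
        pool = preferred or pairs   # fall back to all qualified pairs when SJ gives nothing
        if not pool:
            break
        i, j = max(pool, key=lambda ij: weights[ij[1]] / proc_time[ij])
        assignment[i] = j
        remaining.remove(i)
        unsat[j] = unsat.get(j, 0) - lot_size  # one more lot of job type j is scheduled
    return assignment
```

Compared with plain WSPT, a heavy job type whose TPV is already met no longer attracts machines while lighter job types are still short of their targets.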

*Action 3. *Select jobs by Ranking Algorithm (RA) as follows.

*Algorithm 3. *Ranking Algorithm.

*Step 1. *Define SM and SJ as Step 1 in Algorithm 2. If , go to Step 5.

*Step 2. *For each job type , sort the machines in increasing order of () , where is defined as follows.
Let denote the order of machine for job type .

*Step 3. *Choose job to process on machine , with . If there exist two or more machine-job combinations (say, machine-job combination , , , ) with the same minimal order; that is, holds for , then choose job type to process on machine , with .

*Step 4. *Remove or from SM. Set or and update SJ. If and , go to Step 3; if and , go to Step 5; otherwise, the algorithm halts.

*Step 5. *Choose job to process on machine , with . If there exist two or more machine-job combinations (say, machine-job combinations , ,, ) with the same minimal order, choose job type to process on machine , with .

*Step 6. *Remove or from SM. If , go to Step 5; otherwise, the algorithm halts.

*Action 4. *Select jobs by LFM-MWSPT heuristics as follows.

*Algorithm 4. *LFM-MWSPT heuristics.

*Step 1. *Define SM and SJ as Step 1 in Algorithm 2.

*Step 2. *Select a free machine (say, ) from SM by LFM (Least Flexible Machine; see [33]) rule and choose a job type to process on machine following MWSPT heuristics.

*Step 3. *Remove from SM. If , go to Step 2; otherwise, the algorithm halts.
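The machine-selection step of the LFM-based actions can be sketched in Python as below. It assumes that the least flexible machine is the free machine qualified for the fewest job types, with ties broken by machine index; the exact rule is the one given in [33], and the function names and the `qualified` mapping are hypothetical.

```python
def least_flexible_machine(free_machines, qualified):
    """Return the free machine that can process the fewest job types (assumed LFM rule)."""
    return min(free_machines, key=lambda i: (len(qualified[i]), i))


def lfm_order(free_machines, qualified):
    """Order in which free machines would pick jobs under the assumed LFM rule."""
    remaining = list(free_machines)
    order = []
    while remaining:
        i = least_flexible_machine(remaining, qualified)
        order.append(i)
        remaining.remove(i)
    return order
```

Serving the least flexible machine first keeps the more versatile machines available for job types that few machines can process.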

*Action 5. *Select jobs by LFM-RA heuristics as follows.

*Algorithm 5. *LFM-RA heuristics.

*Step 1. *Define SM and SJ as Step 1 in Algorithm 2.

*Step 2. *Select a free machine (say, ) from SM by LFM rule and choose a job type to process on machine following Ranking Algorithm.

*Step 3. *Remove from SM. If , go to Step 2; otherwise, the algorithm halts.

*Action 6. *Each free machine selects the same job type as the latest one it processed.

*Action 7. *Select no job.

At the start of a schedule horizon, the system is at initial state . If there are free machines, they select jobs to process by taking one of Actions 1–6; otherwise, Action 7 is chosen. Afterwards, when any machine completes processing a lot or any machine’s normal production time is completely scheduled, the system transfers into a new state, . The agent selects an action at this decision-making epoch and the system state transfers into an interim state, . When, again, any machine completes processing a lot or any machine’s normal production time is used up, the system transfers into the next decision-making state and the agent receives reward , which is computed from and the sojourn time between the two transitions into and (as shown in Section 2.4). The previous procedure is repeated until a terminal state is attained. An episode is a trajectory from the initial state to a terminal state of a schedule horizon. Action 7 is available only at the decision-making states when all machines are busy.
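The event-driven progression between decision-making epochs can be sketched as a small Python helper that finds the earliest triggering event. The two dictionaries of event times and the function name `next_decision_epoch` are hypothetical; the sketch only illustrates how the sojourn time arises from the two kinds of triggering events.

```python
def next_decision_epoch(now, completion_time, capacity_end):
    """Return (time, triggers) of the earliest event after `now`, or None if there is none.

    completion_time: {machine: time at which its current lot completes}
    capacity_end:    {machine: time at which its normal production time is used up}
    """
    events = []
    for i, t in completion_time.items():
        if t > now:
            events.append((t, i, "lot_completed"))
    for i, t in capacity_end.items():
        if t > now:
            events.append((t, i, "production_time_used_up"))
    if not events:
        return None
    t_next = min(e[0] for e in events)
    # Several machines may trigger at the same instant.
    triggers = sorted((i, kind) for t, i, kind in events if t == t_next)
    return t_next, triggers
```

The sojourn time at the interim state is then `t_next - now`, and the machines listed in `triggers` are the ones for which the next action has to make a decision.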

##### 2.4. Reward Function

A reward function should follow several principles. It indicates the immediate impact of an action on the schedule; that is, it links the action with an immediate reward. Moreover, the accumulated reward reflects the objective function value; that is, the agent receives a large total reward for a small objective function value.

*Definition 6 (reward function). *Let denote the number of decision-making epoch during an episode, () the time at the th decision-making epoch, (, ) the job type of the lot which machine processes during time interval , the job type of the lot which precedes the lot machine processes during time interval , and the processed volume of job type by time . It follows that
where is an indicator function defined as

Let denote the reward function at the th decision-making epoch. is defined as

The reward function has the following property.

Theorem 7. *Maximization of the total reward in an episode is equivalent to minimization of objective function (5).*

*Proof. *The total reward in an episode is

It is easy to show that

It follows that
where and . Since is a constant, it follows that

#### 3. The Reinforcement Learning Algorithm

The chip attach scheduling problem is converted into an RL problem with a terminal state in Section 2. To apply Q-learning to this RL problem, another issue arises: how to tailor the Q-learning algorithm to this particular context. Since some state variables are continuous, the state space is infinite. This RL system is not in tabular form, and it is impossible to maintain Q-values for all state-action pairs. Thus, we use a linear function with a gradient-descent method to approximate the Q-value function. Q-values are represented as a linear combination of a set of basis functions, $\phi_k(s)$, as shown in the next formula:

$$Q(s,a) = \sum_{k} \theta_k^a \phi_k(s),$$

where $\theta_k^a$ are the weights of the basis functions. Each state variable corresponds to a basis function. The following basis functions are defined to normalize the state variables:

Let denote the vector of weights of basis functions as follows:

The RL algorithm is presented as Algorithm 8, where $\alpha$ is the learning rate, $\gamma$ is a discount factor, $e^a$ is the vector of eligibility traces for action $a$, $\delta^a$ is an error variable for action $a$, and $\lambda$ is a factor for updating eligibility traces.

*Algorithm 8. *Q-learning with linear gradient-descent function approximation for chip attach scheduling.

Initialize and randomly. Set parameters , , and .

Let num_episode denote the number of episodes having been run. Set num_episode = 0.

While num_episode < MAX_EPISODE do

Set the current decision-making state .

While at least one of state variables is larger than zero do

Select action for state by $\varepsilon$-greedy policy.

Implement action $a$. Determine the next event for triggering state transition and the sojourn time.

Once any machine completes processing a lot or any machine’s normal production time is completely scheduled, the system transfers into a new decision-making state, is a component of ).

Compute reward .

Update the vector of weights in the approximate Q-value function of action $a$:

Set .

If holds for all , set num_episode = num_episode + 1.
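The core of Algorithm 8 can be sketched in plain Python as follows. This is a generic Q-learning agent with linear function approximation, accumulating eligibility traces, and an epsilon-greedy policy in the style of Sutton and Barto [16]; it is not the authors' implementation. The class name `LinearQ`, the zero initialization of weights (Algorithm 8 initializes them randomly), the use of a single error term for all actions, and the default parameter values are assumptions made for illustration.

```python
import random


class LinearQ:
    """Q(s, a) approximated as a dot product of per-action weights and state features."""

    def __init__(self, n_features, actions, alpha=0.05, gamma=0.95, lam=0.8,
                 epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.actions = list(actions)
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.theta = {a: [0.0] * n_features for a in self.actions}  # weights per action
        self.trace = {a: [0.0] * n_features for a in self.actions}  # eligibility traces

    def q(self, phi, a):
        return sum(w * x for w, x in zip(self.theta[a], phi))

    def select(self, phi):
        """Epsilon-greedy action selection on the feature vector phi."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q(phi, a))

    def start_episode(self):
        for a in self.actions:
            self.trace[a] = [0.0] * len(self.trace[a])

    def update(self, phi, a, reward, phi_next, terminal):
        """One gradient-descent step on the temporal-difference error; returns the error."""
        if terminal:
            target = reward
        else:
            target = reward + self.gamma * max(self.q(phi_next, b) for b in self.actions)
        delta = target - self.q(phi, a)
        for b in self.actions:  # decay every trace
            self.trace[b] = [self.gamma * self.lam * e for e in self.trace[b]]
        # Accumulate the gradient of Q(s, a), which is phi for a linear approximator.
        self.trace[a] = [e + x for e, x in zip(self.trace[a], phi)]
        for b in self.actions:
            self.theta[b] = [w + self.alpha * delta * e
                             for w, e in zip(self.theta[b], self.trace[b])]
        return delta
```

In the scheduling setting, `phi` would hold the normalized state variables and `actions` the seven actions of Section 2.3, with `start_episode` called at the beginning of each schedule horizon.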

#### 4. Experiment Results

In the past, the company used a manual process to conduct chip attach scheduling. A heuristic algorithm called Largest Weight First (LWF) was used as follows.

*Algorithm 9 (Largest Weight First (LWF) heuristics). *Initialize SM with the set of all machines (i.e., ) and define SJ as Step 1 in Algorithm 2. Initialize ) with each machine’s normal production time. Set , where is the initial production volume of job type .

*Step 1. *Schedule the job types in decreasing order of weights in order to meet their TPVs. While and do Choose job with = argmax. While and do Choose machine to process job , with = argmin. If , then set , , and remove from SM; else, set , and .

*Step 2. *Allocate the excess production capacity. If , then For each machine ), Choose job with , set , .
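The two steps of the LWF heuristic can be illustrated with the simplified Python sketch below. It ignores setup times, initial lots, and lot granularity, assumes that the machine chosen for a job type is the qualified machine with the shortest processing time, and assumes that leftover capacity goes to the job type with the largest weight per unit of processing time; these choices and the name `lwf` are assumptions, not the exact rules of Algorithm 9.

```python
def lwf(weights, tpv, proc_time, capacity, lot_size=1.0):
    """Largest Weight First allocation (simplified sketch).

    proc_time: {(machine, job_type): time per lot}, qualified pairs only
    capacity:  {machine: available normal production time}
    Returns (produced volume per job type, unsatisfied TPV per job type).
    """
    eps = 1e-9
    cap = dict(capacity)
    unmet = dict(tpv)
    produced = {j: 0.0 for j in tpv}
    # Step 1: serve job types in decreasing order of weight.
    for j in sorted(tpv, key=lambda jt: weights[jt], reverse=True):
        while unmet[j] > eps:
            machines = [i for i in cap if cap[i] > eps and (i, j) in proc_time]
            if not machines:
                break
            i = min(machines, key=lambda m: proc_time[(m, j)])
            rate = lot_size / proc_time[(i, j)]    # units per unit of time
            volume = min(unmet[j], cap[i] * rate)  # limited by TPV or by machine time
            cap[i] -= volume / rate
            unmet[j] -= volume
            produced[j] += volume
    # Step 2: allocate any excess capacity.
    for i in cap:
        options = [j for j in tpv if (i, j) in proc_time]
        if cap[i] > eps and options:
            j = max(options, key=lambda jt: weights[jt] / proc_time[(i, jt)])
            produced[j] += cap[i] * lot_size / proc_time[(i, j)]
            cap[i] = 0.0
    return produced, unmet
```

Because LWF commits each machine to one job type for as long as possible, it tends to incur few setups, which is consistent with the setup-time comparison reported later in this section.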

The chip attach station consists of 10 machines and normally processes more than ten job types. We selected 12 sets of industrial data for experiments comparing the Q-learning algorithm (Algorithm 8) and the six heuristics (Algorithms 1–5, 9): WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. For each dataset, Q-learning repeatedly solves the scheduling problem 1000 times and selects the best schedule of the 1000 solutions. Table 1 shows the objective function values of all datasets using the seven algorithms. Individually, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA obtains a larger objective function value than LWF for every dataset. Nevertheless, taking WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA as actions, the Q-learning algorithm achieves an objective function value much smaller than that of LWF for each dataset. In Tables 1–4, the bottom row presents the average value over all datasets. As shown in Table 1, the average objective function value of Q-learning is only 12.233, less than that of LWF, 66.147, by 80.92%.

Besides the objective function value, we propose two indices, the unsatisfied TPV index and the unsatisfied job type index, to measure the performance of the seven algorithms. The unsatisfied TPV index (UPI) is defined as formula (32) and indicates the weighted proportion of unfinished Target Production Volume. Table 2 compares the UPIs of all datasets using the seven algorithms. Again, each of WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA individually obtains a larger UPI than LWF for each dataset. However, the Q-learning algorithm achieves a smaller UPI than LWF does for each dataset. The average UPI of Q-learning is only 0.0402, less than that of LWF, 0.0842, by 52.20%. Let denote the set . The unsatisfied job type index (UJTI) is defined as formula (33) and indicates the weighted proportion of the job types whose TPVs are not completely satisfied. Table 3 compares the UJTIs of all datasets using the seven algorithms. With most datasets, the Q-learning algorithm achieves smaller UJTIs than LWF. The average UJTI of Q-learning is 0.0802, which is less than that of LWF, 0.1176, by 31.81%.
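A possible reading of the two indices is sketched in Python below, based only on their verbal descriptions: UPI as the weighted unsatisfied TPV divided by the weighted total TPV, and UJTI as the total weight of job types with unsatisfied TPV divided by the total weight. The exact formulas (32) and (33) may differ in detail, and the function names are hypothetical.

```python
def upi(weights, tpv, produced):
    """Weighted proportion of unfinished TPV (assumed form of the UPI)."""
    unmet = sum(weights[j] * max(tpv[j] - produced[j], 0.0) for j in tpv)
    total = sum(weights[j] * tpv[j] for j in tpv)
    return unmet / total


def ujti(weights, tpv, produced):
    """Weighted proportion of job types whose TPV is not fully met (assumed form of the UJTI)."""
    unsatisfied = [j for j in tpv if produced[j] < tpv[j]]
    return sum(weights[j] for j in unsatisfied) / sum(weights[j] for j in tpv)


def relative_reduction(baseline, value):
    """Fractional reduction of `value` with respect to `baseline`."""
    return (baseline - value) / baseline
```

Applied to the reported averages, `relative_reduction(0.1176, 0.0802)` gives about 0.318, in line with the stated 31.81% for UJTI. For the objective function, the averages 66.147 and 12.233 give about 0.815, so the reported 80.92% may be an average of per-dataset reductions rather than the reduction of the averaged values.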

Table 4 shows the total setup time of all datasets using the seven algorithms. For reasons of commercial confidentiality, we use normalized data, with the setup time of each dataset divided by the result of that dataset using Q-learning. Thus, the total setup times of all datasets by Q-learning are converted into one, and the data of the six heuristics are adjusted accordingly. The Q-learning algorithm requires more than twice the setup time of LWF for each dataset. The average accumulated setup time of LWF is only 41.58 percent of that of Q-learning.

The previous experimental results reveal that, over the whole scheduling task, each of the five action heuristics (WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA) for Q-learning individually performs worse than the LWF heuristic. However, Q-learning greatly outperforms LWF in terms of the three performance measures: the objective function value, UPI, and UJTI. This demonstrates that some action heuristics provide better actions than the LWF heuristic at some states. While repeatedly solving the scheduling problem, the Q-learning system gains insight into the scheduling problem automatically and adjusts its actions towards the optimal ones for different system states. The actions at all states form a new optimized policy which is different from any policy following an individual action heuristic or the LWF heuristic. That is, Q-learning incorporates the merits of the five alternative heuristics, uses them to schedule jobs flexibly, and obtains results much better than any individual action heuristic or the LWF heuristic. In the experiments, Q-learning achieves high-quality schedules at the cost of inducing more setup time. In other words, Q-learning utilizes the machines more efficiently by increasing conversions among a variety of job types.

#### 5. Conclusions

We apply Q-learning to study lot-based chip attach scheduling in back-end semiconductor manufacturing. To apply reinforcement learning to scheduling, the critical issue is the conversion of scheduling problems into RL problems. We convert the chip attach scheduling problem into a particular SMDP problem by Markovian state representation. Five heuristic algorithms, WSPT, MWSPT, RA, LFM-MWSPT, and LFM-RA, are selected as actions so as to utilize prior domain knowledge. The reward function is directly related to the scheduling objective function, and we prove that maximizing the accumulated reward is equivalent to minimizing the objective function. Gradient-descent linear function approximation is combined with the Q-learning algorithm.

Q-learning exploits the inherent structure of the scheduling problem by solving it repeatedly. It learns a domain-specific policy from the experienced episodes through interaction and then applies it to later episodes. We define two indices, the unsatisfied TPV index (UPI) and the unsatisfied job type index (UJTI), together with the objective function value to measure the performance of Q-learning and the heuristics. Experiments with industrial datasets show that Q-learning clearly outperforms six heuristic algorithms: WSPT, MWSPT, RA, LFM-MWSPT, LFM-RA, and LWF. Compared with LWF, Q-learning reduces the objective function value, UPI, and UJTI by averages of 80.92%, 52.20%, and 31.81%, respectively. With Q-learning, chip attach scheduling is optimized through increasing effective job type conversions.

#### Disclosure

Given the sensitive and proprietary nature of the semiconductor manufacturing environment, we use normalized data in this paper.

#### Acknowledgments

This project is supported by the National Natural Science Foundation of China (Grant no. 71201026), Science and Technological Program for Dongguan’s Higher Education, Science and Research, and Health Care Institutions (no. 2011108102017), and Humanities and Social Sciences Program of Ministry of Education of China (no. 10YJC630405).