Abstract

The slack-based algorithms are popular bin-focused heuristics for the bin packing problem (BPP). Existing methods select slacks only according to predetermined policies and ignore the dynamic exploration of the global data structure, so the information in the data space is not fully utilized. In this paper, we propose a novel slack-based flexible bin packing framework, called the reinforced bin packing framework (RBF), for the one-dimensional BPP. RBF simultaneously considers the RL-system, the instance-eigenvalue mapping process, and the reinforced-MBS strategy. In our work, the slack is generated with a reinforcement learning strategy, in which performance-driven rewards capture the intuition of learning the current state of the container space, the action is the choice of the packing container, and the state is the remaining capacity after packing. During the construction of the slack, an instance-eigenvalue mapping process is designed to generate a representative and classified validation set. Furthermore, the resulting slack coefficient is integrated into the MBS-based packing process. Experimental results show that, in comparison with fit algorithms, MBS and MBS’, RBF achieves state-of-the-art performance on the BINDATA and SCH_WAE datasets. In particular, it outperforms its baselines MBS and MBS’, increasing the average number of optimal solutions by 189.05% and 27.41%, respectively.

1. Introduction

As a classical discrete combinatorial optimization problem [1, 2], the bin packing problem (BPP) [3, 4] aims to minimize the number of bins used to pack a set of items and is NP-hard [5, 6]. In the past few decades, four main families of approaches have been studied to solve the BPP: exact approaches [7–9], approximation algorithms [4, 10], heuristic algorithms, and metaheuristic algorithms [11, 12]. Exact algorithms typically prune the search with lower-bound information, which makes them suitable for small-scale instances. As the scale of the datasets increases, the BPP becomes challenging for approximation algorithms. Metaheuristic algorithms are difficult to apply due to their rigorous requirements on parameter tuning and their computational complexity [13]. In contrast, heuristic algorithms are popular bin packing methods because of their efficiency in solving NP-hard problems.

As a typical heuristic algorithm, minimum bin slack (MBS) is particularly useful for problems in which an optimal solution requires most of the bins, if not all, to be exactly filled [14]. It is also useful for problems in which the sum of the item requirements is less than or equal to twice the bin capacity. In MBS, the packing sequence of the items is selected by a predetermined strategy, which ignores the sampling deviation among the items to be packed and cannot explore the global data space.

Therefore, the MBS algorithm may quickly fall into locally optimal solutions because it ignores the exploration of the global item space during the training process. During iterative training, the deviation of the locally optimal solution accumulates continuously, and the search steadily drifts away from the globally optimal solution space. This may result in a significant gap between the algorithm’s packing result and the optimal solution and thus in failure to achieve the desired performance [14].

In order to solve the problems of MBS described above, we propose a reinforced bin packing framework, dubbed RBF, to resolve the BPP, where a reinforcement learning (RL) method, i.e., the Q-learning algorithm, is exploited to select a high-quality slack for the packing process. The RBF treats Q-learning as a prior data spatial information detector. To ingeniously select data samples as representatives of the datasets, it explores an intrinsic spatial distribution of sample bins by interacting with the environment and estimating the optimal slack of the global bins. The learned slacks are finally exploited in the improved MBS algorithm to pack items.

The proposed RBF can be distinguished from previous work by the following characteristics:
(1) A reinforcement learning algorithm is exploited to generate the slack automatically, and the slack is further integrated into the MBS algorithm. With high-quality slacks generated automatically, rather than by manual design or empirical speculation, our method prevents the bin packing process from falling into a locally optimal solution, which is particularly challenging for large-scale datasets.
(2) An instance-eigenvalue mapping function is introduced to efficiently select a representative and classified validation set from the input instances based on their similarity. This enables RBF to reduce the learning cost while generating a dynamic slack during the packing process.

The rest of this paper is organized as follows. Related work is presented in Section 2. The formulation of the BPP is depicted in Section 3. In Section 4, we briefly overview the design of the RBF and then detail its key components, such as the RL-system, the reinforced-MBS strategy, and the instance mapping process. Experimental results and theoretical analyses are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. Related Work

Existing methods that address the BPP can be roughly classified into four major categories: exact approaches [15–20], approximation algorithms [4, 21, 22], heuristic algorithms [14, 23–27], and metaheuristic algorithms [11–13, 28–30]. In recent years, RL-based methods [31–35] have also been proposed to resolve the BPP.

2.1. Exact Approaches

The exact approaches establish a mathematical model and obtain the optimal solution by solving this model with optimization algorithms. CPLEX [36] solves such models with mixed integer programming. Polyakovskiy and M’Hallah [15] characterized the properties of the two-dimensional nonoriented BPP with due dates, which packs a set of rectangular items, and experimentally showed that a tight lower bound improves an existing bound on maximum lateness for 24.07% of the benchmark instances. Since the quality of the solution depends on whether the model is reasonable, these methods are only applicable to small-scale instances.

Subsequent improvements focused on reconsidering the constraints in a novel manner. Chitsaz et al. [18] proposed an algorithm that separates the subtour elimination constraints of fractional solutions to solve production, inventory, and inbound transportation decision problems; the inequalities and separation procedures were used in a branch-and-cut algorithm. A similar idea was proposed in Mara’s work [20], where an exact algorithm based on the classic ε-constraint method was introduced. The method addresses N single-objective problems by using reduction with test sets instead of an optimizer. Another classic method in this group is the arc-flow formulation [9], which represents all packing patterns in a very compact graph based on an arc-flow formulation with side constraints and can be solved exactly by general-purpose mixed integer programming solvers. Generally, as the scale of the problem grows, the phenomenon of “combinatorial explosion” leads to heavy computational overhead in the optimization process, so it is difficult to apply exact algorithms to large-scale combinatorial optimization problems.

2.2. Approximation Algorithms

Approximation algorithms are popular because their time complexity is polynomial, although they do not guarantee finding the optimal solution. Typical approximation algorithms include greedy algorithms and local search. Motivated by the observation scheduling of Earth observation satellites, the authors in [21] proposed an index-based multiobjective local search to solve multiobjective optimization problems. Kang and Park [37] considered the variable-sized bin packing problem and described two greedy algorithms; the objective was to minimize the total cost of the used bins when the unit cost per bin does not increase as the bin size increases. Moreover, the survey [38] presented an overview of approximation algorithms for the classical BPP and pointed out that, although approximation algorithms are general, there is always a gap between their solutions and the optimal solution under polynomial time complexity. In short, approximation algorithms are restricted to polynomial time and cannot guarantee optimal solutions.

2.3. Heuristic Algorithms

Heuristic algorithms are based on intuitive and empirical design. Several heuristics for solving the one-dimensional bin packing problem are presented in [39]. Coffman and Garey [10] reviewed various heuristic algorithms, such as NF (Next Fit), FF (First Fit), BF (Best Fit), and WF (Worst Fit) [23]. These are typical online packing algorithms [40, 41] and are called fit algorithms. Their offline counterparts are NFD [24], FFD, BFD, and WFD [23], which differ from the online packing algorithms in that the offline algorithms rely on global information for sorting. The fit algorithms, for example, FF, WF, and BF, give priority to bins that have already been packed with items, and a new bin is activated only when no suitable nonempty bin exists for the current item. The strategy adopted by the fit algorithms ensures that every arriving item can always find a bin to accommodate it. However, it cannot guarantee that the item is the right item for the optimal solution in the current situation. To address this issue, Gupta and Ho [14] proposed MBS, which centers on bins and tries to find the collection of items that best fills each bin. One problem with this method is that its sequence selection strategy often remains in a local region of the input space, which makes an accurate estimation of the slack difficult and may result in a locally optimal solution. To alleviate these problems, several methods have been proposed. Fleszar and Hindi [42] found that an effective hybrid method integrates perturbation MBS’ and a good set of lower bounds into variable neighbourhood search (VNS) to improve its performance within reasonably short processing times. However, due to the complexity and uncertainty of combinatorial optimization problems, heuristic algorithms that rely on empirical criteria are not always reliable.

2.4. Metaheuristic Algorithms

Metaheuristic algorithms are widely used to find near-optimal solutions of the BPP. Early typical representatives include genetic algorithms [28] and simulated annealing algorithms [29]. The former is a promising tool for the BPP, and one significant variant is widely used: the grouping genetic algorithm (GGA). Dokeroglu and Cosar [43] proposed a set of robust and scalable hybrid parallel algorithms. In GGA-CGT (grouping genetic algorithm with controlled gene transmission) [44], the transmission of the best genes in the chromosomes is promoted while keeping a balance between selection pressure and population diversity. Kucukyilmaz and Kiziloz proposed the island-parallel GGA (IPGGA) in [45], which tunes the communication topology, determines the migration and assimilation strategies, adjusts the migration rate, and exploits diversification techniques. Crainic et al. [46] proposed a two-level tabu search for the three-dimensional BPP that reduces the size of the solution space. Kumar and Raza [47] incorporated the concept of Pareto optimality for the BPP with multiple constraints and proposed a family of solutions along the trade-off surface. However, due to the lack of particle diversity in the later stages of genetic algorithms as well as PSO algorithms, premature convergence often occurs [28].

2.5. RL-Based Methods

Machine learning has been extensively studied by scholars to address the NP-hard BPP in recent years. Ruben Solozabal’s model tackled the BPP with RL: it trained multistacked long short-term memory cells as a recurrent neural network agent that embeds information from the environment. However, the performance of the model was only comparable to the FF algorithm once the neural network overhead was introduced. Inspired by the Pointer Network [48], deep learning techniques have been successfully applied to learn and optimize the placing order of items [32], to solve the classic TSP [33], and to tackle the 3D BPP. These methods utilize RL to keep the solution from converging to a local optimum, but they employ neural networks [49] within the RL, which increases the computational cost and time complexity. Heuristic algorithms rely on empirical criteria, consider only predetermined strategies, and ignore the dynamic exploration of the global data space of the BPP. RL-based methods, in contrast, can mine data information from the environmental space through trial and error. This suggests that RL can help existing heuristic algorithms fully explore the effective information in the sample space, which inspired our method.

3. Formulation of the BPP

The classic one-dimensional BPP is formalized as follows. It is assumed that there are $n$ items to be packed into bins with equal capacity $C$. The general objective is to find a packing that arranges all items with the minimum number of bins, of which the formal mathematical description can be defined as

$$\min \sum_{j=1}^{n} y_j.$$

Therein, $y_j$ indicates whether the $j$th bin $b_j$ is used or not: a value of 1 indicates that the bin is used, and a value of 0 indicates that it is not. Note that once bin $b_j$ is used, the total load of the items placed in $b_j$ cannot exceed its capacity $C$. Thus, we have

$$\sum_{i=1}^{n} w_i x_{ij} \le C\, y_j, \quad j = 1, \dots, n,$$

where $w_i$ denotes the load of the $i$th item and $x_{ij}$ indicates whether the $i$th item is packed into the $j$th bin or not; specifically, $x_{ij} = 1$ if the $i$th item is placed into the $j$th container, and $x_{ij} = 0$ otherwise. Furthermore, an equally fundamental constraint is that each item is placed into exactly one bin:

$$\sum_{j=1}^{n} x_{ij} = 1, \quad i = 1, \dots, n.$$

The detailed explanations of the parameters in the formalization are given in Table 1.
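To make the formulation concrete, the following minimal Python sketch checks a candidate packing against the capacity constraint and the single-assignment constraint above; the data layout and function name are illustrative assumptions, not part of the paper's model.

```python
def is_feasible(weights, capacity, assignment, num_bins):
    """Check a candidate packing against the BPP constraints.

    weights[i]    -- load w_i of item i
    capacity      -- bin capacity C
    assignment[i] -- index j of the bin that item i is placed into
    num_bins      -- number of opened bins (the sum of y_j)
    """
    loads = [0] * num_bins
    for i, j in enumerate(assignment):
        # Each item is placed into exactly one bin (one index per item).
        loads[j] += weights[i]
    # The total load of every used bin must not exceed C.
    return all(load <= capacity for load in loads)

# Example: 5 items with capacity 10, packed into 2 bins.
print(is_feasible([4, 6, 5, 3, 2], 10, [0, 0, 1, 1, 1], 2))  # True
```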

4. Design of RBF

In this section, the design of the proposed RBF framework is presented. First, the overview of the RBF is outlined, and then, the details of its key components, such as the RL-system, the reinforced-MBS strategy, and the instance mapping process, are presented.

4.1. Overview

The classical MBS algorithm follows two steps:
(1) Utilize the lexicographic search optimization procedure [14] to find the item set $A$ that should be allocated to the current bin.
(2) Repeat Step 1 until all items have been packed; for each bin, the minimum bin slack is $s = C - \sum_{i \in A} w_i$, where $w_i$ is the load of the packed $i$th item.
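As an illustration of the lexicographic one-bin search in Step 1, a simplified Python sketch is given below; it is our reading of the MBS bin-by-bin search, not the authors' exact procedure, and the item order and early-stopping rule are assumptions.

```python
def min_slack_subset(items, capacity):
    """One-bin lexicographic search used by MBS (simplified sketch).

    items    -- remaining item sizes, sorted in non-increasing order
    capacity -- bin capacity C
    Returns the chosen subset A and its slack s = C - sum(A).
    """
    best = {"subset": [], "slack": capacity}

    def search(start, chosen, load):
        slack = capacity - load
        if slack < best["slack"]:
            best["subset"], best["slack"] = list(chosen), slack
        if slack == 0:
            return  # the bin is exactly filled; no better subset exists
        for k in range(start, len(items)):
            if load + items[k] <= capacity:
                chosen.append(items[k])
                search(k + 1, chosen, load + items[k])
                chosen.pop()
                if best["slack"] == 0:
                    return

    search(0, [], 0)
    return best["subset"], best["slack"]

# One MBS step: pack the subset with minimal slack, then repeat on the rest.
print(min_slack_subset([7, 6, 4, 3, 2], 10))  # ([7, 3], 0)
```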

These steps mean that, in the classical MBS algorithm, the slack is used to escape the local-optimum trap in a random manner, while the exact distribution of the sample space is ignored. To resolve the instability of such a random slack, a new bin packing framework, RBF, is presented, in which the slack is learnable and adjusted according to the structure of the samples.

The framework of RBF is illustrated in Figure 1. It consists of an RL-system, a reinforced-MBS strategy, and an instance-eigenvalue mapping process, defined as follows:
(1) RL-system: the RL-system generates a suitable slack by a reinforcement learning strategy, in which the best action selection strategy is controlled by the Q-agent.
(2) Reinforced-MBS strategy: with the slack coefficient provided by the RL-system, the reinforced-MBS strategy carries out the packing process.
(3) Instance-eigenvalue mapping: instead of using the whole dataset directly, the instance-eigenvalue mapping generates a representative and classified validation set for the RL-system based on the similarity of the input instances.

The main idea of RBF is to utilize the RL-system to learn the slack according to the spatial structure of the sample dataset, so that the slack adapts to the distribution of bins and remaining items in the data space during the iterative packing process. With the instance-eigenvalue mapping, a representative and classified validation set of the input instances is generated. The validation set is then fed into the RL-system, where an adaptive slack is generated by the Q-agent. The slack coefficient is finally applied in the reinforced-MBS strategy for the packing process.

4.2. Instance-Eigenvalue Mapping

To reduce the amount of computation for the slack, representative instances are selected for the Q-agent, which can then learn the data space without traversing all instances. Here, an instance classification method, called instance-eigenvalue mapping, is proposed: the eigenvalue $e_k$ of a given instance $I_k$ is computed from the average item weight $\bar{w}_k$ of the $k$th instance in the dataset together with the minimum and maximum item weights $w_k^{\min}$ and $w_k^{\max}$ of that instance.

According to the value of the instance eigenvalue, all instances are reordered, and the dataset is divided into different subsets. The last instance of each subset is taken to form a validation set, which is then used to iteratively learn the slack. Therefore, at each time step, instead of using all instances, the Q-agent in RBF utilizes the validation set, which reduces the repetitive work of the system.
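A minimal sketch of this selection step is given below. Since the exact mapping formula is not reproduced here, the eigenvalue is assumed, for illustration only, to be a simple normalized combination of the average, minimum, and maximum item weights of an instance, and the number of subsets is a placeholder parameter.

```python
def instance_eigenvalue(instance):
    """Illustrative eigenvalue combining the mean, min, and max item weights.
    The exact mapping used by RBF may differ; this is only an assumption."""
    avg = sum(instance) / len(instance)
    return (avg - min(instance)) / (max(instance) - min(instance) + 1e-9)

def build_validation_set(instances, num_subsets=10):
    """Sort instances by eigenvalue, split them into subsets, and take the
    last instance of each subset as the representative validation instance."""
    ranked = sorted(instances, key=instance_eigenvalue)
    size = max(1, len(ranked) // num_subsets)
    subsets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [subset[-1] for subset in subsets[:num_subsets]]
```

The Q-agent then iterates only over this small validation set instead of the full dataset.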

4.3. RL-System

The validation set is integrated into the RL-system with a Q-learning algorithm [50], where the Q-agent is utilized to learn an appropriate strategy and then improve the MBS strategy by selecting a high-quality slack.

The process of the RL-system can be described as a Markov decision process (MDP), represented as a tuple $(S, A, P, R, \gamma)$. In this decision-making process, $S$ is the state set, $A$ is the action set, $P$ is the transition probability between states, $R$ is the reward received after taking a certain action and reaching the next state, and $\gamma$ is the discount factor. To adapt to the packing circumstances, for example, the current distribution of containers and the remaining items, we propose a slack learning algorithm, whose detailed process is shown in Algorithm 1 and whose parameters are listed in Table 2. By observing the current state of the environment, the Q-agent selects the action that maximizes the value of the reward function for the observed state. As the Q-agent continually interacts with the environment, a suitable data selection strategy for the slack coefficient is explored. The algorithm returns both a reward and a new state to the Q-agent in each packing iteration, and the change of states depends on the state transition probability $P(s_{t+1} \mid s_t, a_t)$.

Input: training data $I$ with $n$ items, container list $B$ with capacity $C$, remaining capacity $c$ of the bin, learning rate $\alpha$, discount factor $\gamma$, and the number of iterations $E$.
(1)Initialize Q-table;
(2)for episode in range do
(3)
(4) Initialize container list $[1, j, n]$;
(5)
(6)while not done do
(7)  According to state $s$ and the Q-table, use the epsilon-greedy strategy to select action $a$;
(8)   [1, i, n] =  [1, j, n]− [1, k, n];
(9)  Calculate immediate reward $r$ and get next state $s'$;
(10)  Get $q_{\mathrm{predict}} = Q(s, a)$ from the Q-table;
(11)  if $s'$ is not “terminal” then
(12)   $q_{\mathrm{target}} = r + \gamma \max_{a'} Q(s', a')$;
(13)  else
(14)   $q_{\mathrm{target}} = r$;
(15)   done = True;
(16)  end if
(17)  Update $Q(s, a) \leftarrow Q(s, a) + \alpha\,(q_{\mathrm{target}} - q_{\mathrm{predict}})$;
(18)   [1, j, n] =  [1, j, n]−;
(19)end while
(20)
(21), ;
(22)end for
Output: Q-table, the learned slack.
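The Python sketch below mirrors the overall structure of Algorithm 1 as a tabular Q-learning loop; the environment interface (reset/step) and the way an action maps to a packing choice are simplified assumptions rather than the authors' implementation.

```python
import random
from collections import defaultdict

def learn_slack(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning loop following the shape of Algorithm 1.

    env is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), where states encode the
    remaining bin capacity and actions index the item chosen for packing.
    """
    q_table = defaultdict(float)  # Q(s, a), initialised to 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy selection over the item indices
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = max(range(env.num_actions),
                             key=lambda a: q_table[(state, a)])
            next_state, reward, done = env.step(action)
            # standard Q-learning target
            if not done:
                target = reward + gamma * max(
                    q_table[(next_state, a)] for a in range(env.num_actions))
            else:
                target = reward
            q_table[(state, action)] += alpha * (target - q_table[(state, action)])
            state = next_state
    return q_table
```

In RBF, the learned Q-values and the rewards they induce ultimately determine the slack coefficient that is handed to the reinforced-MBS strategy.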

The agent receives the performance-driven reward $r_t$, and the sum of discounted rewards at time step $t$ is represented as

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}.$$

Therein, $\gamma \in [0, 1]$ defines the weight of future rewards in the discounted sum. The closer $\gamma$ is to 0, the more the agent is incentivized to consider short-term benefits; the closer $\gamma$ is to 1, the more it is incentivized to consider long-term benefits.

The goal of the Q-agent at each time step is to select an action that maximizes the future discounted reward by finding an optimal policy $\pi^{*}$. Here, $\pi^{*}$ is the strategy of taking the optimal action at state $s$, while $\pi$ is the strategy of taking action $a$ at state $s$. Under the policy $\pi$, the state-action value function $Q^{\pi}(s, a)$ is defined as the expectation of the discounted return when the agent takes $a$ at $s$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right],$$

where $\mathbb{E}_{\pi}[\cdot]$ is the expectation function.

The maximum state-action value function over all policies is

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).$$

The update rule of the $Q$ value is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right],$$

where $\alpha$ is the learning rate of the RL agent.

At each time step $t$, the Q-agent observes the current state $s_t$ and selects an action $a_t$ from a discrete set of behaviors $A$, whose size is equal to the number of items to be packed, $n$. At the beginning, the action is randomly initialized, that is, the action corresponding to a random number between 1 and $n$ is selected. Then, the RL-system selects the action that maximizes the $Q$ value at each time step $t$:

$$a_t = \arg\max_{a \in A} Q(s_t, a).$$

The agent uses an epsilon-greedy learning strategy [51] to choose actions: it selects the action with the optimal value in the Q-table with probability $1-\epsilon$ and selects a random action with probability $\epsilon$.

The state is represented as the remaining space of the bin after each round of packing. At each time step $t$, the remaining items are preferably packed into as few bins as possible. When a bin is exactly filled, the agent is given a positive reward; if the bin overflows, the agent is punished severely so that it learns that such states are not allowed.

The slack $\delta$ is computed from a constant term, the immediate reward $r$ achieved by the Q-agent, and an initial value used in the first iteration. By feeding back the reward value $r$, the slack is adjusted accordingly in each packing round, so that it varies within a bounded range as the reward changes. Ultimately, the environment for the next round is updated as the bin capacity minus the slack $\delta$.

The Q-agent captures this intuition through the performance-driven reward $r_t$. At each time step $t$, the agent's reward is composed of the number of bins that are exactly filled, the number of bins that are filled up to the slack space, a weight coefficient for the positive reward, the number of overflowing bins, a punishment coefficient (less than 0) for the negative reward, and a constant that regulates the magnitude of the entire reward function.

4.4. Reinforced-MBS Algorithm

By introducing the slack learned by the RL agent into MBS, we propose the reinforced-MBS algorithm. Therein, the slack $\delta$ is the parameter learned by the agent through RL, obtained by minimizing the number of used bins on the validation set. The slack coefficient is then passed into our reinforced-MBS algorithm, whose details are shown in Algorithm 2.

Input: training data with items, container list with capacity ; set , , for ,
(1)Initialize , ;
(2)generate random Slack ;
(3)for to itemList do
(4) Use the improved dictionary search procedure to find the set of items that should be allocated to the bin of
(5)ifthen Pack into the bins ;
(6)else
(7)  
(8)end if
(9)end for
(10)
Output: bins with item set , .

In Algorithm 2, the improved dictionary search procedure is utilized to find the set of items that should be assigned to the bin during the iterative process. The improved dictionary search procedure is shown in Algorithm 3.

Input: for  = 1,…,; ;  = 1, , where ; ; .
(1)Generate Slack .
(2)Step 1:
(3)if 0 then
(4)if =  , go to Step 3.
(5)else
(6) Find the arrangement of the number of the last item in the temporary item list in the original item list, that is, to find makes .
(7)ifthen
(8)  
(9)  ifthen
(10)   , prepare for packing
(11)  end if
(12)  Step 2:
(13)  ifthen
(14)   , go to Step 1.
(15)  else
(16)   ifthen
(17)    go to Step 3
(18)   else
(19)    . Find q makes , go to Step 2.
(20)   end if
(21)  end if
(22)else
(23)  go to Step 2
(24)end if
(25)end if
(26)Step 3: Place the items in into the Kth bin.
Output:.
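Putting the pieces together, the following Python sketch follows the outline of Algorithms 2 and 3: items are packed bin by bin, and the one-bin search accepts a subset as soon as the residual space it leaves is no larger than the learned slack. This slack handling is our reading of the reinforced-MBS idea under simplified assumptions, not the authors' exact code.

```python
def pack_one_bin(items, capacity, slack):
    """One-bin search that stops once the residual space is <= slack."""
    best = {"subset": [], "slack": capacity}

    def search(start, chosen, load):
        residual = capacity - load
        if residual < best["slack"]:
            best["subset"], best["slack"] = list(chosen), residual
        if residual <= slack:  # 'full enough' with respect to the learned slack
            return True
        for k in range(start, len(items)):
            if load + items[k] <= capacity:
                chosen.append(items[k])
                if search(k + 1, chosen, load + items[k]):
                    return True
                chosen.pop()
        return False

    search(0, [], 0)
    return best["subset"]

def reinforced_mbs(items, capacity, slack):
    """Pack all items bin by bin, reusing pack_one_bin for each new bin."""
    remaining = sorted(items, reverse=True)
    bins = []
    while remaining:
        subset = pack_one_bin(remaining, capacity, slack)
        if not subset:           # nothing fits; open a bin for the largest item
            subset = [remaining[0]]
        bins.append(subset)
        for item in subset:
            remaining.remove(item)
    return bins

print(reinforced_mbs([7, 6, 5, 4, 3, 2], 10, slack=1))  # [[7, 3], [6, 4], [5, 2]]
```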

5. Experimental Evaluation

In this section, experiments are carried out to verify the effectiveness and robustness of the proposed RBF. First, the experimental evaluation indexes and the datasets used in the experiments are introduced. Afterwards, the experimental results are presented and analyzed. Finally, the robustness and stability of our method are discussed.

5.1. Evaluation Indexes and Datasets
5.1.1. Competition Ratio

The competition ratio [52, 53], $\sigma$, is defined as

$$\sigma = \frac{\mathrm{SOL}(I)}{\mathrm{OPT}(I)},$$

where $\mathrm{SOL}(I)$ represents the number of bins used by the concrete algorithm and $\mathrm{OPT}(I)$ is the number of bins in the optimal solution for the packing instance $I$. A competition ratio equal to 1 means that the algorithm has found the optimal solution.

Generally, $\mathrm{OPT}(I)$ has a lower limit, as shown in formula (14), where $\lceil \cdot \rceil$ is the ceiling function. Due to the packing constraints, the number of bins used for an instance cannot be less than the ratio of the total load of the items to the capacity of a single bin:

$$\mathrm{OPT}(I) \ge \left\lceil \frac{\sum_{i=1}^{n} w_i}{C} \right\rceil. \tag{14}$$

5.1.2. FSOL

For a dataset, FSOL represents the number of instances for which the algorithm achieves a feasible optimal solution, in other words, the number of instances whose competition ratio $\sigma$ is 1. For a specific algorithm alg and dataset data, it is written as $\mathrm{FSOL}_{\mathrm{alg}}(\mathrm{data})$.

5.1.3. Realization Rate

The realization rate (RT) is defined as formula (15), where INS is the number of instances in the packing dataset:

$$\mathrm{RT} = \frac{\mathrm{FSOL}}{\mathrm{INS}} \times 100\%. \tag{15}$$

5.1.4. Gap

Gap refers to the deviation between the number of bins used by the algorithm and the optimal number for the packing instance. The relative Gap is exploited to evaluate the performance of the algorithms and is calculated as

$$\mathrm{Gap} = \frac{\mathrm{SOL}(I) - \mathrm{OPT}(I)}{\mathrm{OPT}(I)} \times 100\%. \tag{16}$$
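For completeness, the metrics above can be computed as in the sketch below; the results format (a list of (SOL, OPT) pairs per instance) and the normalization of the relative Gap by OPT are illustrative assumptions.

```python
import math

def evaluate(results):
    """results: list of (sol, opt) pairs, one per packing instance.
    Returns the competition ratios, FSOL, RT (%), and the average Gap (%)."""
    ratios = [sol / opt for sol, opt in results]              # competition ratio
    fsol = sum(1 for sol, opt in results if sol == opt)       # optimal solutions found
    rt = 100.0 * fsol / len(results)                          # realization rate
    gap = 100.0 * sum((sol - opt) / opt for sol, opt in results) / len(results)
    return ratios, fsol, rt, gap

def lower_bound(weights, capacity):
    """Lower limit of OPT from formula (14): ceil(total load / capacity)."""
    return math.ceil(sum(weights) / capacity)
```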

The BINDATA [54] and SCH_WAE [55] datasets are used in the experiments for evaluation. The BINDATA dataset includes three subsets: Bin1data, Bin2data, and Bin3data. Details of the datasets, such as the number of instances, the weights of items, the capacity of bins, and the number of items per instance, are shown in Table 3.

5.2. Experimental Results and Analysis

The performance of RBF is compared with that of the classical fit algorithms, the MBS algorithm, and the MBS’ algorithm on the BINDATA and SCH_WAE datasets shown in Table 3. For each instance of each dataset, the number of items in each category is the same. The experimental results reported in this paper are averaged over ten runs per hyperparameter setting.

5.2.1. Results on BINDATA

Table 4 lists the FSOL, RT, and Gap of the algorithms involved in the comparison on BINDATA, while Table 5 lists the values of the competition ratio $\sigma$. In comparison with the classical heuristic algorithms, such as NFD, FFD, WFD, AWFD, BFD, MBS, and MBS’, RBF obtains the maximum FSOL on BINDATA, while its Gap and $\sigma$ are minimal. Furthermore, the improvement of FSOL, represented as $\eta$ and defined as formula (17), is calculated with respect to MBS ($\eta_{\mathrm{MBS}}$) and MBS’ ($\eta_{\mathrm{MBS'}}$). Especially, for Bin1data and Bin2data, $\eta_{\mathrm{MBS}}$ is 165.08% and 179.2%, respectively, while $\eta_{\mathrm{MBS'}}$ is 5.53% and 41.3%, respectively. For the dataset Bin3data, RBF is the only algorithm that obtains 2 optimal solutions out of the total 10 cases, while the others obtain zero optimal solutions:

$$\eta_{\mathrm{alg}} = \frac{\mathrm{FSOL}_{\mathrm{RBF}} - \mathrm{FSOL}_{\mathrm{alg}}}{\mathrm{FSOL}_{\mathrm{alg}}} \times 100\%. \tag{17}$$

5.2.2. Results on SCH_WAE

The results of the compared algorithms on SCH_WAE are listed in Table 6, and the detailed results of RBF on SCH_WAE, that is, SOL, OPT, and the time cost (Runtime) on each instance, are shown in Tables 7 and 8. In the setting of Table 7, the number of items is 100 and the container capacity is 1000; Table 8 shows the statistical results when the number of items is 120. Meanwhile, the competition ratios $\sigma$ of each algorithm are shown in the last column of Table 5. In comparison with the other algorithms on SCH_WAE, RBF achieves the minimum $\sigma$ and Gap and the maximum FSOL and RT. Especially, for SCH_WAE, $\eta_{\mathrm{MBS}}$ is 472%, while $\eta_{\mathrm{MBS'}}$ is 346.88%.

5.2.3. Cumulative Results

For all the instances in BINDATA and SCH_WAE, the cumulative packing results of the compared algorithms, that is, the cumulative FSOL, the average RT, the average $\sigma$, and the average Gap, are shown in Table 9. RBF obtains a cumulative FSOL of 1162 with an RT of 82.41%, which greatly exceeds the other compared algorithms. Especially, according to formula (17), the cumulative improvements of RBF over MBS and MBS’, $\eta_{\mathrm{MBS}}$ and $\eta_{\mathrm{MBS'}}$, are 189.05% and 27.41%, respectively. Figure 2 graphically shows the statistical FSOL of the compared algorithms on each dataset. From the radar chart in Figure 2, the quantitative curve of RBF for FSOL is the outermost, which means that RBF achieves the largest number of optimal solutions in the test cases. RBF also achieves the minimum $\sigma$ and Gap. These results indicate that, in comparison with the typical heuristic algorithms, RBF has stronger global optimization performance. Overall, the proposed RBF ranks first in all metrics according to the results of all compared algorithms listed in Tables 4–6. In detail, the improvement in FSOL on the uncomplicated dataset, Bin1data, is limited among all compared methods: RBF achieves the best results but wins by a small margin. However, RBF works much better than the other methods on the difficult datasets, Bin2data, Bin3data, and SCH_WAE. Moreover, RBF achieves large advantages in FSOL and outperforms almost all other methods on the used datasets.

5.3. Robustness and Stability

The construction of the validation set is a key procedure of RBF. This experiment verifies the validity of the eigenvalue mapping function on Bin1data. Since 10 instances of Bin1data are selected by the eigenvalue mapping function to form the validation set, different selection policies are applied for comparison in the packing process: the first policy selects the first 10 instances of Bin1data, the second selects the last 10 instances, and the third selects 10 random instances to form the validation set. The packing results with the different selection policies are reported in Table 10. It can be seen that the value of the slack learned by the Q-agent differs across the selection policies. Especially, with the selection policy of the eigenvalue mapping function, RBF achieves the maximum FSOL and RT and the minimum $\sigma$ and Gap. The results verify the validity of the eigenvalue mapping function, which helps RBF achieve better performance.

6. Conclusion and Future Work

In this paper, we propose the reinforced bin packing framework (RBF) to tackle the one-dimensional BPP. The proposed RBF consists of three main components: the RL-system, the instance-eigenvalue mapping process, and the reinforced-MBS strategy. The RL-system constructs a slack selection policy automatically, using the Q-agent to select a high-quality slack for the heuristic algorithm integrated in RBF. The instance-eigenvalue mapping process generates a representative and classified validation set based on the similarity of the input instances, which greatly reduces the computational overhead and improves the generalization performance of the model. Finally, with the slack coefficient provided by the RL-system, the reinforced-MBS strategy carries out the packing process. We evaluate our model on BPP tasks, where RBF exhibits excellent packing ability, and the experimental results validate its superior performance compared with state-of-the-art methods on the BINDATA and SCH_WAE datasets. Compared with its baseline methods, MBS and MBS’, the average number of optimal solutions achieved by RBF increases by 189.05% and 27.41%, respectively.

For future work, we plan to investigate slack selection policies and new mechanisms to learn them automatically. We also foresee the extension of our method to more complex multiagent reinforcement learning frameworks, where the use of new aspects of the multiagent communication environment is crucial to boost the packing performance.

Data Availability

The datasets used in this paper are one-dimensional bin packing datasets, namely, the BINDATA and SCH_WAE datasets, which can be found at http://people.brunel.ac.uk/~mastjjb/jeb/orlib/binpackinfo.html.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61472139) and the Shanghai 2020 Action Plan of Technological Innovation (no. 20dz1201400).