Article

A Deep Reinforcement Learning-Based Scheme for Solving Multiple Knapsack Problems

1 Korea Institute of Energy Technology (KENTECH), Naju-si 58217, Korea
2 Defense AI Technology Center, Agency for Defense Development (ADD), Daejeon 34186, Korea
3 AI Graduate School, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(6), 3068; https://doi.org/10.3390/app12063068
Submission received: 15 February 2022 / Revised: 13 March 2022 / Accepted: 14 March 2022 / Published: 17 March 2022
(This article belongs to the Topic Complex Systems and Artificial Intelligence)

Abstract

A knapsack problem is to select a set of items that maximizes the total profit of the selected items while keeping their total weight no greater than the capacity of the knapsack. As a generalized form with multiple knapsacks, the multi-knapsack problem (MKP) is to select a disjoint set of items for each knapsack. To solve MKP, we propose a deep reinforcement learning (DRL) based approach, which takes as input the available capacities of the knapsacks, the total profit and weight of the selected items, and the normalized profits and weights of the unselected items, and determines the next item to be mapped to the knapsack with the largest available capacity. To expedite the learning process, we adopt the Asynchronous Advantage Actor-Critic (A3C) algorithm for the policy model. The experimental results indicate that the proposed method outperforms the random and greedy methods and achieves performance comparable to an optimal policy in terms of the profit ratio of the selected items to the total profit sum, particularly when the profits and weights of items have a non-linear relationship such as a quadratic form.

1. Introduction

The knapsack problem is a classic combinatorial optimization problem. Extending the single knapsack problem, the multiple knapsack problem (MKP) is the problem of finding disjoint subsets of items that maximize the total profit in the knapsacks. In a nutshell, each subset can be assigned to a different knapsack, as long as the total weight of the subset is no greater than the capacity of the corresponding knapsack. MKP is formulated as follows:
$$\max_{x_{ij}} \sum_{j \in J} \sum_{i \in I} p_i x_{ij} \quad \text{subject to} \quad \sum_{i \in I} w_i x_{ij} \le c_j, \quad \sum_{j \in J} x_{ij} \le 1 \ \text{for} \ i \in I, \ j \in J, \quad x_{ij} \in \{0, 1\}, \tag{1}$$
where $J$ is the set of knapsacks, $I$ is the set of items, $p_i$ is the profit of the $i$-th item, $w_i$ is the weight of the $i$-th item, and $x_{ij}$ is a decision variable indicating whether the $i$-th item is put in the $j$-th knapsack. The exact method has a computational complexity of $O((|J|+1)^{|I|})$.
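For concreteness, the following is a minimal sketch of formulation (1) written with the Gurobi Python interface (gurobipy), which is also the optimizer used for the optimal baselines in Section 4; the item profits, weights, and knapsack capacities below are illustrative placeholders, not instances from the paper.

```python
# Minimal sketch of formulation (1) with gurobipy; data are placeholders.
import gurobipy as gp
from gurobipy import GRB

profits = [6, 5, 8, 9, 6, 7, 3]      # p_i (hypothetical)
weights = [2, 3, 6, 7, 5, 9, 4]      # w_i (hypothetical)
capacities = [10, 10]                # c_j (hypothetical)
I, J = range(len(profits)), range(len(capacities))

m = gp.Model("mkp")
x = m.addVars(I, J, vtype=GRB.BINARY, name="x")                    # x_ij in {0, 1}
m.setObjective(gp.quicksum(profits[i] * x[i, j] for i in I for j in J),
               GRB.MAXIMIZE)
m.addConstrs((gp.quicksum(weights[i] * x[i, j] for i in I) <= capacities[j]
              for j in J), name="capacity")                        # sum_i w_i x_ij <= c_j
m.addConstrs((x.sum(i, "*") <= 1 for i in I), name="assign_once")  # sum_j x_ij <= 1
m.optimize()

selected = [(i, j) for i in I for j in J if x[i, j].X > 0.5]
```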
MKP arises in numerous real-world applications such as vehicle/container loading, production scheduling, and resource allocation in computer network systems [1,2]. Lahyani et al. [2] formulated production scheduling as an MKP, which assigns items to periods of the production schedule to maximize the profit of production and minimize the cost while respecting the capacity constraints. Kumaraguruparan et al. [3] formulated appliance scheduling in a smart grid infrastructure as an MKP. To minimize the electricity bill of a household, it treats an appliance as an item, its energy consumption as a weight, and a time period as a knapsack. Ketykó et al. [4] formulated a multi-user computation offloading problem as an MKP. To maximize the profit of user equipment, it treats pieces of user equipment as items, the requested CPU as weights, and the CPU capacities of mobile edge computing servers as knapsack capacities. Cappanera et al. [5] formulated Virtual Network Function (VNF) placement problems as MKPs, treating a data center as a knapsack, a service request as an item, and the quantity of the service request as a weight, and solving the problem to maximize the total priority level of the requests. In these application fields, solving MKP with high quality in practical time is essential for maximizing revenue.
In MKP, there is a trade-off between empirical performance and computational complexity. Note that MKP is an NP-hard problem [6]. Traditional approaches, such as heuristic or genetic algorithms, focus on reducing the computational complexity. More recently, the development of deep neural networks (DNNs) has accelerated the application of machine learning (ML) to various engineering problems. Reinforcement learning (RL), one type of ML, is an experience-driven method that enables autonomous agents to solve decision-making problems [7]. Deep reinforcement learning (DRL) has been applied not only to robot movement [8] and games such as AlphaGo [9] but also to discrete combinatorial optimization problems [10,11,12,13,14,15]. DRL agents trained with DNNs can solve high-dimensional problems in a reasonable amount of time [2,16,17]. Recently, the traveling salesman problem (TSP) and the knapsack problem have been solved not only with supervised learning [10,18] but also with RL [11,19]. DRL with experience replay can be applied to MKP because it can deal with various combinations of items and knapsacks in real time without massive predefined training data. In the literature, there are several approaches that solve a single knapsack problem, but there is a lack of approaches for MKP [10,11,19].
In this paper, we propose a DRL-based method to solve MKP. The proposed method solves MKP by exploiting an existing Markov decision process (MDP) model for the single knapsack problem in [11] and devising a novel capacity-aware knapsack selection method for MKP. Each state represents a separate problem state, and the agent's action decides which item is mapped to the knapsack with the largest remaining capacity. As a result, the proposed method constructs a solution for an MKP by combining the sequential actions made by the agent in each state. Our main contributions are as follows:
  • We propose a DRL-based method to solve MKP, in which the DNN model is extensively trained with various combinations of random items and knapsacks. The trained single DNN model has the capability of solving diversified MKPs with untrained instance sets of items and knapsacks.
  • We simplify the action space for MKP to make it possible to train the model in a scalable manner. The proposed method combines a greedy sorting algorithm with the DRL training process, and the size of the action space is fixed to the number of items regardless of the number of knapsacks. This simplifies the state transition pattern so that the agent can quickly identify changes in the environment during the training process.
  • We adopt the Asynchronous Advantage Actor-Critic (A3C) [20] to expedite the learning process for the DNN model. The training is performed asynchronously with multiple learners. The global model is shared with every learner, and each learner contributes to updating the global model at the end of each episode.
  • The experiments show that the DNN model can be successfully trained in a variety of configurations in terms of the number and capacity of knapsacks and the number, weight, and profit of items. It is demonstrated that even when the weights and profits of items have a nonlinear relationship, the proposed method achieves performance comparable to an optimal policy.
The remainder of this article is organized as follows. In Section 2, we summarize existing work on knapsack problems. In Section 3, we introduce a DRL method with a sorting algorithm for solving MKP. Section 4 compares our method with baseline methods by evaluating computational results on random, linear, and quadratic instances. Finally, we conclude the paper in Section 5.

2. Related Work

2.1. Conventional Algorithms for Knapsack Problems

To reduce the complexity of exact algorithms, heuristic and evolutionary algorithms solve single or multiple knapsack problems in a relaxed and greedy manner.

2.1.1. Exact Algorithms and Heuristic Algorithms

Martello and Toth proposed MTM, a branch-and-bound-based exact algorithm. The process of MTM includes surrogate relaxation and lower bounds that heuristically find an optimal solution for the current single knapsack one by one [6]. Mulknap is a branch-and-bound-based exact method proposed by Pisinger, which employs surrogate relaxation for upper bounds and the subset-sum problem for lower bounds [21]. Martello and Toth also proposed MTHM, a polynomial-time approximate algorithm. Its overall process consists of greedy mapping, rearrangement by reordering, swapping items between two knapsacks, and replacing one item with a subset of unassigned items, which yields a computational complexity of $O(|I|^2)$ [6]. For solving MKP with setup, Lahyani et al. [2] proposed a matheuristic method that generates heuristic-based solutions and tabu lists using period and class exchanges. Dell'Amico et al. [22] developed a hybrid exact algorithm that combines Mulknap with a decomposition method; it calls the Mulknap (branch-and-bound) algorithm for τ seconds and iterates the decomposition ν times.

2.1.2. Genetic Algorithm and Evolutionary Algorithm

The genetic algorithm (GA) is a representative approach to obtain a local optimum using random search. The basic process of GA is as follows: (1) initially generate random solutions; (2) evaluate solutions by their fitness value; (3) select two parent solutions; (4) regenerate solutions using genetic operators (crossover, mutation) [23]. Khuri et al. [24] used GA to solve the 0/1 MKP. It generates initial random solutions and applies genetic operators (selection, crossover, and mutation) proportionally. The algorithm searches for a solution that maximizes a fitness value defined as the total profit minus the overfilled weights. Falkenauer [25] proposed the Grouping Genetic Algorithm (GGA) to solve bin packing problems; in the GGA process, the initial population is generated using copy and crossover. Fukunaga [26] proposed UGGA, which iteratively calls Fillundominated in the initialization stage of GGA. Fillundominated is an algorithm that finds an item to replace a subset of a knapsack such that the item's weight is equal to or greater than the subset's total weight and its profit is equal to or greater than the subset's total profit. Han and Kim proposed the Quantum-inspired Evolutionary Algorithm (QEA) to solve a single knapsack problem. QEA can generate a diversified population because a Q-bit has a linear superposition of binary states [27]. In the process, the algorithm observes the state of the quantum bits and iteratively compares the current solution with the previous one.

2.2. Combinatorial Optimization with ML

Machine learning approximation reduces the large amount of computation in decision-making problems. Various DNN-based approaches, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been applied to a variety of optimization problems.

2.2.1. Combinatorial Optimization Problem (COP) with Pointer Network

Vinyals et al. [18] introduced pointer networks with RNNs to solve geometric problems. Two separate Long Short-Term Memory (LSTM) RNNs are used: one is an encoder that encodes the input sequence, and the other is a decoder that derives the output probabilities sequentially. Therefore, it can be applied to problems in which the output length depends on the length of the input sequence, such as selecting the next point to visit from a given set of points. Bello et al. [19] applied pointer networks with the A3C method to solve not only TSP but also single knapsack problems. The output of the decoder selects an item at each iteration. Their experiments on single knapsack problems use instances with a fixed capacity in both training and testing. Gu et al. [10] used pointer networks to solve single knapsack problems. Each pointer corresponds to an item, and the 0-indexed pointer marks the end of the selected items. It employs a supervised loss function with a cross-entropy objective. Hu et al. employed pointer networks and DRL to solve 3D bin packing problems; their experiments showed that the solution of the proposed method has the smallest surface area compared to heuristics [28].

2.2.2. COP with Supervised/Unsupervised Learning

Rezoug et al. [29] proposed a supervised learning-based model to solve multidimensional knapsack problems. The proposed model is trained to approximate the optimal solution using multiple regression methods, including K-Nearest Neighbors (KNN) and Bayesian Automatic Relevance Determination (ARD). The model trained with small instances can solve problems with large instances. García et al. [30] designed an unsupervised-learning KNN approach to solve multidimensional knapsack problems. The designed method is a hybrid of KNN and metaheuristics, namely Particle Swarm Optimization (PSO) and Cuckoo Search (CS).

2.2.3. COP with DRL in MDP

Dai et al. [13] used structure2vec to solve COPs in a weighted graph environment with a tagged node state and a partial tour length reward to reinforce the policy. Laterre et al. [31] formulated the process of solving bin packing problems as an MDP with DRL. The objective is to minimize the cost; when all items are placed, the agent receives one over the cost as a reward, and a reward buffer evaluates it by comparing the current reward with the best reward so far. The ranked reward is determined based on this evaluation, which drives the search toward solutions that outperform the previous ones. Kundu et al. [14] used the Double Deep Q-Network (DDQN) algorithm to solve a 2D bin packing problem. To maximize the empty space, the agent is reinforced with a reward that is either cluster size × compactness or a negative constant. The evaluation shows an increase in efficiency and a decrease in loss as the number of learning iterations increases. Afshar et al. [11] proposed a state aggregation method to solve a single knapsack problem as an MDP with the Advantage Actor-Critic (A2C) algorithm. The results show that the proposed method outperforms methods without aggregation. Zhang et al. [15] used an attention model to solve the Dynamic Traveling Salesman Problem (DTSP), in which the salesman decides which cities to visit during the travel while new customers can appear dynamically. They proposed a DRL framework in an MDP where the agent model is composed of an encoder and a decoder, and the states combine a static state and a dynamic state.

3. Proposed Method

3.1. Actor Critic Background

In RL, an agent in a given state selects an action, and the environment returns a reward and the next state to the agent. The state, action, reward, and next state at time $t$ are used in the MDP and can be denoted as the tuple $(s_t, a_t, r_t, s_{t+1})$. The sequence of tuples $(s_t, a_t, r_t, s_{t+1})$ over an episode is called a trajectory. The return value of a trajectory can be denoted as $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$, where $\gamma$ is a discount factor less than 1 and $r_{t+1}$ is the reward caused by the action at time $t$. The action in a state is determined by a policy, which can be deterministic or stochastic. The goal of RL is to find a policy that maximizes the return value. A value function plays a key role in estimating how good the transition from a given state to a target state is. The value function at $s_{t+1}$ gives the expected accumulated reward from $s_{t+1}$ to the end of the episode and can be denoted by $V(s_{t+1}) = \mathbb{E}[R_{t+1}]$.
There are two representative RL methods for learning the policy of an agent: value-based and policy-based. In value-based learning, the policy is determined by value functions. The estimate of the value function in a certain state is derived by updating it using the expected return values of the candidate states in the very next step. As a result, many iterations are required to obtain a converged expected accumulated reward at the first time step. In contrast, policy-based learning allows the policy to converge quickly without requiring many iterations for a converged expected accumulated reward. As a policy-based method, the policy gradient (PG) method directly updates a policy with differentiable parameters toward maximizing the expected return value as follows:
$$J(\theta) = \mathbb{E}_{\pi_\theta}[R_0] \tag{2}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ R_t \, \nabla_\theta \log \pi_\theta(a_t, s_t) \right] \tag{3}$$
In (2), $\pi_\theta$ denotes the policy of the actions following the parameter vector $\theta$. The objective function $J(\theta)$ is equal to the expected return value obtained by $\pi_\theta$ at time 0. Finding a $\pi_\theta$ that maximizes $J(\theta)$ is the fundamental goal. In (3), $\nabla_\theta J(\theta)$ is the partial derivative of $J(\theta)$ in (2) with respect to $\theta$, and $R_t$ is the return value at time $t$. By utilizing the log-derivative trick, the sampled return value from an episode is used directly to update the policy, which can increase the variance. An actor-critic algorithm can reduce the variance by substituting an estimated value per step. The estimated value can serve as a baseline, which can be denoted as $V_{\theta_v}(s_t) \approx b(s_t)$. The actor is the policy updated by the PG method, and the critic is an approximated value function that estimates the return value obtained by the actor.
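As a minimal numeric illustration of these quantities, the snippet below computes a discounted return and a one-step baseline-subtracted (advantage) signal, the same quantity that reappears later as Equation (9); the reward and value numbers are made up.

```python
# Minimal sketch of the discounted return and the baseline-subtracted
# (advantage) signal described above; rewards and values are made-up numbers.
gamma = 0.99

def discounted_return(rewards, gamma):
    # R = sum_{t=0}^{T-1} gamma^t * r_{t+1}
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.4, 0.7, -0.1, 0.9]   # r_1 ... r_T (hypothetical episode)
values  = [1.8, 1.5, 0.9, 0.8]    # critic estimates V(s_t) (hypothetical)

R0 = discounted_return(rewards, gamma)
# One-step advantage at t = 0: r_1 + gamma * V(s_1) - V(s_0)
advantage_0 = rewards[0] + gamma * values[1] - values[0]
print(R0, advantage_0)
```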

3.2. System Model

Our work proposes the state, action, and reward of an MDP for solving MKP, referring to and modifying parts of the existing work [11]. The state transition is deterministic: the next state is decided by the current state and action.
State: The proposed state contains the profit and weight information of at most $N$ items to be selected and the capacity information of all knapsacks. Figure 1 shows the proposed state for MKP. Let $|I|$ denote the number of items remaining to be selected; it indicates the number of problems for the agent to solve by selecting a single item at each time step. Let $c_j$ denote the $j$-th largest capacity in the state. Each knapsack problem has a different target number, which is the number of not-fully-occupied knapsacks that can take more items. Because their target is a single knapsack, Afshar et al. [11] use one capacity field, whereas our method can use one or more capacity fields because we target not only a single knapsack but also multiple knapsacks. In Figure 1, it is assumed that $M$ is equal to the number of knapsacks. There are $M$ capacity fields, which correspond to $c_1, c_2, c_3, \dots, c_M$ and are sorted in descending order of capacity: $c_1$ with the largest capacity is located first, $c_2$ with the second largest capacity second, and $c_M$ with the smallest capacity last. In Figure 1, $\sum_i p_i$ and $\sum_i w_i$ are the summed profit and the summed weight of the selected items, respectively. The profit and weight information of the remaining items is then listed, where $p_{n_i}$ is the normalized $p_i$ as shown in (4):
$$p_{n_i} = \frac{p_i}{p_{\max}}, \tag{4}$$
where $p_i$ is divided by $p_{\max}$, the maximum of the given profits. The pairs of $p_{n_i}$ and $w_i / c_1$ are listed in descending order of the $p_i / w_i$ value ($i \in I$). In the fields of each item, $p_{n_i}$ is placed first and $w_i / c_1$ second; thus $p_{n_1}$ occupies the first field, $w_1 / c_1$ the second, $p_{n_2}$ the third, $w_2 / c_1$ the fourth, and so on for the remaining items. If an item is selected for a certain knapsack or discarded, the state no longer contains the item's information, and the remaining $(N - |I|) \times 2$ fields have zero values.
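A minimal sketch of how such a state vector could be assembled from the description above is given below; the function name, the exact treatment of $p_{\max}$, and the example numbers are our own illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the state layout in Figure 1; N (max items) and M
# (number of knapsacks) are fixed per model, and the item data are placeholders.
def build_state(capacities, items, sum_profit, sum_weight, N):
    caps = sorted(capacities, reverse=True)       # c_1 >= c_2 >= ... >= c_M
    c1 = caps[0]
    p_max = max(p for p, _ in items) if items else 1.0
    # (p_n_i, w_i / c_1) pairs, listed in descending order of p_i / w_i
    pairs = sorted(items, key=lambda pw: pw[0] / pw[1], reverse=True)
    feats = []
    for p, w in pairs:
        feats += [p / p_max, w / c1]
    feats += [0.0] * (2 * N - len(feats))         # zero-pad unused item slots
    return np.array(caps + [sum_profit, sum_weight] + feats, dtype=np.float32)

# Example: 3 knapsacks, 4 remaining items as (profit, weight), nothing selected yet
state = build_state([10, 7, 4], [(6, 2), (5, 3), (8, 6), (9, 7)], 0, 0, N=50)
```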
Action: The action space contains $N$ actions, $A = \{1, \dots, N\}$. An action $a \in A$ implies that the $a$-th item in the item set $I$ is selected and put into the knapsack that has the maximum capacity $c_1$. Note that the size of the action space is fixed to $N$ regardless of the number of knapsacks.
Reward functions: If the agent's action successfully allocates the item to the knapsack, the environment returns a positive reward. When the item's weight exceeds the capacity of the target knapsack, a negative reward is given. When the chosen action index is greater than the number of remaining items, i.e., an already chosen item is chosen, a larger negative reward is given than in the previous case. The proposed reward is given by
$$r_t = \begin{cases} p_{n_i}, & w_i \le c_j \\ -\eta, & w_i > c_j \\ -\xi, & a > |I| \end{cases} \tag{5}$$
Our reward value is different from that in [11], which is given by
$$\hat{r}_t = \begin{cases} \dfrac{p_i}{w_i \cdot c}, & w_i \le c \\[4pt] -\dfrac{w_i}{c}, & w_i > c \\[4pt] -c, & a > |I| \end{cases} \tag{6}$$
In the first case, the reward is set to $p_{n_i}$ without considering the effect of weight and capacity, so that the objective is closer to the MKP objective function in (1). In the second case of $-\eta$, we remove the effect of the violated item's weight on the reward, because the only thing the agent needs to know is whether the weight exceeds the capacity or not. Similarly, in the last case of $-\xi$, the only important thing is whether the item has already been chosen or not. Hence, we remove the effect of the capacity size compared with [11].
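The reward in (5) can be sketched as a small Python function as follows; since the target knapsack of an action is the one with the largest remaining capacity, $c_1$ plays the role of $c_j$, and the penalty constants $\eta$ and $\xi$ are not specified in the text, so the defaults here are placeholders.

```python
# Minimal sketch of the reward in Eq. (5); eta and xi are unspecified
# penalty constants, so the default values here are placeholders.
def reward(action, items, c1, p_max, eta=1.0, xi=2.0):
    """items: list of (profit, weight) still available; c1: largest capacity."""
    if action > len(items):       # already-chosen (or out-of-range) item
        return -xi
    p, w = items[action - 1]      # actions are 1-indexed
    if w > c1:                    # violates the capacity of the target knapsack
        return -eta
    return p / p_max              # successful allocation: normalized profit p_n_i
```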

3.3. DRL Approaches

Our purpose is to obtain the maximum summed profit within a reasonable number of computational steps according to the reward functions. To accomplish this, A3C [20,32] is used for our policy model. The proposed policy model is parameterized by $\theta$ for the policy and $\theta_v$ for the critic's value function, as done in [33]. Using $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) V_{\theta_v}(s_t)] = 0$, the expected return value in (2) and its derivative in (3) can be replaced with the expected advantage value and its derivative, respectively, as follows:
$$J(\theta) = \mathbb{E}_{\pi_\theta}[A^{\pi_\theta}(s, a)] \tag{7}$$
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a) \right], \tag{8}$$
where $A^{\pi_\theta}(s, a)$ denotes the advantage obtained by $\pi_\theta$. The advantage value is obtained by subtracting the baseline value from the return value and is given by
$$A_t^{\pi_\theta} = r_{t+1} + \gamma V_{\theta_v}(s_{t+1}) - V_{\theta_v}(s_t). \tag{9}$$
Note that Equation (9) gives the one-step advantage function. According to the update interval, we use the critic's estimated value functions to obtain each advantage value. Because $A_t^{\pi_\theta}$ in (9) is the same as the one-step temporal difference (TD) error, it can be represented as $\delta_t(\theta_v)$. As a result, the expected value of $A^{\pi_\theta}(s, a)$ and the expected value of $\delta_t(\theta_v)$ are the same:
$$\mathbb{E}_{\pi_\theta}[\delta_t(\theta_v)] = \mathbb{E}_{\pi_\theta}[A_t^{\pi_\theta}]. \tag{10}$$
The loss function is composed of a value loss and a policy loss. The value loss is the mean squared TD error, and the policy loss is the negative $\log \pi_\theta(a_t | s_t)$ multiplied by the TD error treated as a constant:
$$\mathrm{Loss} = \frac{1}{T} \sum_{t=0}^{T-1} \left[ \left( \delta_t(\theta_v) \right)^2 + \left( -\log \pi_\theta(a_t | s_t) \, \delta_t(\theta_v) \right) \right]. \tag{11}$$
Under the assumption that the update interval is $T$, the parameters $\theta$ and $\theta_v$ are updated according to the loss function shown in (11).
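A minimal PyTorch-style sketch of the loss in (11) over one update interval is shown below; the tensor shapes and the handling of the bootstrap value $V(s_T)$ are simplifying assumptions rather than the exact implementation.

```python
import torch

# Minimal sketch of the loss in Eq. (11) over one update interval of length T;
# log_probs and values are assumed to come from the actor and critic heads.
def a3c_loss(log_probs, values, rewards, next_value, gamma=0.99):
    """log_probs, values, rewards: 1-D tensors of length T; next_value: V(s_T)."""
    values_next = torch.cat([values[1:], next_value.view(1)])
    # One-step TD error: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t), cf. Eq. (9)
    td = rewards + gamma * values_next - values
    value_loss = td.pow(2).mean()
    # Policy loss uses the TD error as a constant (detached) advantage estimate
    policy_loss = (-log_probs * td.detach()).mean()
    return value_loss + policy_loss
```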

3.4. Proposed Algorithm

We use multiple actor-learners to exploit different explorations in parallel and maximize diversity, rather than a single actor-learner. This reduces the temporal correlation that arises in a single learner's updates. Furthermore, it can shorten the training time and make the on-policy method more stable [32].
Algorithm 1 shows the asynchronous training process. Multiple subprocesses, each with its own local model, copy the global model's parameters to the local model before starting the next episode. After the end of each episode and the learning process of Algorithm 2, the global model's parameters are updated with the trained local model's parameters.
Algorithm 1 Asynchronous process
  • Input: N (the number of items), M (the number of knapsacks), AL (the number of actor-learners)
  • Output: Trained global model
1: Initialize a policy network π_θ(a_t|s_t) with parameters θ and a value function V_{θ_v}(s_t) with parameters θ_v for the global model
2: Process_max = AL, i = 0
3: while i < Process_max do
4:     Process i starts the training process (Algorithm 2)
5:     i ← i + 1
6: end while
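A possible realization of the asynchronous structure of Algorithm 1 with torch.multiprocessing is sketched below; GlobalModel and train_process are placeholders standing in for the actor-critic network and the training process of Algorithm 2.

```python
import torch.multiprocessing as mp

# Minimal sketch of Algorithm 1 with torch.multiprocessing; GlobalModel and
# train_process are placeholders for the actor-critic network and Algorithm 2.
def run_asynchronous(GlobalModel, train_process, num_actor_learners, *args):
    global_model = GlobalModel()
    global_model.share_memory()              # parameters shared across processes
    workers = []
    for rank in range(num_actor_learners):   # one actor-learner per process
        p = mp.Process(target=train_process, args=(rank, global_model, *args))
        p.start()
        workers.append(p)
    for p in workers:
        p.join()
    return global_model                      # trained global model
```

In practice this would be called under an `if __name__ == "__main__":` guard, and each worker would copy the shared parameters into its local model before an episode and push its update back afterwards, as described above.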
Algorithm 2 runs in each process with the actor-critic model. The input is K problem instances with N items. The total number of training episodes is K × R (line 1). In every iteration, episode i uses the e-th instance, where e is in the range between 0 and (K − 1) (line 3). The number of initial items in each episode is a random integer between 1 and N. The initial capacity of each knapsack is a random integer between the minimum weight of the item set and the maximum capacity C. Each step of an episode iterates while there is at least one candidate item that fits; the episode ends when the capacities are not sufficient or all items have been chosen (selected or discarded), i.e., |I| = 0. An action is always chosen from the range between 1 and N. In line 12, the agent takes an action according to the policy π_θ(a_t|s_t). If the chosen item has not yet been selected or discarded and its weight is less than or equal to c_1, the item is selected for the knapsack. As a result, capacity c_1 changes to the current capacity minus the weight of the item selected by action a (line 15). The item is marked as selected and excluded from the item set I (lines 16 and 18), where a_i denotes the index of the item in the initial item set. Subsequently, Algorithm 3 is called to construct the capacity fields of the next state. The output of Algorithm 2 is the decision variables of the items for each of the K instances.
Algorithm 3 is a greedy sort algorithm for the next state following the action in Algorithm 2. It sorts the capacities in descending order and only deals with capacities that are larger than zero. Therefore, the process creates newly sorted capacities c_1, ⋯, c_V for V ≤ M. Figure 2 shows an example of the sorting process. In state s_t, the knapsacks are sorted as a, b, c, d according to their initial capacities, denoted by capa_a, capa_b, capa_c, and capa_d. When the item corresponding to action_t is selected and the weight constraint is satisfied, capa_a is reduced by the weight of the item. Assume that the updated capacity of knapsack a is smaller than capa_b but larger than capa_c and capa_d. At this point, the knapsacks are re-sorted so that the subsequent state s_{t+1} contains the current capacities c_1 = capa_b, c_2 = capa_a, c_3 = capa_c, and c_4 = capa_d. For the following action in state s_{t+1}, the capacity of b changes, and the knapsacks for c_j are sorted again. When capa_a becomes zero in s_{t+2}, knapsack a becomes invalid; the sorting process then excludes this knapsack and sorts only the valid knapsacks whose capacities are larger than zero.
Algorithm 2 Training process
  • Input: K problem instances with N items, repeat number R, N (the number of items), M (the number of knapsacks), C (max capacity)
  • Output: Trained model, solutions of the K instances
1: Training_max = K × R, i = 0
2: while i < Training_max do
3:     e = i mod K, t = 0
4:     d = random integer in [1, N]
5:     I = {1, ..., d}
6:     c_j = random integer in [min(w_n, n ∈ I), C]
7:     Sort the items in I according to profit over weight
8:     Call Algorithm 3 for greedy sorting of the capacities
9:     while min(w_n, n ∈ I) ≤ c_1 do
10:        u = 0
11:        while u < update interval do
12:            Do action a according to the policy π_θ(a_t|s_t)
13:            if a < |I| then
14:                if w_a ≤ c_1 then
15:                    c_1 ← c_1 − w_a
16:                    x_{a_i} ← 1
17:                end if
18:                I ← I \ {a}
19:            end if
20:            Call Algorithm 3
21:            u ← u + 1
22:            t ← t + 1
23:        end while
24:        Update the global model's θ and θ_v using (11)
25:        Copy the global model to the local model
26:    end while
27:    i ← i + 1
28: end while
Algorithm 3 Greedy sort
  • Input: Capacities and indices of knapsacks indicating c_1, ..., c_V
  • Output: Sorted information of knapsacks indicating c_1, ..., c_V
1: Sort the capacities of knapsacks indicating c_1, ..., c_V
2: j = 1
3: while j ≤ V do
4:     c_j ← capacity of the knapsack with the j-th largest capacity
5:     j ← j + 1
6: end while
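The inner step of Algorithm 2 (lines 14–18) combined with the greedy re-sort of Algorithm 3 can be sketched as follows; the item indexing and data structures are simplified assumptions for illustration.

```python
# Minimal sketch of one environment step: the capacity update of Algorithm 2
# (lines 14-18) followed by the greedy re-sort of Algorithm 3.
def greedy_sort(capacities):
    # Keep only knapsacks with remaining capacity and sort them in descending
    # order, so that c_1 is always the largest available capacity.
    return sorted((c for c in capacities if c > 0), reverse=True)

def step(action, items, capacities):
    """items: dict {index: (profit, weight)}; capacities: list sorted descending."""
    if action in items:
        profit, weight = items[action]
        if weight <= capacities[0]:       # fits the largest knapsack
            capacities[0] -= weight       # c_1 <- c_1 - w_a
        items.pop(action)                 # item is now selected or discarded
    return items, greedy_sort(capacities)
```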

3.5. Constructed Neural Networks

Our parameterized policy model is a DNN-based model. The DNN model is made up of one convolutional layer with two kernels and a stride of two, one fully connected layer with 512 nodes, and one output layer each for the actor and the critic. The fully connected layer's activation function is the Rectified Linear Unit (ReLU), and the actor's output layer's activation function is the sigmoid function.
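One possible reading of this architecture is sketched below, assuming a 1-D convolution with two filters, kernel size 2, and stride 2; the state dimension and number of actions are placeholders that depend on N and M, and no activation is applied after the convolution since none is specified.

```python
import torch
import torch.nn as nn

# A sketch of one possible reading of the described architecture; the
# convolution configuration, state_dim, and n_actions are assumptions.
class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=2, kernel_size=2, stride=2)
        conv_out = 2 * (state_dim // 2)          # flattened convolution output size
        self.fc = nn.Linear(conv_out, 512)       # fully connected layer, 512 nodes
        self.actor = nn.Linear(512, n_actions)   # policy head
        self.critic = nn.Linear(512, 1)          # value head

    def forward(self, state):
        x = self.conv(state.unsqueeze(1))        # (batch, 1, state_dim) -> conv features
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return torch.sigmoid(self.actor(x)), self.critic(x)
```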

4. Performance Evaluation

The proposed algorithm is implemented in PyTorch. For training the A3C-based DNN model, we use a discount factor γ close to 1 and a learning rate of 0.0001. The parameters of the model are initialized with random uniform values within [−0.08, 0.08]. There are three types of problem instances for training the proposed model: random instances (RI), linear instances (LI), and quadratic instances (QI). Random instances have uniformly random integer values in the range [1, 10] for both profit and weight. Linear instances have the same weights as RI, and their profit is the weight plus a random float value in (0, 1). Quadratic instances have the same weights as RI, and their profit is weight × (10 − weight) + 1. In each problem instance, all knapsacks have the same capacity, chosen from {10, 20, 40, 80, 100}. The number of items is 50, and the number of knapsacks is 1, 3, or 5. To deal with diversified problems, we prepare multiple models, and each model solves problems of its own type: one item-set type (RI, LI, or QI), a fixed number of knapsacks (1, 3, or 5), and a fixed maximum capacity C (10, 20, 40, 80, or 100). In the case of 5 knapsacks, only three capacities (10, 20, 40) are used because a capacity larger than 40 is sufficient to select all items. For each problem instance, 1000 data sets are used; the model's parameters are updated every five steps, and the process is repeated at least 40 times. Every test is executed on a workstation equipped with an AMD Ryzen 9 5900X 12-core processor (3.70 GHz) and 32 GB RAM.
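The three instance types can be generated as in the following sketch, which follows the ranges stated above; the function name and the use of Python's random module are our own choices.

```python
import random

# Minimal sketch of the three training-instance types described above;
# the profit/weight ranges follow the text, the function name is ours.
def generate_instance(kind, n_items=50):
    weights = [random.randint(1, 10) for _ in range(n_items)]
    if kind == "RI":                              # random instances
        profits = [random.randint(1, 10) for _ in range(n_items)]
    elif kind == "LI":                            # linear instances
        profits = [w + random.random() for w in weights]
    else:                                         # "QI", quadratic instances
        profits = [w * (10 - w) + 1 for w in weights]
    return profits, weights
```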
The comparable methods for the test are as follows.
  • Greedy algorithm: A simple greedy algorithm that repeatedly picks the item with the maximum profit-to-weight ratio and tries to assign it to the knapsack with the maximum remaining capacity (a minimal sketch is given after this list).
  • Random solution: For each instance, we randomly generate solutions 1000 times and select the one with the maximum profit ratio.
  • Gurobi optimizer: We use the mathematical optimization tool Gurobi to obtain optimal solutions.
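A minimal sketch of the greedy baseline referenced above is shown below; tie-breaking and per-knapsack assignment bookkeeping are simplified.

```python
# Minimal sketch of the greedy baseline: best profit/weight ratio first,
# each item tried against the knapsack with the largest remaining capacity.
def greedy_mkp(profits, weights, capacities):
    caps = sorted(capacities, reverse=True)
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    total = 0
    for i in order:
        caps.sort(reverse=True)          # knapsack with maximum remaining capacity
        if weights[i] <= caps[0]:
            caps[0] -= weights[i]
            total += profits[i]
    return total
```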

4.1. Training Process

While training the DNN model, we summed the reward at each step of each episode and counted the number of steps needed to end each episode. Figure 3 and Figure 4 show the reward and the number of steps, respectively, over the first 40,000 episodes. The x-axis in both figures represents units of 1000 consecutive episodes.
Figure 3 shows the accumulated reward after each unit of 1000 episodes. For all three instance types, the reward accumulated during the second unit is greater than that of the first unit. Subsequently, oscillation occurs, mainly caused by the randomly given number of items and capacity sizes in each episode. Setting the oscillation aside, the accumulated rewards converge to certain values: for RI, LI, and QI, respectively, the rewards converge to 8000, 5000, and 8500 with three knapsacks, and to 8500, 6000, and 9000 with five knapsacks.
Figure 4 shows the accumulated number of steps during each unit of episodes. All three instance types go through an enormous number of steps in the first 1000 episodes but far fewer steps in the second 1000 episodes. With one knapsack, the number of steps converges to near 10,000 for all instance types. With three knapsacks, it converges to 17,500 for RI, 20,000 for LI, and between 15,000 and 17,500 for QI. With five knapsacks, it converges to between 17,500 and 20,000 for RI, 20,000 for LI, and between 15,000 and 17,500 for QI.

4.2. Result of Solutions

In Section 4.1, we observed that the trained models achieve better performance in terms of reward and number of steps. In this section, we verify the performance of the trained models on unseen problems. The trained models and the baseline methods solved 1000 problems per problem type; the problem types are described in the first paragraph of Section 4. We calculated average values based on the profits of the selected items in the solutions. Table 1, Table 2 and Table 3 show the profit of the selected items divided by the profit of all items. The optimal result of each instance represents the maximum feasible profit ratio for each number of knapsacks and capacity. Table 4, Table 5 and Table 6 show the percentage of the profit derived by each algorithm over the profit of the optimal solution; each field represents how close the derived solution is to the optimal one. Table 7, Table 8 and Table 9 show the total profit of the selected items. Table 10, Table 11 and Table 12 show the average profit of the selected items per capacity; each field is the average aggregate profit of the selected items in all knapsacks for a given capacity.

4.2.1. Random Instances

The optimal result in Table 1 shows a profit ratio of the selected items between 0.1743 and 0.6573 for a single knapsack, which means that the maximized solution for a single knapsack can reach a profit ratio of at most 0.1743∼0.6573. Furthermore, as shown in Table 4, both the proposed algorithm and the greedy algorithm reach 99% closeness for all capacities in the single-knapsack case. In Table 7, the proposed algorithm's average profit for a single knapsack differs by only 0.001 from the greedy algorithm's. In the 3-knapsack case with capacities 10 and 20, the proposed algorithm is superior to the random solution and the greedy algorithm. In the 5-knapsack case with capacity 10, the proposed algorithm outperforms the greedy algorithm, earning a profit of 166 more than the greedy algorithm over the 1000 problems. In terms of capacity, as shown in Table 10, the average profit of the proposed algorithm is the highest, or tied for the highest, across all capacities.

4.2.2. Linear Instances

For a single knapsack, the profit ratio of the selected items can be up to 0.0488∼0.3911, as shown in Table 2. In terms of closeness to the optimal solution, the closeness grows with the capacity: the proposed algorithm reaches 98.7% at capacity 10 and 99.5% at capacity 100, and it outperforms the other algorithms at capacities 40, 80, and 100. In the three-knapsack case, the proposed algorithm shows outstanding results except at capacity 80. In the five-knapsack case, the proposed algorithm and the greedy method fall short of the random solution except at capacity 20, where the proposed algorithm outperforms the other two algorithms. The average profit of the proposed algorithm is the highest across all capacities, as shown in Table 11.

4.2.3. Quadratic Instances

For a single knapsack, the profit ratio of the selected items can be up to 0.1049∼0.6767, as shown in Table 3. In terms of closeness to the optimal solution, the proposed algorithm reaches 99% across all capacities and outperforms the other algorithms. The average profit of the proposed algorithm is 3.171, 5.219, and 7.92 higher than that of the greedy algorithm in the single-knapsack, three-knapsack, and five-knapsack cases, respectively. Over the 1000 problems, the proposed algorithm earns a profit of 5055 more than the greedy algorithm. In terms of capacity, as shown in Table 12, the average profit of the proposed algorithm is the highest across all capacities.

4.3. Computational Time

Table 13, Table 14 and Table 15 show the computation time for solving the 1000 test problems in RI, LI, and QI, respectively. The proposed method is the second fastest across all tables, trailing only the greedy method; both the optimal and random solutions fall behind these two methods. The computation times of the proposed algorithm and the optimal solution are heavily influenced by the capacity of each knapsack.

4.4. Overall Evaluation

The greedy algorithm is very fast, but its performance was not the best in many of the experiments. In particular, the greedy algorithm is vulnerable to instances where profit and weight are correlated, such as LI. The proposed method provides a more robust solution, as it produced excellent results in a variety of situations, including cases where profit and weight are correlated.

5. Conclusions

In this paper, we proposed a DRL-based solution to solve MKP efficiently. The proposed method trains the neural model using the A3C algorithm. The experiments show that the proposed method achieves higher performance than the greedy algorithm and the random solution. The results on random, linear, and quadratic instances demonstrate that the proposed algorithm is robust. Consequently, we verified that the proposed method is appropriate for solving real-world MKPs with varying profits, weights, and capacities.

Author Contributions

Conceptualization, J.K. and H.L.; methodology, G.S. and H.L.; investigation, G.S.; formal analysis, G.S. and H.L.; validation, G.S. and H.L.; writing—original draft preparation, G.S. and H.L.; writing—review and editing, G.S. and H.L.; supervision, H.L.; project administration, J.K. and S.Y.R.; funding acquisition, S.Y.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Agency for Defense Development, Republic of Korea, under Grant UD190020FD.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Assi, M.; Haraty, R.A. A Survey of the Knapsack Problem. In Proceedings of the International Arab Conference on Information Technology (ACIT), Werdanye, Lebanon, 28–30 November 2018; pp. 1–6.
  2. Lahyani, R.; Chebil, K.; Khemakhem, M.; Coelho, L.C. Matheuristics for solving the multiple knapsack problem with setup. Comput. Ind. Eng. 2019, 129, 76–89.
  3. Kumaraguruparan, N.; Sivaramakrishnan, H.; Sapatnekar, S.S. Residential task scheduling under dynamic pricing using the multiple knapsack method. In Proceedings of the IEEE PES Innovative Smart Grid Technologies (ISGT), Washington, DC, USA, 16–20 January 2012; pp. 1–6.
  4. Ketykó, I.; Kecskés, L.; Nemes, C.; Farkas, L. Multi-user computation offloading as multiple knapsack problem for 5G mobile edge computing. In Proceedings of the European Conference on Networks and Communications (EuCNC), Athens, Greece, 27–30 June 2016; pp. 225–229.
  5. Cappanera, P.; Paganelli, F.; Paradiso, F. VNF placement for service chaining in a distributed cloud environment with multiple stakeholders. Comput. Commun. 2019, 133, 24–40.
  6. Martello, S.; Toth, P. Knapsack Problems: Algorithms and Computer Implementations; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1990.
  7. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
  8. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  9. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
  10. Gu, S.; Hao, T. A pointer network based deep learning algorithm for 0–1 knapsack problem. In Proceedings of the International Conference on Advanced Computational Intelligence (ICACI), Xiamen, China, 29–31 March 2018; pp. 473–477.
  11. Afshar, R.R.; Zhang, Y.; Firat, M.; Kaymak, U. A State Aggregation Approach for Solving Knapsack Problem with Deep Reinforcement Learning. In Proceedings of the Asian Conference on Machine Learning (PMLR), Bangkok, Thailand, 18–20 November 2020; pp. 81–96.
  12. Dai, H.; Dai, B.; Song, L. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning (PMLR), New York, NY, USA, 19–24 June 2016; pp. 2702–2711.
  13. Dai, H.; Khalil, E.B.; Zhang, Y.; Dilkina, B.; Song, L. Learning combinatorial optimization algorithms over graphs. arXiv 2017, arXiv:1704.01665.
  14. Kundu, O.; Dutta, S.; Kumar, S. Deep-pack: A vision-based 2d online bin packing algorithm with deep reinforcement learning. In Proceedings of the International Conference on Robot and Human Interactive Communication (RO-MAN), New Delhi, India, 14–18 October 2019; pp. 1–7.
  15. Zhang, Z.; Liu, H.; Zhou, M.; Wang, J. Solving Dynamic Traveling Salesman Problems With Deep Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2021.
  16. Nazari, M.; Oroojlooy, A.; Snyder, L.; Takác, M. Reinforcement learning for solving the vehicle routing problem. In Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018.
  17. Chen, X.; Tian, Y. Learning to perform local rewriting for combinatorial optimization. In Proceedings of the 33rd International Conference on Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
  18. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. arXiv 2015, arXiv:1506.03134.
  19. Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural combinatorial optimization with reinforcement learning. arXiv 2016, arXiv:1611.09940.
  20. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
  21. Pisinger, D. An exact algorithm for large multiple knapsack problems. Eur. J. Oper. Res. 1999, 114, 528–541.
  22. Dell’Amico, M.; Delorme, M.; Iori, M.; Martello, S. Mathematical models and decomposition methods for the multiple knapsack problem. Eur. J. Oper. Res. 2019, 274, 886–899.
  23. Srinivas, M.; Patnaik, L.M. Genetic algorithms: A survey. Computer 1994, 27, 17–26.
  24. Khuri, S.; Bäck, T.; Heitkötter, J. The zero/one multiple knapsack problem and genetic algorithms. In Proceedings of the 1994 ACM Symposium on Applied Computing, Phoenix, AZ, USA, 6–8 March 1994; pp. 188–193.
  25. Falkenauer, E. A new representation and operators for genetic algorithms applied to grouping problems. Evol. Comput. 1994, 2, 123–144.
  26. Fukunaga, A.S. A new grouping genetic algorithm for the multiple knapsack problem. In Proceedings of the 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 2225–2232.
  27. Han, K.H.; Kim, J.H. Quantum-inspired evolutionary algorithm for a class of combinatorial optimization. IEEE Trans. Evol. Comput. 2002, 6, 580–593.
  28. Hu, H.; Zhang, X.; Yan, X.; Wang, L.; Xu, Y. Solving a new 3d bin packing problem with deep reinforcement learning method. arXiv 2017, arXiv:1708.05930.
  29. Rezoug, A.; Bader-El-Den, M.; Boughaci, D. Application of Supervised Machine Learning Methods on the Multidimensional Knapsack Problem. Neural Process. Lett. 2021, 1–20.
  30. García, J.; Lalla Ruiz, E.; Voß, S.; Lopez Droguett, E. Enhancing a machine learning binarization framework by perturbation operators: Analysis on the multidimensional knapsack problem. Int. J. Mach. Learn. Cybern. 2020, 11, 1951–1970.
  31. Laterre, A.; Fu, Y.; Jabri, M.K.; Cohen, A.S.; Kas, D.; Hajjar, K.; Dahl, T.S.; Kerkeni, A.; Beguir, K. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv 2018, arXiv:1807.01672.
  32. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (PMLR), Beijing, China, 21–26 June 2014; pp. 387–395.
  33. Silver, D. Lectures on Reinforcement Learning. 2015. Available online: https://www.davidsilver.uk/teaching/ (accessed on 11 March 2022).
Figure 1. State of multiple knapsack problem.
Figure 2. Capacities of knapsacks in each state.
Figure 3. DRL training information of reward in episodes. (a) Reward of RI; (b) Reward of LI; (c) Reward of QI.
Figure 4. DRL training information of step in episodes. (a) Step of RI; (b) Step of LI; (c) Step of QI.
Table 1. Profit ratio of selected items for RI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 0.1507 | 0.1727 | 0.1727 | 0.1743
  |    | 20 | 0.2064 | 0.2652 | 0.2652 | 0.2672
  |    | 40 | 0.3002 | 0.3960 | 0.3958 | 0.3979
  |    | 80 | 0.4638 | 0.5853 | 0.5853 | 0.5873
  |    | 100 | 0.5354 | 0.6553 | 0.6551 | 0.6573
3 | 50 | 10 | 0.2786 | 0.3230 | 0.3256 | 0.3376
  |    | 20 | 0.4016 | 0.4855 | 0.4876 | 0.4995
  |    | 40 | 0.6138 | 0.7165 | 0.7166 | 0.7293
  |    | 80 | 0.9285 | 0.9678 | 0.9678 | 0.9738
  |    | 100 | 0.9952 | 0.9985 | 0.9985 | 0.9993
5 | 50 | 10 | 0.3665 | 0.4235 | 0.4268 | 0.4530
  |    | 20 | 0.5449 | 0.6323 | 0.6323 | 0.6609
  |    | 40 | 0.8239 | 0.9029 | 0.9030 | 0.9218
Table 2. Profit ratio of selected items for LI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 0.0467 | 0.0483 | 0.0483 | 0.0488
  |    | 20 | 0.0848 | 0.0893 | 0.0893 | 0.0901
  |    | 40 | 0.1587 | 0.1664 | 0.1664 | 0.1675
  |    | 80 | 0.3060 | 0.3153 | 0.3156 | 0.3170
  |    | 100 | 0.3798 | 0.3893 | 0.3894 | 0.3911
3 | 50 | 10 | 0.1238 | 0.1244 | 0.1263 | 0.1299
  |    | 20 | 0.2347 | 0.2377 | 0.2398 | 0.2442
  |    | 40 | 0.4522 | 0.4555 | 0.4581 | 0.4638
  |    | 80 | 0.8720 | 0.8642 | 0.8665 | 0.8830
  |    | 100 | 0.9914 | 0.9917 | 0.9929 | 0.9964
5 | 50 | 10 | 0.1956 | 0.1920 | 0.1920 | 0.2061
  |    | 20 | 0.3776 | 0.3751 | 0.3784 | 0.3920
  |    | 40 | 0.7240 | 0.7191 | 0.7208 | 0.7438
Table 3. Profit ratio of selected items for QI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 0.1010 | 0.1009 | 0.1046 | 0.1049
  |    | 20 | 0.1733 | 0.1888 | 0.1934 | 0.1951
  |    | 40 | 0.2847 | 0.3393 | 0.3436 | 0.3467
  |    | 80 | 0.4679 | 0.5826 | 0.5856 | 0.5879
  |    | 100 | 0.5450 | 0.6724 | 0.6749 | 0.6767
3 | 50 | 10 | 0.2496 | 0.2502 | 0.2634 | 0.2755
  |    | 20 | 0.3977 | 0.4463 | 0.4635 | 0.4760
  |    | 40 | 0.6314 | 0.7353 | 0.7351 | 0.7606
  |    | 80 | 0.9426 | 0.9869 | 0.9866 | 0.9911
  |    | 100 | 0.9970 | 0.9995 | 0.9995 | 0.9998
5 | 50 | 10 | 0.3574 | 0.3693 | 0.3693 | 0.4142
  |    | 20 | 0.5630 | 0.6284 | 0.6507 | 0.6819
  |    | 40 | 0.8478 | 0.9308 | 0.9355 | 0.9550
Table 4. The closeness (%) to the optimal solution in RI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 86.46 | 99.10 | 99.10 | -
  |    | 20 | 77.24 | 99.28 | 99.28 | -
  |    | 40 | 75.46 | 99.52 | 99.51 | -
  |    | 80 | 79.01 | 99.66 | 99.66 | -
  |    | 100 | 81.49 | 99.69 | 99.70 | -
  |    | Average | 79.93 | 99.45 | 99.45 | -
3 | 50 | 10 | 82.56 | 95.70 | 96.44 | -
  |    | 20 | 80.44 | 97.21 | 97.62 | -
  |    | 40 | 84.21 | 98.25 | 98.25 | -
  |    | 80 | 95.34 | 99.38 | 99.38 | -
  |    | 100 | 99.59 | 99.92 | 99.92 | -
  |    | Average | 88.43 | 98.09 | 98.32 | -
5 | 50 | 10 | 80.93 | 93.49 | 94.21 | -
  |    | 20 | 82.49 | 95.68 | 95.68 | -
  |    | 40 | 89.39 | 97.94 | 97.95 | -
  |    | Average | 84.27 | 95.70 | 95.95 | -
  |    | Average | 84.32 | 98.06 | 98.21 | -
Table 5. The closeness (%) to the optimal solution in LI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 95.59 | 98.76 | 98.74 | -
  |    | 20 | 94.08 | 99.11 | 99.11 | -
  |    | 40 | 94.75 | 99.32 | 99.34 | -
  |    | 80 | 96.54 | 99.46 | 99.56 | -
  |    | 100 | 97.12 | 99.52 | 99.54 | -
  |    | Average | 95.62 | 99.23 | 99.26 | -
3 | 50 | 10 | 95.34 | 95.74 | 97.26 | -
  |    | 20 | 96.12 | 97.30 | 98.18 | -
  |    | 40 | 97.50 | 98.21 | 98.78 | -
  |    | 80 | 98.76 | 97.88 | 98.13 | -
  |    | 100 | 99.46 | 99.48 | 99.61 | -
  |    | Average | 97.44 | 97.72 | 98.39 | -
5 | 50 | 10 | 94.92 | 93.10 | 93.10 | -
  |    | 20 | 96.33 | 95.68 | 96.52 | -
  |    | 40 | 97.35 | 96.69 | 96.91 | -
  |    | Average | 96.20 | 95.16 | 95.51 | -
  |    | Average | 96.45 | 97.71 | 98.06 | -
Table 6. The closeness (%) to the optimal solution in QI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 96.22 | 96.12 | 99.67 | -
  |    | 20 | 88.83 | 96.71 | 99.11 | -
  |    | 40 | 82.24 | 97.90 | 99.13 | -
  |    | 80 | 79.71 | 99.10 | 99.61 | -
  |    | 100 | 80.67 | 99.35 | 99.73 | -
  |    | Average | 85.53 | 97.84 | 99.45 | -
3 | 50 | 10 | 90.57 | 90.82 | 95.61 | -
  |    | 20 | 83.62 | 93.75 | 97.38 | -
  |    | 40 | 83.08 | 96.67 | 96.64 | -
  |    | 80 | 95.13 | 99.57 | 99.54 | -
  |    | 100 | 99.73 | 99.98 | 99.98 | -
  |    | Average | 90.43 | 96.16 | 97.83 | -
5 | 50 | 10 | 86.32 | 89.20 | 89.20 | -
  |    | 20 | 82.63 | 92.16 | 95.46 | -
  |    | 40 | 88.80 | 97.45 | 97.95 | -
  |    | Average | 85.92 | 92.93 | 94.20 | -
  |    | Average | 87.51 | 96.06 | 97.61 | -
Table 7. Profit of selected items in RI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 41.292 | 47.326 | 47.326 | 47.758
  |    | 20 | 56.822 | 73.034 | 73.034 | 73.566
  |    | 40 | 82.118 | 108.303 | 108.293 | 108.825
  |    | 80 | 127.443 | 160.748 | 160.748 | 161.3
  |    | 100 | 147.755 | 180.763 | 180.765 | 181.318
  |    | Average | 91.086 | 114.0348 | 114.0332 | 114.5534
3 | 50 | 10 | 76.192 | 88.319 | 89.002 | 92.287
  |    | 20 | 110.333 | 133.321 | 133.89 | 137.154
  |    | 40 | 168.359 | 196.425 | 196.437 | 199.933
  |    | 80 | 255.865 | 266.714 | 266.714 | 268.37
  |    | 100 | 272.998 | 273.911 | 273.914 | 274.129
  |    | Average | 176.7494 | 191.738 | 191.9914 | 194.3746
5 | 50 | 10 | 100.253 | 115.813 | 116.701 | 123.873
  |    | 20 | 150.122 | 174.13 | 174.13 | 181.996
  |    | 40 | 226.462 | 248.135 | 248.157 | 253.353
  |    | Average | 158.9457 | 179.3593 | 179.6627 | 186.4073
  |    | Average | 139.693 | 158.996 | 159.162 | 161.836
Table 8. Profit of selected items in LI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 13.893 | 14.353 | 14.350 | 14.533
  |    | 20 | 25.248 | 26.600 | 26.600 | 26.838
  |    | 40 | 47.588 | 49.880 | 49.891 | 50.223
  |    | 80 | 91.610 | 94.383 | 94.482 | 94.896
  |    | 100 | 113.521 | 116.333 | 116.355 | 116.891
  |    | Average | 58.372 | 60.310 | 60.336 | 60.676
3 | 50 | 10 | 36.809 | 36.963 | 37.552 | 38.609
  |    | 20 | 69.819 | 70.676 | 71.314 | 72.634
  |    | 40 | 134.866 | 135.845 | 136.632 | 138.321
  |    | 80 | 260.752 | 258.425 | 259.095 | 264.028
  |    | 100 | 297.773 | 297.832 | 298.217 | 299.392
  |    | Average | 160.004 | 159.948 | 160.562 | 162.597
5 | 50 | 10 | 58.286 | 57.168 | 57.168 | 61.405
  |    | 20 | 112.250 | 111.501 | 112.476 | 116.531
  |    | 40 | 216.931 | 215.463 | 215.970 | 222.845
  |    | Average | 129.156 | 128.044 | 128.538 | 133.594
  |    | Average | 113.796 | 114.263 | 114.623 | 116.704
Table 9. Profit of selected items in QI set.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 87.785 | 87.701 | 90.937 | 91.238
  |    | 20 | 150.314 | 163.655 | 167.717 | 169.217
  |    | 40 | 248.873 | 296.239 | 299.961 | 302.608
  |    | 80 | 409.533 | 509.123 | 511.738 | 513.757
  |    | 100 | 476.957 | 587.385 | 589.609 | 591.216
  |    | Average | 274.692 | 328.821 | 331.992 | 333.607
3 | 50 | 10 | 216.993 | 217.575 | 229.067 | 239.574
  |    | 20 | 347.519 | 389.616 | 404.681 | 415.581
  |    | 40 | 549.914 | 639.802 | 639.619 | 661.874
  |    | 80 | 825.29 | 863.778 | 863.491 | 867.508
  |    | 100 | 873.759 | 875.893 | 875.901 | 876.105
  |    | Average | 562.695 | 597.333 | 602.552 | 612.128
5 | 50 | 10 | 311.933 | 322.349 | 322.349 | 361.385
  |    | 20 | 489.821 | 546.285 | 565.848 | 592.788
  |    | 40 | 741.96 | 814.227 | 818.424 | 835.539
  |    | Average | 514.571 | 560.954 | 568.874 | 596.571
  |    | Average | 440.819 | 485.664 | 490.719 | 501.415
Table 10. Average profit of selected items per capacity in RI.

N | Capacity | Random | Greedy | Proposed | Optimal
50 | 10 | 72.579 | 83.819 | 84.343 | 87.973
   | 20 | 105.759 | 126.828 | 127.018 | 130.905
   | 40 | 158.980 | 184.288 | 184.296 | 187.370
   | 80 | 191.654 | 213.731 | 213.731 | 214.835
   | 100 | 210.377 | 227.337 | 227.340 | 227.724
Table 11. Average profit of selected items per capacity in LI.

N | Capacity | Random | Greedy | Proposed | Optimal
50 | 10 | 36.329 | 36.161 | 36.357 | 38.182
   | 20 | 69.106 | 69.592 | 70.130 | 72.001
   | 40 | 133.128 | 133.729 | 134.164 | 137.130
   | 80 | 176.181 | 176.404 | 176.788 | 179.462
   | 100 | 205.647 | 207.083 | 207.286 | 208.142
Table 12. Average profit of selected items per capacity in QI.

N | Capacity | Random | Greedy | Proposed | Optimal
50 | 10 | 205.570 | 209.208 | 214.118 | 230.732
   | 20 | 329.218 | 366.519 | 379.415 | 392.529
   | 40 | 513.582 | 583.423 | 586.001 | 600.007
   | 80 | 617.412 | 686.451 | 687.615 | 690.633
   | 100 | 675.358 | 731.639 | 732.755 | 733.661
Table 13. Computation time (seconds) in RI.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 42.548 | 0.094 | 3.121 | 10.875
  |    | 20 | 43.171 | 0.109 | 4.438 | 12.043
  |    | 40 | 44.491 | 0.109 | 6.256 | 14.060
  |    | 80 | 47.343 | 0.125 | 8.337 | 16.530
  |    | 100 | 48.260 | 0.125 | 8.936 | 17.406
3 | 50 | 10 | 43.115 | 0.172 | 6.846 | 20.756
  |    | 20 | 45.137 | 0.172 | 10.267 | 21.039
  |    | 40 | 48.254 | 0.172 | 11.508 | 25.326
  |    | 80 | 51.607 | 0.169 | 14.087 | 29.009
  |    | 100 | 52.505 | 0.172 | 15.318 | 27.187
5 | 50 | 10 | 44.645 | 0.203 | 13.324 | 24.618
  |    | 20 | 47.890 | 0.188 | 11.789 | 29.347
  |    | 40 | 50.471 | 0.188 | 13.796 | 32.295
  |    | Average | 46.880 | 0.154 | 9.848 | 21.576
Table 14. Computation time (seconds) in LI.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 52.549 | 0.109 | 3.396 | 8.696
  |    | 20 | 53.002 | 0.141 | 4.910 | 9.176
  |    | 40 | 44.479 | 0.125 | 6.915 | 10.281
  |    | 80 | 47.641 | 0.125 | 9.117 | 14.083
  |    | 100 | 48.550 | 0.109 | 10.374 | 14.842
3 | 50 | 10 | 51.214 | 0.188 | 9.360 | 19.139
  |    | 20 | 45.742 | 0.172 | 12.804 | 27.927
  |    | 40 | 47.496 | 0.172 | 13.280 | 32.529
  |    | 80 | 51.170 | 0.172 | 14.493 | 42.591
  |    | 100 | 51.073 | 0.175 | 15.415 | 28.046
5 | 50 | 10 | 44.677 | 0.203 | 11.613 | 33.515
  |    | 20 | 47.423 | 0.187 | 14.379 | 40.987
  |    | 40 | 51.020 | 0.172 | 19.191 | 52.013
  |    | Average | 48.926 | 0.158 | 11.173 | 25.679
Table 15. Computation time (seconds) in QI.

M | N | Capacity | Random | Greedy | Proposed | Optimal
1 | 50 | 10 | 42.438 | 0.109 | 3.222 | 8.800
  |    | 20 | 43.232 | 0.109 | 3.719 | 9.566
  |    | 40 | 44.211 | 0.125 | 9.369 | 10.872
  |    | 80 | 46.144 | 0.125 | 8.274 | 12.444
  |    | 100 | 47.784 | 0.125 | 12.986 | 12.904
3 | 50 | 10 | 43.792 | 0.172 | 6.445 | 16.455
  |    | 20 | 45.239 | 0.171 | 9.554 | 20.504
  |    | 40 | 48.214 | 0.172 | 9.661 | 24.433
  |    | 80 | 51.516 | 0.172 | 14.420 | 22.745
  |    | 100 | 50.897 | 0.172 | 15.891 | 21.201
5 | 50 | 10 | 45.092 | 0.199 | 5.801 | 24.654
  |    | 20 | 47.295 | 0.203 | 9.754 | 46.842
  |    | 40 | 49.980 | 0.183 | 13.736 | 33.331
  |    | Average | 46.603 | 0.157 | 9.449 | 20.366