Review

Review of Stochastic Dynamic Vehicle Routing in the Evolving Urban Logistics Environment

1 Faculty of Transport and Traffic Sciences, University of Zagreb, Vukelićeva 4, 10000 Zagreb, Croatia
2 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(1), 28; https://doi.org/10.3390/math12010028
Submission received: 1 December 2023 / Revised: 18 December 2023 / Accepted: 19 December 2023 / Published: 21 December 2023
(This article belongs to the Special Issue Advances in Genetic Programming and Soft Computing)

Abstract:
Urban logistics encompasses transportation and delivery operations within densely populated urban areas. It faces significant challenges from the evolving dynamic and stochastic nature of on-demand and conventional logistics services. Further challenges arise as service models shift towards crowd-sourced platforms. As a result, “traditional” deterministic approaches no longer adequately fulfil constantly evolving customer expectations. To maintain competitiveness, logistics service providers must adopt proactive and anticipatory systems that dynamically model and evaluate probable (future) events, i.e., stochastic information. These events manifest in problem characteristics such as customer requests, demands, travel times, and parking availability. The Stochastic Dynamic Vehicle Routing Problem (SDVRP) addresses the dynamic and stochastic information inherent in urban logistics. This paper aims to analyse the key concepts, challenges, recent advancements, and opportunities in the evolving urban logistics landscape and to assess the evolution from classical VRPs, via DVRPs, to state-of-the-art SDVRPs. Further, coupled with non-reactive techniques, this paper provides an in-depth overview of cutting-edge model-based and model-free reactive solution approaches. Although potent, these approaches become restrictive due to the “curse of dimensionality”. Sacrificing granularity for scalability, researchers have opted for aggregation and decomposition techniques to overcome this problem, and recent approaches explore solutions using deep learning. In the scope of this research, we observed that comprehensively addressing real-world SDVRPs still encounters a set of challenges, emphasising a substantial gap in the research field that warrants further exploration.

1. Introduction

Urban logistics encapsulate the management and coordination of transportation and delivery operations within densely populated urban areas by accounting for diverse factors like population density, traffic variability, customer behaviour, and environmental regulations. A pivotal challenge within this domain is the last mile, marked by its high costs, inefficiency, and environmental impact [1]. This challenge is further amplified by the surge in contemporary online on-demand services, which intertwine with the Stochastic Dynamic Vehicle Routing Problem, adding layers of complexity to an already intricate urban logistics landscape. As cities evolve, the dynamics of these challenges demand innovative solutions for efficient and sustainable urban logistics frameworks.
As a result of urban logistics planning, customer orders are allocated to vehicles by determining the most efficient visiting sequence of customers and routes [1], i.e., by solving an instance of the Vehicle Routing Problem (VRP). Traditionally, urban logistics problems were approached from a deterministic perspective, assuming all relevant data were known exactly and collected before problem solving. Moreover, the data were considered static, with no consideration for changes over time. This approach relied on predetermined routes and fixed delivery schedules. However, the rise of dynamic online platforms has made this approach inadequate. On-demand services have introduced stochasticity and complexity due to evolving customer behaviour, higher order volumes, stricter delivery windows, and the demand for fast and reliable service [1]. Therefore, a paradigm shift towards real-time logistics has become imperative. Urban logistics now face numerous sources of stochastic information, including fluctuating demand, specific requests and preferences, variable travel times, parking availability, as well as crowd-sourced drivers and their behaviours. To address these challenges and maintain competitiveness, logistics service providers must adopt proactive and anticipatory approaches to route planning and execution. As a result, the concept of Stochastic Dynamic VRP (SDVRP) has emerged in the VRP literature. SDVRP accounts for the dynamic and stochastic environment of urban logistics by dynamically modelling and evaluating probable (future) events, i.e., stochastic information. As observed in the literature, the inclusion and evaluation of potential future events consequently improves the level of service provided to customers.

1.1. Recent Literature Reviews and Scientific Contribution

Extensive research has been conducted on SDVRPs, leading to significant advancements in understanding and addressing the challenges within this field. Notably, to the best of our knowledge, several recent comprehensive literature reviews, including those presented in [2,3,4,5,6,7,8], have synthesised existing research, identified key trends, and shed light on the complexities of the domain.
Building on the evolution and quality of information and the real-time VRP problem classification presented in [9], the authors in [2] classified VRPs into four categories: Static and Deterministic (SD), Static and Stochastic (SS), Dynamic and Deterministic (DD), and Dynamic and Stochastic (DS). The review followed with a survey of the state-of-the-art solution techniques for dynamic routing.
Within Dynamic VRPs (DVRPs), researchers have proposed various taxonomy frameworks to categorise and comprehend the different dimensions of the problem effectively. A comprehensive DVRP taxonomy encompassing 11 main criteria, each followed by its main variants, was proposed in [4]. Further enriching the proposed taxonomy, the authors in [5] introduced a second axis, a solution method taxonomy. This taxonomy distinguished four leading solution approaches and three modes of algorithm execution: offline, online, and hybrid. The review followed with a comprehensive analysis of relevant characteristics of the problem, such as the most frequent sources of dynamism and stochasticity, solution methods, and applications.
While the latter reviews covered both the static and dynamic axes of VRPs, the reviews presented in [3,6,7,8] focused exclusively on the DS aspects of VRPs. A characterisation based on the optimisation algorithms, distinguishing between methods relying on precomputed decisions (offline) and online computation, was proposed in [3]. The survey further undertook an in-depth overview of the work on stochastic aspects such as travel times, demands, customer requests, and papers combining multiple stochastic sources. The review in [6] provided a comprehensive discussion on SDVRPs through the prism of prescriptive analytics. The review included definitions of methodology, approximation architectures, primary sources of uncertainty and their modelling approaches, categorisations of problem domains, and utilised solution approaches.
A detailed review of the SDVRP literature focusing on Reinforcement Learning (RL) was provided in [8]. The authors shed light on the complexities inherent in SDVRPs, mainly the curse of dimensionality. The complexities manifested in the literature via a clear division of solution approaches incorporating or excluding routing information from the state space. The ones excluding the routing information focused primarily on the feasibility and benefit of assigning requests to vehicles and employed prompt heuristics for route constructions. On the other hand, at the cost of increased complexity and restricted action space, route data enabled more refined and informed actions. The authors further highlighted the need for a combined approach of searching a combinatorial action space while evaluating it via RL. Focusing their comprehensive review on DVRPs with random arrivals of customer requests, the authors in [7] proposed a taxonomy that classifies the DVRP based on three criteria: source of dynamism, request type, and planning horizon.
This paper aims to analyse the key concepts, challenges, recent advancements and opportunities in the evolving urban logistics landscape and assess the evolution from classical VRPs, via DVRPs, to state-of-the-art SDVRPs. The paper’s primary contribution lies in describing computational methods, including exact, heuristic, metaheuristic, and contemporary policy-based approaches. Unlike prior reviews with a limited focus on specific domain solution approaches, this paper embraces non-reactive techniques followed by an in-depth overview of a diverse range of cutting-edge model-based and model-free reactive solutions. Additionally, the enumeration and description of open-source datasets set this review apart, offering a valuable resource for SDVRP researchers and practitioners. This inclusive approach distinguishes this work by addressing various aspects simultaneously, enhancing its utility and practicality compared to previous reviews presented in [2,3,4,5,6,7,8]. The contributions of this paper are outlined as follows:
  • A chronological and comprehensive perspective on the progression of solution approaches in the VRP domain, evolving from methods found in classical VRPs, via DVRPs, to state-of-the-art SDVRPs.
  • An in-depth overview of non-reactive and reactive solution approaches, emphasising model-based and model-free cutting-edge reactive methods and challenges induced by the curse of dimensionality.
  • An enhanced, detailed literature classification based on Markov Decision Process (MDP) reactive solution approaches.
  • An enumeration and description of relevant open-source datasets employed in the SDVRP literature.

1.2. Organisation of This Paper

The remainder of the paper is structured as follows. Section 2 presents an overview of distinct categories of VRPs and introduces traditional heuristic and metaheuristic VRP approaches. Section 3 shifts the focus to the taxonomy and challenges of Dynamic VRP (DVRP), accompanied by a review of core Deterministic DVRP (DDVRP) solution methods. Section 4 classifies solution approaches and provides an in-depth overview of cutting-edge model-based and model-free reactive solution methods. Section 5 highlights the absence of standard benchmarks and lists and describes open-source SDVRP datasets in the literature. Section 6 finishes with a critical overview of the reviewed literature, followed by a discussion on challenges and opportunities for future research in the SDVRP domain. Finally, Section 7 serves as the conclusion, emphasising the need for SDVRPs and the challenges faced in the domain.

2. Vehicle Routing Problem

The Vehicle Routing Problem (VRP) is a combinatorial optimisation problem that aims to determine the optimal set of service sequences and routes from a depot to geographically dispersed customers, considering operational constraints [10]. As illustrated in Figure 1, assigning customers to vehicles and determining the sequence of visits for each vehicle are fundamental elements in finding a solution to the VRP.

2.1. Evolution and Quality of Information

From an information perspective, VRPs involve the evolution and quality of information [11]. The evolution of information in VRPs refers to the changing nature (static or dynamic) of available information during route execution. This includes adapting to changes, such as incorporating new customer requests that arise after the vehicle has already left the depot. The quality of information, in contrast, refers to the degree of uncertainty (deterministic or stochastic) in the available data. This uncertainty highlights the potential variability in the accuracy and reliability of the information. For instance, a customer’s demand may be known only as a range estimate rather than an exact value.
Furthermore, decision makers can design service sequences either a priori or online. A priori refers to planning the sequence before the vehicle departs, while online involves adjusting or constructing the sequence after the vehicle has left the depot. Based on the evolution and quality of information and the real-time VRP problem classification initially proposed in the extensive reviews given in [2,9], VRPs are classified into four categories, as presented in Table 1:
- Static and deterministic: All relevant data for the problem are available from the beginning and remain constant throughout the route execution. There are no unknown or uncertain elements, allowing for the design and execution of routes based on complete and fixed information. Figure 2 illustrates the planning and execution phases of a static and deterministic VRP. The planning phase (t_0) involves generating a feasible solution for the known problem inputs, such as customer requests. In the execution phase (t_{k_n}), the ongoing process of servicing customers unfolds. Although new inputs may arrive during this phase, the planned routes remain static. The process repeats at the termination phase (t_k), where a new cycle begins with planning and execution based on new inputs.
- Static and stochastic: Certain problem elements, such as requests, travel times, or demands, are only partially available and modelled with certain distribution functions. As the route executes, the actual values of these parameters become known. While routes are pre-designed, they can be adjusted as needed. For instance, the route has to be modified if a customer’s demand exceeds the forecasted demand and thus exceeds the available vehicle capacity. Figure 3 illustrates the static and stochastic VRP’s planning and execution phases. In the planning phase (t_0), a feasible solution is generated for known inputs like customer requests and associated demand values, which can be either deterministic or stochastic. For instance, request number three (3) features a stochastic demand. During the execution phase (t_{k_n}), as the vehicle services customers, unexpected issues may arise. At request number three (3), it is discovered that the initial estimate of the demand exceeds the vehicle’s capacity. Subsequently, a new feasible plan is devised, leading the vehicle to return to the depot, unload excess capacity, and then resume the route, including request number three (3). This cycle repeats in the termination phase (t_k), initiating a new planning and execution cycle based on updated inputs.
- Dynamic and deterministic: Some or all problem data are initially unavailable but gradually revealed during the planning and execution. There are no inherent uncertainties in the variables. For example, when a new dynamic event occurs, complete and fixed information about it becomes available at the time of the event. Figure 4 illustrates the planning and execution phases of a dynamic and deterministic VRP. The planning phase (t_0) involves generating a feasible solution to the known problem and the execution phase (t_{k_n}) allows for dynamic route adaptation to the changing conditions. New customer requests are revealed during execution and the DVRP accommodates their inclusion in the current routes, adhering to constraints. Dynamic adaptation aims to enhance fleet resource utilisation by promptly adjusting plans to recent information.
- Dynamic and stochastic: This problem category involves gradual information revelation and stochastic variables that add uncertainty. Decision makers continuously sample probable (future) scenarios and adjust service sequences and routes in response to changing conditions and probability distributions. Figure 5 illustrates the planning and execution phases of a dynamic and stochastic VRP. In the left frame, the initial planning phase generated a set of feasible, optimised routes without accounting for probable future events. In the middle frame, the a priori planning phase considers known information and anticipated inputs sampled from probability distributions. Including such data in planning and execution horizons leads to proactive actions that enhance operational efficiency, customer satisfaction, and fleet resource utilisation.
Understanding the characteristics of VRPs is crucial for effective route planning and optimisation. VRPs can be classified as static or dynamic based on whether the input data remains constant or changes over time. Deterministic VRPs assume all required input data is known during the route design phase. In contrast, stochastic VRPs involve stochasticity or variability in the input data, requiring probabilistic or random variables to represent specific components. Incorporating stochastic elements in VRPs enables anticipation and allows for robust, adaptive, and effective route planning and optimisation.

2.2. Conventional Solution Methods

Over the years, extensive research has been conducted on VRPs since their formal definition in [12]. For a comprehensive overview of the VRP taxonomy, one may seek the review presented in [13]. Traditionally, the majority of VRPs were considered static and deterministic. Researchers have developed various solutions using Mixed Integer Programs (MIPs) based on mathematical graphs to generate optimal or near-optimal solutions for small instances. However, VRPs exhibit non-polynomial complexity: the solution space expands significantly as the problem size increases, making it computationally challenging to find optimal solutions for larger problem sizes using exact methods alone. To address this challenge, researchers have incorporated heuristics and metaheuristics in their solution approaches for VRPs. These enhanced approaches enable the application of VRPs to real-world scenarios, including large problem instances.

2.2.1. Exact Algorithms

Exact algorithms solve VRPs by finding the provably optimal solution. For VRP instances with roughly 50 to 100 customers, problem solvers can employ various exact methods, including branch and bound [14,15], dynamic programming [16], and integer linear programming [17]. These methods systematically explore the solution space to determine the optimal routing configuration that minimises the objective function.
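For illustration, a common two-index formulation of the capacitated VRP, of the kind such exact methods operate on, can be sketched as follows; the notation (binary arc variables x_ij, travel costs c_ij, demands d_j, vehicle capacity Q, fleet size K, depot node 0) is ours and not taken from the reviewed papers.

```latex
\begin{aligned}
\min \quad & \sum_{i \in V}\sum_{j \in V} c_{ij}\, x_{ij} \\
\text{s.t.} \quad
& \sum_{i \in V} x_{ij} = 1 \quad \forall j \in V \setminus \{0\}, \qquad
  \sum_{j \in V} x_{ij} = 1 \quad \forall i \in V \setminus \{0\}, \\
& \sum_{j \in V \setminus \{0\}} x_{0j} = K, \qquad
  \sum_{i \in V \setminus \{0\}} x_{i0} = K, \\
& \sum_{i \notin S}\sum_{j \in S} x_{ij} \;\ge\; \Big\lceil \tfrac{1}{Q}\sum_{j \in S} d_j \Big\rceil
  \quad \forall\, S \subseteq V \setminus \{0\},\; S \neq \emptyset, \\
& x_{ij} \in \{0,1\} \quad \forall i,j \in V.
\end{aligned}
```

The exponentially large family of capacity-cut constraints in the last row is typically separated on demand within branch-and-bound or branch-and-cut schemes rather than enumerated upfront.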

2.2.2. Heuristic Algorithms

As optimisation algorithms, heuristics are employed to obtain satisfactory solutions within a reasonable timeframe. Heuristics leverage problem-specific knowledge and experience to guide the search for feasible solutions. They may involve trial and error, approximation, or problem simplification and do not guarantee finding the optimal solution. However, they can often find good solutions quickly and are widely used in practice due to their efficiency. There is a tradeoff between the runtime and solution quality of heuristics for the VRP. The quality of the final solution tends to improve the longer the heuristics are executed. Heuristics for the VRP can be divided into constructive and improvement heuristics [18].
Constructive heuristics for the VRP are algorithms that generate a solution by constructing routes through the set of customers. These heuristics use intuitive strategies, such as adding the nearest unvisited customer to the route until all customers are visited. Noteworthy constructive heuristics for the VRP include the Clarke and Wright savings algorithm [19], nearest and farthest insertion [20], and nearest neighbour [20] methods. These heuristics can be executed in serial or parallel mode, depending on the algorithm and available computational resources [21]. Selecting an appropriate solution construction approach is vital as it significantly impacts the search space, the likelihood of finding the optimal solution, and the overall running time.
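As a minimal illustration of the constructive principle, the following Python sketch builds capacitated routes with a nearest-neighbour rule; the distance matrix, demand list, and vehicle capacity are assumed inputs, and the code simplifies away time windows and other real-world constraints.

```python
def nearest_neighbour_routes(dist, demand, capacity, depot=0):
    """Greedily build routes: repeatedly extend the current route with the
    nearest unvisited customer that still fits within the vehicle capacity.
    Assumes every individual demand fits within the capacity."""
    unvisited = set(range(len(dist))) - {depot}
    routes = []
    while unvisited:
        route, load, current = [depot], 0, depot
        while True:
            candidates = [c for c in unvisited if load + demand[c] <= capacity]
            if not candidates:
                break
            nxt = min(candidates, key=lambda c: dist[current][c])
            route.append(nxt)
            load += demand[nxt]
            unvisited.remove(nxt)
            current = nxt
        route.append(depot)   # close the route at the depot
        routes.append(route)
    return routes
```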
Improvement heuristics are algorithms that explore the neighbourhood of the current solution to improve the VRP solution [22]. Within the VRP domain, they are often referred to as Local Search (LS) procedures. These heuristics iteratively modify the current solution by applying minor changes and evaluating the resulting improvement in the objective function. Notable operators used in local search heuristics include relocate and exchange operators [23], or-opt-k operators [24], and 2-opt operators [25]. These heuristics effectively enhance the solution quality by focusing on the immediate neighbourhood of the current solution. However, it is vital to consider their sensitivity to the initial solution and the specific search neighbourhood employed, as they may converge to local optima. When combined with other heuristics or metaheuristics, local search operators can improve the quality of VRP solutions [18].
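A minimal sketch of the 2-opt operator mentioned above, applied to a single route given as a closed tour starting and ending at the depot; symmetric distances and the first-improvement acceptance scheme are illustrative choices.

```python
def two_opt(route, dist):
    """Apply 2-opt moves until no improving segment reversal is found.
    A move replaces edges (i-1, i) and (j, j+1) with (i-1, j) and (i, j+1),
    i.e., reverses the segment route[i:j+1]."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(route) - 2):
            for j in range(i + 1, len(route) - 1):
                delta = (dist[route[i - 1]][route[j]] + dist[route[i]][route[j + 1]]
                         - dist[route[i - 1]][route[i]] - dist[route[j]][route[j + 1]])
                if delta < -1e-9:          # strictly improving move
                    route[i:j + 1] = reversed(route[i:j + 1])
                    improved = True
    return route
```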
Using multiple local search improvements in a single algorithm iteration for a VRP route can improve the quality of the solution and make the overall algorithm faster [26]. This approach requires fewer moves as the operators work together to find a better solution, potentially reducing the number of iterations needed to converge to a high-quality solution. The key is ensuring that the applied operators do not interfere with each other. One way to achieve this is to use algorithms that operate on different parts of the solution space, such as exchanging pairs of customers in the route or reordering sub-sequences of the route. By acting on different segments of the solution space, the operators are less likely to interfere with each other and can work together to find a better solution more efficiently.

2.2.3. Metaheuristic Algorithms

Metaheuristics are a broad class of optimisation algorithms that enhance the probability of escaping local optima by exploring the usually vast solution search space [21]. Local optima are solutions that may be optimal only within a limited region of the solution space. To reach the globally optimal solution, metaheuristics must explore the broader solution space and avoid getting trapped in local optima. This exploration process may involve accepting deteriorated objective function values and generating infeasible solutions, thereby diversifying the search space and increasing the likelihood of finding the global optimum [1]. Metaheuristics strike a balance between exploring the solution space and intensifying search efforts in specific regions, allowing them to search widely for new solutions while also improving solutions within targeted areas. Notable population-based metaheuristics for the VRP include the genetic algorithm [27], scatter search [28], ant colony optimisation [29], artificial bee colony [30], and particle swarm optimisation [31,32]. Furthermore, noteworthy neighbourhood-oriented metaheuristics for the VRP encompass simulated annealing [33,34,35], tabu search [36,37], variable neighbourhood search [38], iterated local search [39], and adaptive large neighbourhood search [40,41].
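The controlled acceptance of deteriorated objective values described above can be illustrated with the Metropolis criterion used in simulated annealing; the neighbourhood operator, cooling schedule, and parameter values below are illustrative placeholders rather than a recommended configuration.

```python
import math
import random

def simulated_annealing(initial, cost, neighbour, t0=100.0, alpha=0.95, iters=1000):
    """Generic simulated annealing skeleton: worse candidate solutions are
    accepted with probability exp(-delta / T), which shrinks as T cools."""
    current, best = initial, initial
    temperature = t0
    for _ in range(iters):
        candidate = neighbour(current)
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = candidate
            if cost(current) < cost(best):
                best = current
        temperature *= alpha   # geometric cooling schedule
    return best
```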

2.2.4. Hyper-Heuristic Algorithms

Hyper-heuristics are metaheuristics that provide adaptability to handle various problem instances by leveraging the observation that particular problem-solving approaches, i.e., heuristics, exhibit superior performance in specific cases. The domain encompasses two distinctive branches: selective and generative hyper-heuristics.
Through the dynamic selection of algorithms based on measurable instance features, selective hyper-heuristics enable the identification of the most suitable heuristic for each problem instance. This flexibility enhances the overall solution quality and effectiveness in VRPs by harnessing the strengths of different heuristics. Addressing the Algorithm Selection Problem (ASP) in dynamic VRP problem instances, the authors in [42] utilised a simulation-based supervised learning approach. They evaluated 72,000 instances using 432,000 simulations, comparing the performance of the Greedy and Replanning algorithms. The aim was to differentiate problem instances based on measurable features, after which the suggested framework would select the most suitable algorithm for each. The problem instances were distinguished via classification, regression models, and Artificial Neural Networks (ANN). The authors concluded that out of the two examined algorithms, the Greedy algorithm outperformed Replanning in particular problem instances, highlighting the need to leverage algorithms based on specific problem characteristics.
In another study [43], the authors defined algorithm building blocks and a measure of algorithm quality for generative hyper-heuristics, which employ automated processes to evolve heuristics tailored to specific problem instances. Unlike selective hyper-heuristics, which dynamically choose from pre-existing heuristics, generative hyper-heuristics explore algorithmic spaces, crafting novel heuristics optimised for unique routing challenges and demonstrating adaptability in dynamic and stochastic conditions [43]. Due to its flexible representation, one of the most prominent methods for automated heuristic generation is Genetic Programming (GP) [44]. One may find examples of approaches utilising GP in the domain of DVRPs and Electric Vehicle DVRPs in the works presented in [43,44,45,46].

3. Dynamic Vehicle Routing Problem

Due to factors like the rise of on-demand services, increased order volumes, shorter delivery times, and customer demands for quick and reliable service [1], the traditional approach of static and deterministic VRPs has proven inadequate in meeting the evolving expectations of contemporary consumers. These models struggle to effectively address real-world scenarios since they assume that all relevant data is available from the beginning and remains constant throughout the complete route execution.
To address the challenges posed by dynamic and stochastic environments, researchers have recognised the limitations of traditional VRPs and shifted their focus towards Dynamic VRPs (DVRPs). One notable work in this area is the study conducted in [11], which focused on the Dial-A-Ride-Problem (DARP). This pioneering research introduced the concept of incorporating a time dimension into VRPs, emphasising the importance of adapting to changing conditions and integrating new information during the execution of routes. By dynamically allocating and reassigning customers to vehicles and optimising visit sequences based on evolving information, decision makers can effectively navigate dynamic and stochastic environments.

3.1. Degree of Dynamism

In the field of DVRPs, it is crucial to understand the concept of dynamism. As proposed in [47], dynamism encompasses two primary dimensions: the frequency of changes and the urgency of requests. The former refers to how often new information becomes available, such as adding or modifying customer requests. For instance, in a VRP for a parcel delivery service in an urban environment, the frequency of changes represents the dynamic nature of customer orders throughout the day, where requests may constantly evolve.
On the other hand, the urgency of requests observes the duration between the revelation and the anticipated service time. This dimension reflects the immediacy or time sensitivity of fulfilling customer requests. For instance, in a VRP for an emergency medical service, the urgency of requests depicts the criticality of medical emergencies. Requests with shorter time gaps between disclosure and service time require immediate attention and prioritisation in route planning.
Considering both the frequency of changes and the urgency of requests is essential for researchers to thoroughly understand the dynamic nature of VRP instances. This understanding plays a crucial role in developing practical algorithms and strategies that effectively address the challenges posed by dynamic scenarios. Additionally, as highlighted in [2], the frequency of updates in problem information significantly impacts the available time for optimisation. As the system’s dynamism increases, generating a quick response becomes more difficult or costly [48]. Therefore, timely updates and efficient optimisation techniques are vital to adapt to the changing information.
Introducing the concept of dynamism in DVRPs, the authors in [49] proposed a metric called the Degree of Dynamism (DOD). The DOD is the ratio between dynamic customer demands and the sum of dynamic and static customer demands. The DOD provides insights into the proportion of dynamic requests to all requests, ranging from 0 to 1. A DOD value of 0 indicates a scenario where all necessary information is available before planning. In contrast, a value of 1 represents a situation where all the required data is unknown before planning.
Drawing upon the earlier research conducted in [49], the authors in [48] introduced the concept of the Effective Degree of Dynamism (EDOD). The EDOD represents the average request revelation time compared to the latest allowable time for their receipt (end of planning horizon). The EDOD serves as an indicator of system dynamism, with a value of 0 denoting a purely static system and a value of 1 representing a fully dynamic system. The authors further expanded the EDOD framework to incorporate instances with time windows, effectively accounting for the level of urgency of requests.
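With n_d dynamic and n_s static requests, request disclosure times t_i, and a planning horizon of length T, the two metrics are commonly written as follows (the notation is ours):

```latex
\mathrm{DOD} = \frac{n_d}{n_d + n_s},
\qquad
\mathrm{EDOD} = \frac{1}{n_d + n_s} \sum_{i=1}^{n_d} \frac{t_i}{T}.
```

Since static requests are known at time zero, the EDOD grows both with the share of dynamic requests and with how late in the horizon they are revealed.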
When decision makers must provide services within specific time windows, there is less flexibility to accommodate incoming requests. The reaction time, defined in [48] as the period between the request reception and the latest possible time to start the service, plays a crucial role. A longer reaction time provides more opportunity to incorporate the request into existing routes, which improves the ability to handle time-sensitive requests. Aiming to enhance the EDOD framework, the authors in [48] considered the relationship between reaction time and the remaining planning horizon, allowing for a more precise assessment of the urgency level associated with requests.
The enhanced framework categorised dynamic systems into three distinct categories: weakly, moderately, and strongly dynamic, corresponding to EDOD values lower than 0.3, between 0.3 and 0.8, and higher than 0.8, respectively [48]. Weakly dynamic systems demonstrate infrequent occurrence of immediate requests. Moderately dynamic systems exhibit a significant proportion of immediate requests in relation to the overall number of service requests. Strongly dynamic systems encompass rapid changes in the data and a high level of urgency associated with nearly all received requests.
The EDOD metric considers both the release and reaction times of dynamic requests, which measures the urgency and adaptability required to fulfil these requests. However, as highlighted in [2], it is essential to note that while the EDOD and its variations effectively capture the time-related aspects of dynamism, they may not encompass other potential sources of dynamism, such as geographical distribution or travel times between requests. These factors can significantly influence response time and optimisation efforts.

3.2. DVRP Taxonomy

Extensive research has been conducted on DVRPs since their formal definition, leading to significant advancements in understanding and addressing the challenges within this field. Notably, several comprehensive literature reviews have contributed to the collective knowledge in this area, including reviews presented in [2,3,4,5]. These reviews have synthesised existing research, identified key trends, and shed light on the complexities of DVRPs.
Within DVRPs, researchers have proposed various taxonomy frameworks to categorise and comprehend the various dimensions of the problem effectively. This paper will specifically focus on the taxonomy initially proposed in [4] and further extended in [5]. This taxonomy provides a comprehensive and structured framework for classifying DVRPs, allowing for a deeper understanding of the diverse aspects and characteristics inherent to these problems. This section will briefly outline the essential segments of the problem features of the proposed taxonomy. For a more thorough assessment of the taxonomy illustrated in Figure 6 and its related content, one may seek the literature review in [5].

3.2.1. Fleet Size

Various factors, including problem complexity, practical considerations, and research focus, influence whether to address single-vehicle or multiple-vehicle issues in DVRPs. Single-vehicle DVRPs, known as Dynamic Travelling Salesman Problems (DTSPs), optimise the routes and sequences of a single vehicle to meet customer demands. These issues are usually less complex than multiple-vehicle DVRPs, which require the coordination of multiple vehicles. One may choose the former alternative to establish basic principles and solution strategies (e.g., [50]) before tackling the complexity of multiple-vehicle scenarios (e.g., [51,52]). Choosing a single-vehicle or multiple-vehicle DVRP in practical scenarios is impacted by fleet size, operational demands, and available resources. If one vehicle is sufficient to complete the task, such as in small deliveries, then a single-vehicle DVRP is appropriate. However, for large instances, such as urban logistics requiring a fleet of vehicles, multiple-vehicle DVRPs are more suitable.

3.2.2. Time Constraints

In DVRPs, time constraints are fundamentally categorised as either hard or soft. Hard constraints must be strictly satisfied, while soft constraints allow for temporal flexibility. This flexibility accompanies penalisations in the objective function in cases of time window violations [5]. For instance, a hard time window means the vehicle must arrive at the customer’s location within the specified time window, and any deviation from it is not allowed. In contrast, a soft time window allows for temporal flexibility. In this case, the objective function is penalised to ensure the overall minimisation of delays (i.e., maximise customer satisfaction) if the vehicle arrives outside the specified time window.
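A minimal sketch of how a soft time window can enter the objective, assuming a simple linear lateness penalty; the penalty rate and the convention that early vehicles wait at no cost are illustrative assumptions.

```python
def soft_time_window_cost(arrival, earliest, latest, rate=1.0):
    """Linear penalty for violating a soft time window.
    Early arrivals are assumed to wait until the window opens; only lateness
    relative to the window's end is penalised."""
    service_start = max(arrival, earliest)        # waiting until the window opens
    lateness = max(0.0, service_start - latest)   # positive only if the window is missed
    return rate * lateness
```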
In scenarios where new customer requests continuously arrive and there is a limited number of vehicles, coupled with hard time windows, it may only be possible to accommodate all customers by violating time windows [4]. Therefore, instances with soft time windows are more realistic and practical in the evolving dynamic environment. In addition to hard and soft constraints, the authors in [5] observed that researchers tend to exclude time constraints altogether, thus simplifying the problem and concentrating on specific aspects of the instance.
Researchers may opt to include both hard and soft time constraints. Addressing a variant of the field service routing problem, the authors in [53] considered two types of customers: mandatory (hard time window) and optional (soft time window). Hard time window requests had to be served within their designated time frames, while soft time window requests could either be served or not, i.e., their service priority was secondary to mandatory requests. The objective of their study was to maximise the number of optional customers visited while minimising travel time.

3.2.3. Vehicle Capacity Constraints

The capacity constraint is vital in accurately modelling real-world DVRPs. Two main options are available: uncapacitated instances, where there is no limit on the vehicle’s capacity, and capacitated instances, which impose capacity constraints. Problems in which the transported entities are much smaller than the vehicle’s capacity often employ uncapacitated instances (e.g., courier services, [54]). In these cases, the capacity can be treated as infinite, resulting in a simplified problem formulation and analysis.
On the other hand, there is a large body of instances which necessitate vehicle capacity constraints to ensure that the vehicle load remains within its physical limitations, for instance, in waste collection problems [55], delivery services [56], or ride-sharing systems [57].

3.2.4. The Ability to Reject Customers

The distinguishing characteristic of DVRPs is their ability to accommodate customer rejection, a feature generally absent in classical VRPs. In traditional VRP instances, given their static and deterministic nature, it was conceivable to generate feasible solutions for the complete request pool [58]. While the majority of classic VRPs exclude the notion of rejection, there is a body of research that incorporates such actions. For instance, the work presented in [59] explored the VRP with zone-based pricing and the capacity to reject services and proposed a branch-and-price algorithm.
However, DVRPs introduced a dynamic environment where the customer requests continuously unfold, accompanied by their associated constraints (e.g., time or capacity). This realisation prompted decision makers to acknowledge that not all instances may have a feasible solution that serves every customer. Consequently, the ability to reject customer requests is a vital component of DVRPs. The underlying principle guiding this inclusion is the recognition that serving all customers may not be possible. Furthermore, via future stochastic information evaluation (SDVRP), long-term benefits can be attained by strategically rejecting requests. In contrast to prioritising immediate rewards via myopic decision making, it becomes conceivable to anticipate that rejecting a particular customer could lead to greater rewards in the future [60].

3.2.5. Sources of Dynamism

In the articles reviewed in this paper, the online arrival of customer requests is identified as the most common source of dynamism. This is the case for articles addressing both the deterministic (e.g., [11,61,62]) and the stochastic (e.g., [57,63,64]) instances of the DVRP. One may link the prominence of customer requests as a primary source of dynamism to the fundamental principle of dynamic routing. Dynamic routing harnesses the flexibility to redirect vehicles towards nearby requests as they arise, enabling real-time accommodation of new customer requests. This dynamic adaptation of the routing plan to include nearby requests leads to improved overall customer service levels and substantial cost savings for logistics operations [2]. Alongside customer requests, this review has identified several other sources of dynamism in DVRPs: customer demands (e.g., [65]), travel times (e.g., [66,67,68]), and service times (e.g., [53]). It is noteworthy that some authors utilise several sources of dynamism simultaneously.
This review has identified a common approach among articles explicitly addressing the modelling of real-world dynamism: Poisson processes and distributions. This distribution describes the occurrence of events in continuous time. In the scope of the reviewed literature, it was found that authors generally utilised the technique to represent random arrivals or events unfolding over time. For example, the work presented in [69] employed a proactive deployment strategy, considering the likelihood of future calls in different service area regions. The arrival of customer requests followed a Poisson process, with customers placing calls independently and distributing pick-up and drop-off locations uniformly. A proactive real-time routing approach proposed in [70] integrated dummy customer requests with different service times and window values. The historical request arrivals, modelled as a time–space Poisson process, generated stochastic knowledge about expected future requests. The approach removed the dummy requests when their appearance in the assigned area became highly unlikely. Considering the arrival of requests as a Poisson process, the authors in [71] introduced optimal social tolling for queues in a multi-server queue approach for DVRPs. One may find further examples utilising Poisson processes to model request arrivals and develop strategies to address DVRP challenges in the works presented in [52,64,72].
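The Poisson modelling of request arrivals described above can be sketched as follows: inter-arrival times are drawn from an exponential distribution with a given rate, and pick-up locations are placed uniformly in a unit-square service area; both the rate value and the spatial assumption are illustrative.

```python
import random

def simulate_poisson_requests(rate_per_hour, horizon_hours, seed=42):
    """Generate request arrival times as a homogeneous Poisson process by
    sampling exponential inter-arrival times until the planning horizon ends."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_hour)   # exponential inter-arrival time
        if t > horizon_hours:
            break
        # Illustrative uniform pick-up location in a unit-square service area.
        arrivals.append({"time": t, "x": rng.random(), "y": rng.random()})
    return arrivals
```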

3.2.6. Sources of Stochasticity

The observed sources of stochasticity are analogous to the sources of dynamism. It is noteworthy to mention that some authors utilise several sources of stochasticity, but customer requests are still a prominent factor (e.g., [56,63,73,74]). One approach to leverage the stochastic nature of customer requests involves utilising advanced techniques in taxi services. For instance, the authors in [75] proposed a distributed optimised framework called DeepPool, employing deep reinforcement learning to optimise vehicle dispatching policies. Their framework utilised historical taxi trip records in New York City to train deep neural networks. Similarly, a deep reinforcement learning framework for taxi dispatching was developed in [76]. The framework utilised policy-based algorithms to optimise rebalancing strategies. The authors approximated policy and value functions using neural networks to estimate optimal dispatch strategies and expected costs. Additionally, the work presented in [57] integrated optimisation, machine learning, and model predictive control to enhance real-time dispatching in ride-sharing systems. Their algorithm utilised predictive control optimisation and machine learning models to predict zone-to-zone demand and relocate idle vehicles. These studies exemplify how historical data and advanced techniques can anticipate customer requests and optimise operations in taxi services, reducing waiting times and enhancing efficiency in meeting customer demands. Alongside customer requests, we have identified dynamic customer demands (e.g., [77]), travel times (e.g., [67,78]), service times (e.g., [79]), vehicle speeds (e.g., [80]), and vehicle availability (e.g., [68]) as present sources of stochasticity in DVRPs.

3.2.7. Objective Functions

Objective functions in DVRPs aim to optimise various measures, often combining multiple criteria. For instance, the authors in [76] proposed a multi-period game-theoretic model to minimise total ride-sharing costs and address imbalanced spatiotemporal distributions in ride-sharing systems. Focusing on minimising average waiting times in ride-sharing services, the authors in [57] utilised machine-learning models for demand prediction and predictive control optimisation for vehicle relocation. Employing a greedy randomised adaptive search approach, the authors in [81] aimed to maximise customer requests while minimising costs in the dynamic ride-sharing problem for taxis. By matching drivers with high-price orders, the approach in [82] sought to maximise accumulated driver income in multi-agent ride-sharing systems. In the context of bike-sharing systems, the approach in [63] aimed to maximise the quality of service via dynamic lookahead policies and value function approximation. The objective was to avoid unsatisfied demands by dynamically relocating bikes throughout the day. Focusing on courier food delivery, the authors in [74] considered objectives such as maximising total earliness, minimising waiting time, and optimising ready-to-delivery time. In emergency medical services, the authors in [83] aimed to efficiently redeploy idle ambulances to maximise the number of calls served within a specified delay threshold. The approach in [84] employed value function approximation in the dynamic ambulance dispatching and relocation problem to minimise response time. The method achieved a significant 12.89% reduction in the average response time.
The costs involved at the operational level of DVRPs primarily consist of travel time and distance. Rural areas face significant cost factors due to travel distance, whereas urban environments prioritise travel time [85]. In specific applications such as emergency services, operational costs become secondary to reliability, which includes fast service, catering to a large customer base, adhering to time windows, and ensuring high customer satisfaction. Furthermore, in e-commerce, timely and dependable deliveries are critical for customer satisfaction [85]. Overall, objective functions in DVRPs encompass a range of measures and criteria, considering operational costs and customer satisfaction factors like reliability and punctuality.

3.3. DDVRP Solution Methods

Deterministic Dynamic Vehicle Routing Problems (DDVRPs) gradually reveal critical information as the planning horizon progresses. This rate of information evolution challenges exact methods that can only provide optimal solutions based on the current problem state [2]. Therefore, dynamic approaches primarily rely on heuristic methods, which promptly compute solutions based on the current state of the problem.
ReOptimisation (RO) is a commonly used methodology in DDVRPs, where the routing plan is determined based on available information at decision points. Decision points are specific moments in the problem-solving process that require the solver to make decisions. Decision points can occur due to changes in the available data (event-driven) or at fixed intervals [2]. Due to the constant evolution of the DVRP problem settings, mitigating the number of unfulfilled dynamic customer requests necessitates a prompt and continuous assessment of the problem. Infrequent reoptimisations may lead to delayed responses to dynamic customers, decreasing the total number of served customers [86].
In the context of DDVRPs, reoptimisation methods frequently adopt a myopic perspective, prioritising the optimisation of the current problem state rather than anticipating future events, solely focusing on immediate responses and rewards. Myopic approaches, also known as greedy approaches, do not consider or evaluate the impact of current actions on future states of the problem. Despite their lack of anticipation, these approaches offer a significant advantage by utilising well-established heuristic and metaheuristic methods to address a static and deterministic problem instance at each decision point. By leveraging these established techniques, robust and effective solution strategies can be applied at each decision point, enhancing the overall effectiveness of the solution process.
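A minimal event-driven reoptimisation skeleton, in the spirit of the decision-point mechanism described above; the solve_static_vrp routine, the event stream, and the state-handling helpers are hypothetical placeholders rather than an implementation from the reviewed works.

```python
def run_reoptimisation(event_stream, solve_static_vrp, initial_state):
    """Event-driven reoptimisation skeleton: at each decision point (e.g., a
    newly revealed request or a vehicle status update) the current, now static,
    snapshot of the problem is re-solved and the routing plan is updated."""
    state = initial_state
    plan = solve_static_vrp(state)          # initial plan for the known requests
    for event in event_stream:              # decision points arrive over time
        state = state.apply(event)          # reveal the dynamic information
        plan = solve_static_vrp(state)      # myopic re-solve of the current snapshot
        state = state.with_plan(plan)       # commit the updated plan
    return plan
```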
Reoptimisation approaches in DDVRPs can be classified as periodic or continuous. Periodic reoptimisation involves updating routing solutions at predetermined intervals or defined events, while continuous reoptimisation enables ongoing, real-time adjustments based on new information and changing conditions. Strategies that mimic intuitive route planning, such as emulating a dispatcher’s decision-making process, are commonly used in DDVRPs. These strategies include waiting, relocation, and buffering [2]. Although Policy Function Approximations (PFAs) are typically associated with stochastic DVRPs, these strategies employed in DDVRPs align with the narrative of PFAs. While not directly evaluating actions based on their future value, they focus on selecting actions likely to result in flexible states and yield better future values [87]. Therefore, PFA methods represent an initial step away from myopic route selection towards considering potential future stochasticities [88]. These methods involve decision rules that aim to emulate intuitive route plans.

3.3.1. Reoptimisation Approaches

Periodic reoptimisation approaches in DDVRPs involve updating the routing solution at predetermined intervals or event-driven decision points. These approaches offer the advantage of employing algorithms developed for static routing, with the main drawback being the increased delay for the dispatcher due to the need to perform a complete optimisation before updating the routing plan [2]. To efficiently route and schedule a fleet of supply vessels, the authors in [89] employed periodic reoptimisation. The reoptimisation occurred whenever new information became available, such as new orders. The authors in [81] applied periodic reoptimisation to the dynamic ride-sharing problem for taxis. The approach triggered the reoptimisation at predefined periods in the day. In tackling the routing of refuelling trucks at an airport, the work presented in [90] employed periodic reoptimisation whenever new information about the problem emerged. The authors in [11] utilised periodic reoptimisation in their study on the DARP. The method updated and reoptimised the route whenever a new request arrived.
In contrast to temporal or event-driven reoptimisation approaches, continuous reoptimisation approaches constantly evaluate and optimise their solution pool with respect to dynamically revealed information. The constant adaptations and adjustments of the solution pool enable these approaches to accommodate the realised dynamic event via reoptimisation techniques. Reoptimisation is employed on the solutions deemed fittest for the arrived event.
Various algorithms have been employed in the literature to ensure real-time adaptation. Addressing an express courier service with dynamic pickup and delivery requests, traffic congestion, and vehicle disturbances, the work presented in [91] employed continuous reoptimisation by proposing a real-time control approach that continuously adapted the route plan while simultaneously executing it. This approach efficiently handled newly arrived requests, traffic congestion, and vehicle disturbances. The authors executed plan adaptations via tabu search, which dynamically changed neighbourhood operators based on the quality of recently explored solutions. To optimise planned routes of vehicles in real time, the authors in [54] introduced neighbourhood search heuristics. Their continuous reoptimisation approach allowed for rejecting incoming requests that could not be serviced within time window constraints or without violating vehicle capacity constraints. The process explored new solutions and continuously adapted the routing plan as new requests arrived. The authors in [92] employed continuous reoptimisation in their work on dynamic assignments of customer requests. The strategy involved continuously reoptimising the route plan by considering fixed vehicle locations as origins and allowing for vehicle diversion from their current destinations. Addressing DDVRPs, the authors in [93] used continuous reoptimisation in their tabu search procedure. The approach maintained an adaptive memory storing a pool of fit solutions and employed it to generate initial solutions for a parallel tabu search. Whenever a new customer request arrived, the solution in the adaptive memory was evaluated to decide whether to accept or reject the request, dynamically adjusting the routing plan. Their continuous reoptimisation approach focused on minimising costs and dispatching customer requests with soft time windows in real time. One may find additional reoptimisation approaches in the domain of DDVRPs in the works presented in [61,62,94,95,96].
Two vital questions regarding reoptimisation have been raised and tackled in the literature. Foremost, the authors in [97] tackled the question of when replanning should be triggered, as real-time information allows for replanning at any point. The paper analysed the impact of three primary triggers for replanning: exogenous customer requests (replanning triggered after a new customer request), endogenous vehicle statuses (replanning started after a vehicle finished serving a customer), and fixed interval replanning. The results indicate that fixed interval triggers are inferior. Given the circumstances, one should favour either the exogenous or endogenous triggers. The former is advantageous in settings with widely spread customers, long travel durations, and few dynamic requests. For instance, if a new customer emerges geographically between the current and subsequent customer, it would be beneficial to serve that customer before the subsequent one. On the other hand, the latter performs best in instances with many dynamic requests and clustered customers.
Furthermore, the danger of redirecting/reoptimising too often is addressed in [70]. Frequent diversions can lead to visual and cognitive distractions for drivers, potentially contributing to accidents. To mitigate this risk, the paper proposes a general approach to control the number of diversions by implementing a penalty cost-based threshold that assesses the significance of potential improvements. The study also classifies diversions based on their impact on drivers and introduces two driver profiles. These findings contribute to understanding the optimal triggers for replanning in DVRPs and highlight the importance of managing diversions to maintain driver safety and efficiency.

3.3.2. Strategies Emulating Effective Decision Making

Policies are rules that specify the control to apply at each possible state that can occur [98]. These approaches aim to select actions that lead to flexible states and better future values, even though they do not explicitly evaluate actions based on their future value. Thus, moving beyond deterministic frameworks, these approaches pave the way for addressing SDVRPs.
In the context of the waiting strategy addressed in [99], the authors maximised the probability of feasibly inserting dynamic requests into pre-determined routes. Their approach involved distributing the available waiting time over the locations of static customers to enhance the likelihood of accommodating real-time requests. The authors demonstrated the advantages of appropriate waiting strategies. The proposed strategy reduced the probability of unsuccessful customer servicing and average detour length. For the buffering strategy presented in [87], the authors proposed a request buffering approach in the context of the Dynamic Pickup and Delivery Problem with Time Windows (DPDPTW). Their strategy involved postponing the assignment of non-urgent requests to vehicles during the route planning phase, allowing for more efficient handling of urgent requests. The results demonstrated improvements in terms of lost requests compared to conventional approaches. This buffering strategy enabled better adaptation to dynamic scenarios and enhanced the performance in handling requests. Regarding the relocation strategy presented in [100], the authors focused on the redeployment problem for a fleet of ambulances in the context of emergency medical services. The approach aimed to maximise the expected coverage by relocating idle vehicles to appropriate locations. Employing a parallel tabu search heuristic, the authors developed a dynamic ambulance dispatching and redeployment system, improving responsiveness to calls and reducing the cost of redeployment.

4. SDVRP Solution Methods

A distinctive feature of SDVRPs manifests in the presence of dynamism and stochastic information. Unlike myopic approaches prioritising immediate responses and rewards, effective SDVRP solution methods utilise historical and dynamically revealed information to anticipate future stochastic information and guide decision making. Decision makers continuously sample future scenarios, adapting service sequences and routes to changing conditions and probability distributions.
In their classification of SDVRP anticipation methods, the authors in [85] distinguished between reactive and non-reactive methods. Reactive methods use Bellman’s Equation [101] to approximate state or state-action pair values within Markov Decision Processes (MDPs). These methods provide accurate modelling of the environment and stochasticity but can be complex due to their reliance on a model. Non-reactive methods, on the other hand, are model-free and often involve simulating future scenarios and sampling realisations to make decisions.
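For reference, Bellman’s recursion for the state-value function of a discounted MDP, in standard notation with transition probabilities P, rewards R, and discount factor γ, reads:

```latex
V(s) = \max_{a \in \mathcal{A}(s)} \Big\{ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big\}.
```

Reactive methods approximate V (or the corresponding state-action values) because, for realistic SDVRPs, the state space is far too large to evaluate this recursion exactly.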
Another differentiation is between offline and online anticipation methods. Offline approaches compute and evaluate values before encountering them, relying on a closed-form model for manageable approximation. In contrast, online methods compute values on demand during execution without an explicit model. At the deployment of the solution approach, offline methods have an initial advantage due to their pre-learning process. However, given an infinite time horizon, both methods are known to converge to an optimum.

4.1. SDVRP Policies

This section portrays reactive methods as models governed by decision-making strategies, known as policies. Policies specify which action to take at each possible state that can occur [98]. Four broad policy categories are distinguished, as described in [102]: Myopic and Lookahead policies, Value Function Approximations (VFAs), and Policy Function Approximations (PFAs).
Myopic policies focus on optimising immediate costs or rewards, considering only the current state of the problem. They do not evaluate the impact of present actions on future states and are known for generating fast, greedy solutions.
Lookahead policies recognise the benefit of evaluating actions’ impact on future states. However, they also acknowledge the complexity of assessing the entire problem’s time horizon. As such, lookahead policies partition the problem into manageable segments, such as temporal segments (see Figure 7). They approximate and optimise a certain number K of steps into the future but return an action only for the current time-step state. These policies incrementally advance the planning horizon and are often referred to as Rolling Horizon (RH) procedures. The main downside of RH procedures is that they perform all simulations online. Consequently, the available (response) time to generate solutions restricts solution quality, a limitation that becomes more pronounced as the solution space grows.
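For illustration only, the following Python sketch applies the rolling-horizon idea to a toy Euclidean instance: candidate next visits are scored by averaging a cheap nearest-neighbour routing cost over a few sampled short-term scenarios, and only the first move is executed before the horizon rolls forward. The cost model, the uniform request sampling, and all function names are assumptions made for exposition and do not correspond to any specific method cited above.

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_route_length(start, customers):
    """Length of a nearest-neighbour route from `start` through all customers."""
    pos, remaining, total = start, list(customers), 0.0
    while remaining:
        nxt = min(remaining, key=lambda c: dist(pos, c))
        total += dist(pos, nxt)
        remaining.remove(nxt)
        pos = nxt
    return total

def rolling_horizon_next_customer(vehicle_pos, pending, num_scenarios=20, future_per_scenario=3):
    """Evaluate each candidate next visit over a short sampled horizon and return
    the candidate with the lowest average routing cost; only this first move is
    executed before the horizon rolls forward."""
    best, best_cost = None, float("inf")
    for candidate in pending:
        cost = 0.0
        for _ in range(num_scenarios):
            # Sample a handful of hypothetical future requests in the unit square.
            future = [(random.random(), random.random()) for _ in range(future_per_scenario)]
            rest = [c for c in pending if c != candidate] + future
            cost += dist(vehicle_pos, candidate) + greedy_route_length(candidate, rest)
        cost /= num_scenarios
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best

if __name__ == "__main__":
    random.seed(0)
    pending = [(random.random(), random.random()) for _ in range(5)]
    print("Next visit:", rolling_horizon_next_customer((0.5, 0.5), pending))
```

In practice, the uniform sampling and the embedded routing heuristic would be replaced by problem-specific components, and the available response time would cap the number of scenarios evaluated per decision point.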
The following VFA and PFA policies will be discussed in detail in Section 4.5.1 and Section 4.5.2, respectively. VFAs implicitly learn policies by deriving them from learned value functions. The value functions estimate either the sum of (discounted) rewards associated with a given state-action pair, known as the Q-value, or the long-term value of a state, known as the state-value function. The (discounted) rewards denote the recursive sum of rewards subject to the agent starting from a given state or taking a given action in a state and following its policy onwards. VFAs construct policies by assigning expected long-term rewards to each state or state-action pair, enabling the agent to choose actions that yield the best predicted cumulative results when encountering a state.
In contrast, PFA interprets a policy as a probability distribution over the action space, allowing the agent to sample actions from this distribution to make a decision. PFAs directly learn the policy without the need to learn a value function. This approach mimics effective decision-making and returns actions based on the current state.

4.2. Markov Decision Process

Markov Decision Processes (MDPs) are mathematical models that capture sequential stochastic decision making in dynamic environments. They form the foundation for more elaborate approaches like Reinforcement Learning (RL) and Approximate Dynamic Programming (ADP) [103].
As illustrated in Figure 8, an MDP consists of two key components: the agent and the environment. The agent’s primary roles are learning and making decisions, which involve interacting with the environment by taking actions. The environment encompasses all external factors with which the agent interacts. Each interaction between the agent and the environment results in an immediate response, referred to as a reward, and leads to a transition to the next state. In such a system, the agent’s objective is to sequentially select actions that maximise cumulative rewards while considering the environmental stochasticity [104]. The outcome of solving an MDP is a policy.
In the context of SDVRP, the MDP framework described in this section serves as the foundational basis for subsequent solution methods, and, as such, it may not align with a canonical definition. The fundamental SDVRP MDP model elements are:
Decision points, denoted as K = { 1 , , k } , encompass discrete (finite) stages in the process time horizon, i.e., time steps. The MDP process consists of an iterative interaction loop of subsequent decision points. The agent evaluates the state variable at each decision point k K and selects an appropriate action [102]. The final decision point contains termination criteria, usually modelled not to include further rewards.
State space, denoted as S = { s 1 , , s k } , encompasses all possible states of the system. Each state s k captures relevant information about the system’s current configuration. In the context of SDVRPs, the state space may encapsulate exogenous (e.g., customer-related information as demands and locations and environment-related information as traffic and parking conditions) and endogenous (e.g., fleet-related information as vehicle locations and statuses, fleet capacity, and current routing plans) information.
Action space, denoted as X = { x 1 , , x k } , encompasses the range of potential actions available to the agent within a given state. Actions trigger the transition from the present state to another within the set S. The agent may select actions using a policy, x k π ( s k ) , or it may choose a random action. The action at a decision point may be to decide whether to accept or reject a customer or whether and where to relocate an idle vehicle. Furthermore, actions can involve explicit modifications to the routing process, for instance, reoptimising routes with inter- and intra-route improvement heuristics.
Reward function r ( s k , x k ) is the immediate outcome of taking action x k in state s k . Rewards define the goal of the process and help shape an idea of which actions are beneficial and which are not. If a specific reward is positive and substantial, this may indicate that taking action x in state s is more valuable for improving the objective. Comprising positive and negative values, rewards at each decision point can serve as feedback to refine and guide the agent. If an action taken by a policy yields low or negative rewards, then the process may alter the policy to select some other action in that situation in the future.
Stochastic information w k + 1 . After the agent executes an action x k in a given state s k , the stochastic information, w k + 1 , gets revealed. This information was not available to the agent at time step k. It encompasses exogenous and endogenous information, such as changes in traffic conditions, vehicle statuses, or new customer request arrivals.
Transition function s M ( s k , x k , w k + 1 ) quantifies the probability of an agent moving from one state to another, i.e., it describes how the state evolves from one decision point to another. With the input parameters of the current state s k , action taken x k , and stochastic information revealed w k + 1 , the transition function determines the subsequent state s k + 1 . This transition probability, expressed as P ( s k + 1 | s k , x k , w k + 1 ) , indicates the probability of transitioning to s k + 1 , a post-decision state, given the current knowledge at k.
Policy Π defines how the agent should behave in a given situation. The policy represents a distribution that guides decision making by mapping states of the environment to actions to perform in those states. A policy π is a rule that specifies the action to take in a given state. As stated in [85], a strategy π ∈ Π consists of a series of decision rules, one for each decision point k ∈ K, each denoted as x k π ( s k ). The rules determine the action x k π to be performed by the agent when the system is in state s k .
Objective function in SDVRP defines the system’s goal. It implicitly evaluates the agent’s performance at a given point in time, aiming to maximise a cumulative long-term contribution function (e.g., total profit or the number of served customers) or to minimise a cost function by selecting appropriate state-action pairs throughout the time horizon. The agent seeks to optimise the given policy to maximise future rewards, i.e., to achieve the best outcome considering the set goal. The guiding objective is finding an optimal decision policy π * .
Figure 9 illustrates the sequential decision-making process within an SDVRP MDP. At each time step k, the agent is in a specific state s k . The agent then chooses an action based on its policy π ( s k ) . This policy reflects the agent’s best estimation of the optimal action considering its current understanding of the system, known as On-Policy exploitation. However, the agent may also opt for Off-Policy exploration by taking a random action to explore the action space and potentially discover better actions. Following the agent’s action, the environment immediately provides a reward r k based on the reward function, which can be used to adjust and improve the agent’s policy. After the action, the environment transitions to a post-decision state, guided by the transition function s M ( s k , x k , w k + 1 ) . This function considers the state at time step k, s k , the taken action, x k , and revealed dynamic, stochastic information w k + 1 not previously known to the agent. Via these inputs, the transition function probabilistically determines the next state s k + 1 . This process continues until the agent encounters a termination condition. It is important to note that the dotted lines represent elements that may not be present in all SDVRP MDP instances and will be explored further in Section 4.3.
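To make the interaction loop concrete, the following minimal Python sketch mimics the elements above with a toy state, a toy transition function, and a random placeholder policy. The reward of one unit per served customer, the uniform request arrivals, and all identifiers are illustrative assumptions rather than a canonical SDVRP formulation.

```python
import random
from dataclasses import dataclass

@dataclass
class State:
    time_step: int
    vehicle_pos: tuple
    pending_requests: list   # customer coordinates awaiting service

def feasible_actions(state):
    # Toy action set X(s_k): serve one pending request, or wait in place.
    return state.pending_requests + ["wait"]

def transition(state, action, rng):
    """Toy stand-in for the transition function s^M(s_k, x_k, w_{k+1}): the chosen
    request (if any) is served, and the stochastic information w_{k+1} reveals
    between zero and two new requests."""
    pending = list(state.pending_requests)
    pos, reward = state.vehicle_pos, 0.0
    if action != "wait":
        pending.remove(action)
        pos, reward = action, 1.0            # e.g., one additional served customer
    new_requests = [(rng.random(), rng.random()) for _ in range(rng.randint(0, 2))]
    return State(state.time_step + 1, pos, pending + new_requests), reward

def random_policy(state, rng):
    # Placeholder policy pi(s_k); a learned policy would go here.
    return rng.choice(feasible_actions(state))

rng = random.Random(0)
state, total_reward = State(0, (0.5, 0.5), [(0.2, 0.8)]), 0.0
for k in range(10):                          # decision points k = 0..9
    action = random_policy(state, rng)       # x_k selected by the policy
    state, reward = transition(state, action, rng)
    total_reward += reward                   # cumulative reward the agent seeks to maximise
print("Served customers:", total_reward)
```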
For a comprehensive, step-by-step guide on successfully modelling SDVRP as an MDP, one may refer to the work presented in [85]. Furthermore, in their literature review on SDVRP modelling techniques, the authors in [88] identified a gap in the representation of route-based MDPs. Due to the dimensional complexities inherent in such models, researchers opt to solve the assignment-based SDVRPs, rather than the route-based SDVRPs. Therefore, the authors proposed an MDP modelling framework incorporating traditional routing plans. This framework goes beyond standard actions, such as solely selecting the next customer to visit, by additionally considering the remaining route to be assigned by the agent to a vehicle. As such, at the expense of increased complexity, the framework offers a more comprehensive and accurate representation of the SDVRP.

4.3. Model-Based vs. Model-Free Reactive Approach

The primary objective of MDPs is to determine the optimal policy, specifying the best action for each state to maximise cumulative rewards. Achieving this objective can be approached using various methods.

4.3.1. Model-Based Approach

Within the domain of MDP solution methods, one category employs a closed-form model of the environment. This model allows the agent to anticipate the consequences of its actions. These methods, known as Model-based approaches, encompass learning the underlying MDP transition and reward functions, essentially learning the governing rules. The rules, forming the model, encapsulate the reward and transition functions, enabling the agent to estimate the outcomes of its actions.
As depicted in Figure 10, the agent refines and optimises both its decision making and the model via interactions with the environment. Refining the model using experiences from the actual environment involves iteratively adjusting the transition and reward functions so that the model estimates the expected outcomes more accurately, i.e., decreasing the delta between the approximated and actual outcomes of actions.
The simulated model contains knowledge that enables the agent to estimate the value function or policy without directly interacting with the actual environment. However, a notable limitation of this approach is its sensitivity to the accuracy of the transition and reward functions. The results may deviate if these functions do not accurately represent the actual environment. Furthermore, determining transition probabilities can be challenging in complex, large-scale systems, leading to the curse of modelling.
Dynamic Programming (DP) serves as a prominent example of a model-based approach. DP relies on an explicit model of the MDP, including transition probabilities and immediate rewards. With this knowledge, DP computes the optimal policy via iterative processes, propagating value updates across states and actions [105].

4.3.2. Model-Free Approach

On the other hand, Model-free methods, exemplified by the Q-learning algorithm [106] in RL, do not rely on a closed-form model of the environment. In this context, the model pertains to any information the agent can employ to predict the environment’s response to its actions [107]. Model-free approaches circumvent the curse of modelling by excluding the direct incorporation of transition and reward functions.
These methods directly learn the value of states or state-action pairs from interactions, bypassing the need for the explicit modelling of transition dynamics or reward functions. A common approach is to set initial values of states or state-action pairs and then utilise trial-and-error learning, where the agent explores various actions and learns from the outcomes, refining its knowledge based on the rewards received. Notably, this approach balances exploration and exploitation, combining both On-Policy (choosing the best-known actions) and Off-Policy (trying new actions) strategies.
Model-free techniques, such as Q-learning, engage directly with the environment via a continuous cycle of exploration and exploitation. They learn from real-world experiences, observing the consequences of actions without relying on a predefined model [108]. Instead of anticipating specific future states and rewards, these methods focus on approximating Q-values, representing the expected cumulative rewards associated with taking particular actions in specific states [109]. The agent aims to decrease the delta between the actual benefit of a state or state-action pair and its previously known, approximated benefit, i.e., value function.
While model-based approaches, like DP, can provide accurate predictions with an explicit model, they often face challenges when applied to complex real-world SDVRP problems due to the demanding computational requirements associated with constructing and maintaining detailed models [103]. In contrast, model-free methods offer a more flexible and versatile approach, as they can adapt and learn directly from the environment without the need for complete knowledge of the environment, including transition probabilities and reward functions [102].

4.3.3. Prominent Frameworks

Approximate Dynamic Programming (ADP) can be viewed as a hybrid approach that bridges the gap between model-based and model-free methods in solving MDPs. Similar to model-based methods, ADP relies on an explicit model of the MDP, including transition probabilities and rewards [85]. However, the essential distinction lies in ADP’s ability to handle situations where the explicit model might be imperfect or computationally expensive to use due to the curse of dimensionality [102]. ADP employs iterative algorithms to refine value functions or policies, similar to DP, by propagating value updates based on the explicit model. However, instead of computing exact solutions, as in DP, ADP utilises statistical approximation techniques based on observed samples and interactions [103]. Rather than enumerating all s S , ADP approximates values of states that the agent might visit, i.e., states that the process observed and only states that might be reachable from any previously explored state [110]. ADP is forward-planning-oriented, typically using simulations to approximate anticipated rewards. It can adapt its decision-making strategy through trial-and-error [103]. ADP is particularly suitable for settings without full knowledge of the environment [85].
Reinforcement Learning (RL) is a versatile computational approach that automates goal-directed learning and decision-making processes [108], making it applicable to SDVRPs. It extends MDP concepts by enabling learning from direct interactions with the environment, allowing the agent to adapt and refine its decision-making policy over time. Unlike traditional model-based methods, which require a perfect closed-form model, RL methods indirectly learn the reward function and transition probabilities via direct interaction. A key element in RL, the Q-function, denoted as Q ( s , x ) , represents the quality of taking action x in state s. Unlike model-based methods, which estimate the value V ( s ) of being in a state s, RL focuses on learning the value Q ( s , x ) of taking action x while in a state s. This approach enables RL to optimise its policy directly and iteratively without anticipating future states. While model-based instances of RL exist, the core strategy in RL involves learning the Q-function without explicit knowledge of the model. RL agents explore the environment through trial-and-error, refining their decision-making strategy by continuously improving the Q-function. This flexibility makes RL suitable for scenarios where the model is either unknown or too complex to model explicitly [102].

4.4. Approximation in the Scaling SDVRP Environment

In DP problems involving finite and small dimensions, it is feasible to construct a lookup table to store exact deterministic values for each state V k ( s k ) and corresponding state-action pair values Q ( s k , x k ) . The principle of optimality allows us to derive the optimal cost function via backward induction [105]. In other words, knowing the value of state V k + 1 ( s k + 1 ) , via Bellman’s optimality equation and backward recursion, it is feasible to determine the value of being in state V k ( s k ) . Thus, optimal values for each state s k S can be calculated iteratively by solving the maximisation problem [102]. However, this approach becomes impractical when dealing with large or infinite dimensions.
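The following small Python sketch illustrates this backward recursion on a deliberately tiny, deterministic finite-horizon MDP with tabular values; the states, actions, rewards, and transitions are synthetic and serve only to show how values propagate from the terminal decision point backwards.

```python
# Toy finite-horizon MDP: three states, two actions, deterministic transitions.
K = 4                                                                 # decision points
states, actions = [0, 1, 2], [0, 1]
reward = {(s, x): float(s + x) for s in states for x in actions}      # r(s_k, x_k)
next_state = {(s, x): (s + x) % 3 for s in states for x in actions}   # successor s_{k+1}

V = {K: {s: 0.0 for s in states}}      # terminal values V_K(s) = 0
policy = {}
for k in reversed(range(K)):           # backward recursion over decision points
    V[k] = {}
    for s in states:
        # Bellman optimality: pick the action maximising immediate reward
        # plus the already-computed value of the successor state.
        best_x, best_v = max(
            ((x, reward[(s, x)] + V[k + 1][next_state[(s, x)]]) for x in actions),
            key=lambda t: t[1],
        )
        V[k][s], policy[(k, s)] = best_v, best_x

print(V[0])            # optimal values at the first decision point
print(policy[(0, 0)])  # optimal action in state 0 at k = 0
```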
In SDVRPs, applying Bellman’s equation and backward recursion faces limitations due to the “three curses of dimensionality”: the vast sizes of the state space, action space, and stochastic information space [102]. The computational complexity arises when attempting to enumerate all state-action pairs, considering stochastic information and performing backward recursion to obtain values. Consequently, exact solutions for optimal policies are unattainable within a finite time, leading to the adoption of approximation techniques such as anticipatory decision policies [85].
Rather than enumerating all states, approximation techniques aim to approximate the expected rewards associated with a finite number of sampled states. The quality of estimates improves with more frequent samples and updates. However, when values are accessed infrequently, the quality of the approximation and the overall solution may suffer [108].
Simulating probable future events, i.e., stepping forward in time, poses two significant problems: generating suitable samples of possible stochastic outcomes and defining a decision-making method [103]. These challenges make solving SDVRPs a complex task, involving searching for actions in a complex mixed-integer program and evaluating each action in the context of potential future changes and subsequent actions [8].

4.5. Model-Induced Reactive Solution Methods

As model-induced methods encompass various architectural variations, each with distinctive characteristics, it is imperative to account for the problem’s features, as well as temporal and dimensional constraints. For a comprehensive overview and detailed explanation of the set of ADP and RL methods, readers are encouraged to consult the works presented in [102,103,105,108]. The following subsection will briefly introduce the most common approaches containing a model found in the SDVRP literature. The literature is divided into two main groups: solutions with a route-based MDP and solutions with an assignment-based MDP.
  • Assignment-Based MDP: Several reviewed studies fall under the assignment-based MDP category of approaches. Due to the state- and action-space curses of dimensionality, these methods primarily focus on decisions concerning the feasibility and benefit of accepting (or rejecting) dynamically arrived requests. Upon acceptance, the decision allocates the request to the most suitable vehicle and a routing decision is triggered (e.g., a myopic heuristic solver). To alleviate the curse of dimensionality, the complete routing information, defining the routing sequence of vehicles, is left out of the scope of the state space.
  • Route-Based MDP: In contrast, route-based MDPs enhance the complete set of spaces to include routing information, making it easier to construct or reoptimise routes upon new information disclosure [88]. This approach is practical when the solution approach heavily relies on the information of planned routes to guide online decisions. As observed in [86], the state space incorporates the complete routing sequence. Regarding the remaining available service time of a vehicle, the authors used the routing information to efficiently assess the cost of assigning future requests to a vehicle by predicting its location along its planned route.
Figure 11 illustrates the aforementioned MDP approaches (the images were originally exhibited in [88]). Figure 11a,b displays an example of an SDVRP transitioning from one state to the subsequent, followed by the revelation of stochastic information (e.g., state s k anticipates customer 8, while that same customer is later revealed in s k + 1 ). In contrast to the assignment-based model, the route-based variant transfers the complete tentative route (denoted via bold dashed lines) from s k to s k + 1 . As observed in Figure 11b, the left image contains a route originating from the prior state ( s k 1 ), while the middle image displays a route subject to the action taken in the current state ( s k ).
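As a purely schematic illustration of the two modelling choices, the sketch below contrasts an assignment-based state with a route-based state as simple Python data structures; the field names and types are assumptions for exposition and are not drawn from [86,88].

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Coord = Tuple[float, float]

@dataclass
class AssignmentBasedState:
    """State without explicit routing sequences: only the vehicle to which each
    accepted request is assigned is tracked; routes are rebuilt by a separate
    (e.g., myopic heuristic) solver whenever they are needed."""
    time: float
    vehicle_positions: List[Coord]
    assignments: Dict[int, int]          # request id -> vehicle id
    open_requests: List[int]             # revealed but not yet accepted/rejected

@dataclass
class RouteBasedState:
    """State that additionally carries the tentative planned route of each vehicle,
    so the cost of inserting a new request can be assessed directly from the state."""
    time: float
    vehicle_positions: List[Coord]
    planned_routes: List[List[int]]      # per-vehicle sequence of request ids
    open_requests: List[int]
```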

4.5.1. Value Function Approximation

Value Function Approximation (VFA) is a fundamental concept in ADP and RL to estimate the expected long-term value of being in a specific state V ( s ) or executing a given state-action pair Q ( s , x ) . It is vital in enhancing decision-making efficiency by approximating the value of different states or state-action pairs via a value function. These value functions assign numerical values to states ( V ( s ) ) or state-action pairs ( Q ( s , x ) ) and guide the agent’s policy. In the context of value functions, Equations (1) and (2) represent key concepts.
The parameters of Equations (1) and (2) are as follows: E π denotes the expectation operator concerning the policy π , meaning it calculates the expected value over all possible outcomes according to the policy. r k represents the immediate reward obtained by the agent when transitioning from state s k to s k + 1 . γ is the discount factor, a value between 0 and 1, which reflects the agent’s preference for current rewards over future rewards. It influences the weight given to future returns in the cumulative return calculation. V π ( s k + 1 ) is the expected cumulative return from the next state s k + 1 onward, following the same policy π . The key value functions are:
  • V(s): This term denotes the expected cumulative return an agent can achieve when it starts in a specific state s k and follows its policy π from that state onwards. In other words, it represents the potential long-term benefit of being in state s k , where the considered state does not have to be mandated by the policy.
    V^{\pi}(s_k) = \mathbb{E}_{\pi}\left[\, r_k + \gamma \, V^{\pi}(s_{k+1}) \mid s_k \,\right] \qquad (1)
  • Q(s,x): This term represents the benefit of taking action x while in state s and following the agent’s policy π from that action onwards. It is vital to note that the considered action, denoted as x, does not have to be strictly dictated by the policy. The agent can explore actions beyond those recommended by the current policy to gather valuable information and optimise its decision-making process.
    Q^{\pi}(s_k, x_k) = \mathbb{E}_{\pi}\left[\, r_k + \gamma \, Q^{\pi}(s_{k+1}, x_{k+1}) \mid s_k, x_k \,\right] \qquad (2)
It is worth highlighting that VFA does not necessarily approximate the complete state-action space. Instead, it estimates the values of a subset of state-action pairs that the agent encounters or deems significant during the learning process, necessitating a balance between two vital aspects of learning: on-policy exploitation and off-policy exploration [109].
To begin, the agent may have initial random estimates of values and then gradually refine these estimates via trial-and-error learning. The value function is subject to continuous learning and refinement. Agents learn from their experiences and adaptively refine their policies to minimise the delta between estimated and actual values. This learning process ultimately results in more accurate representations of expected long-term rewards [102].
Among the most prominent methods in the literature, two widely used approaches are Monte Carlo learning and Temporal Difference (TD) learning.
  • Monte Carlo learning is a relatively straightforward method for updating a value function. It calculates the updates based on the accumulated total rewards (discounted by γ ) in a learning episode, divided by the number of steps within that episode. While conceptually simple, this approach can be sample inefficient, especially when dealing with complex problems. Monte Carlo learning assigns equal significance to all states or actions in the episode and updates them equally. Equation (3) defines the accumulated discounted return of an episode, while Equations (4) and (5) give the Monte Carlo updates for the state-value and state-action value functions.
R_{\Sigma} = \sum_{k=1}^{n} \gamma^{k} r_{k} \qquad (3)
V_{\mathrm{new}}(s_k) = V_{\mathrm{old}}(s_k) + \frac{1}{n}\left( R_{\Sigma} - V_{\mathrm{old}}(s_k) \right), \quad \forall k \in [1, \dots, n] \qquad (4)
Q_{\mathrm{new}}(s_k, x_k) = Q_{\mathrm{old}}(s_k, x_k) + \frac{1}{n}\left( R_{\Sigma} - Q_{\mathrm{old}}(s_k, x_k) \right), \quad \forall k \in [1, \dots, n] \qquad (5)
  • Temporal Difference (TD) learning emphasises the influence of more recent events on the rewards. The core idea here is that recent rewards may have a greater impact on the accumulated rewards due to the presence of a discount factor ( γ ). A common instance of TD learning is TD(0), represented in Equation (6). This equation computes the TD error, denoting the difference between the TD target estimate (the observed reward and the discounted value of the next state) and the agent’s former best estimation of the state’s value.
V_{\mathrm{new}}(s_k) = \underbrace{V_{\mathrm{old}}(s_k)}_{\text{former best estimate}} + \underbrace{\alpha}_{\text{learning rate}} \underbrace{\Bigl( \underbrace{r_k + \gamma \, V_{\mathrm{old}}(s_{k+1})}_{\text{TD target estimate}} - V_{\mathrm{old}}(s_k) \Bigr)}_{\text{TD error}} \qquad (6)
TD learning serves as the basis for two well-known extensions: Q-learning and SARSA. Equations (7) and (8) represent Q-learning and SARSA, respectively. These extensions further refine TD learning and provide robust methods for updating action values and optimising decision-making policies.
Q_{\mathrm{new}}(s_k, x_k) = Q_{\mathrm{old}}(s_k, x_k) + \underbrace{\alpha}_{\text{learning rate}} \underbrace{\Bigl( \underbrace{r(s_k, x_k) + \gamma \max_{x \in X(s_{k+1})} Q(s_{k+1}, x)}_{\text{TD target estimate (best known action in } s_{k+1})} - Q_{\mathrm{old}}(s_k, x_k) \Bigr)}_{\text{TD error}} \qquad (7)
Q_{\mathrm{new}}(s_k, x_k) = Q_{\mathrm{old}}(s_k, x_k) + \underbrace{\alpha}_{\text{learning rate}} \underbrace{\Bigl( \underbrace{r(s_k, x_k) + \gamma \, Q_{\mathrm{old}}(s_{k+1}, x_{k+1})}_{\text{TD target estimate (on-policy action in } s_{k+1})} - Q_{\mathrm{old}}(s_k, x_k) \Bigr)}_{\text{TD error}} \qquad (8)
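A compact tabular sketch of the Q-learning update in Equation (7) is given below, with the on-policy SARSA target of Equation (8) indicated in a comment. The toy chain environment, the ε-greedy exploration parameters, and the learning rate are illustrative assumptions; real SDVRP state and action spaces would be far too large for a plain lookup table, which is precisely what motivates the approximation methods discussed in this section.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, eps, rng):
    if rng.random() < eps:                            # off-policy exploration
        return rng.choice(actions)
    return max(actions, key=lambda x: Q[(state, x)])  # on-policy exploitation

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain: states 0..4, actions -1/+1,
    reward 1 for reaching the terminal state 4."""
    rng = random.Random(seed)
    actions = [-1, 1]
    Q = defaultdict(float)
    for _ in range(episodes):
        s = 0
        while s != 4:
            x = epsilon_greedy(Q, s, actions, eps, rng)
            s_next = min(max(s + x, 0), 4)
            r = 1.0 if s_next == 4 else 0.0
            # Equation (7): the TD target uses the best known action in s_next.
            td_target = r + gamma * max(Q[(s_next, a)] for a in actions)
            # SARSA (Equation (8)) would instead use the action actually selected
            # in s_next by the current policy:
            #   td_target = r + gamma * Q[(s_next, epsilon_greedy(Q, s_next, actions, eps, rng))]
            Q[(s, x)] += alpha * (td_target - Q[(s, x)])
            s = s_next
    return Q

Q = q_learning()
print(round(Q[(3, 1)], 3))  # value of moving right from state 3 (close to 1.0)
```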
The following listings are all modelled as a route-based MDP. Addressing the DVRP with stochastic service requests, the work presented in [111] considered early and late request customers. The objective was to maximise the number of solicitations served, requiring the dispatcher to budget time ahead of future requests. The solution process included state-space aggregation and action-space restriction. Preemptive depot returns for a Stochastic Dynamic one-to-many Pickup and Delivery problem (SDPD) were explored in [112]. The authors introduced an Anticipatory Preemptive Depot Return (APDR) strategy, combining ADP and a routing heuristic. The solution approach contained state-space aggregation and action-space restriction. APDR approximated the value of choosing specific delivery requests, considering their impact on future rewards, and increased the number of deliveries per workday. Addressing the problem of estimating arrival times for service vehicle routing with unknown service requests, the authors in [113] proposed a method that anticipates future demands and their impact on arrival times, providing a state-dependent estimate. The approach included state-space aggregation. The proposed technique used offline simulation to learn the values associated with aggregated states. The results demonstrated significant improvements in the service level relative to the benchmarks. A hybrid approach combining offline and online solution methods for DVRP with stochastic requests was introduced in [114]. The authors utilised offline VFA and an online rollout algorithm to create a computationally tractable policy. The policy considered temporal and spatial anticipation of requests. The solution approach included state-space aggregation and action-space restriction.
Addressing the dynamic multi-period vehicle routing problem with stochastic service requests, the authors in [115] used an ADP method to estimate future rewards over the periods. The solution included state-space aggregation and action-space restriction. The study compared four policies, with VFA achieving the best results in most instance settings. Addressing the challenge of cost-efficient same-day delivery, the work presented in [73] introduced dynamic pricing and routing strategies with the Anticipatory Pricing and Routing Policy (APRP) method. The solution method included state-space aggregation and action-space restriction. The approach utilised offline VFA to approximate the opportunity cost for each state and delivery option. APRP outperformed fixed pricing policies and conventional temporal and geographical pricing strategies. The method demonstrated superior performance in revenue and customer service. Addressing the challenges of SDVRPs by combining online rollout algorithms and offline VFA, the authors in [116] provided better approximations of the remaining reward-to-go and more effective solutions. The model included a state-space aggregation approach to handle the complexity of route-based MDPs.
Introducing a new VFA method called Meso-parametric VFA (M-VFA), the work presented in [117] combined Non-parametric VFA (N-VFA) and Parametric VFA (P-VFA) simultaneously. M-VFA outperformed both N-VFA and P-VFA individually, providing a more effective VFA approach for dynamic customer acceptances in delivery routing. The solution process included state-space aggregation and action-space restriction. Addressing the dynamic stochastic electric VRP, the authors in [118] utilised safe RL and a route-based MDP model. Their approach employed VFA with state-space aggregation and action-space restriction to handle stochastic customer requests and energy consumption. Simulation results demonstrated significant energy savings with anticipative route planning and charging. Route-based VFA approaches utilising deep learning techniques, presented in [119,120,121,122,123], are further discussed in Section 4.5.4.
The following listings are all modelled as an assignment-based MDP. Focusing on ambulance redeployment, the work presented in [83] aimed to maximise the number of reached calls within a delay threshold. The model included state-space aggregation and action-space restriction. Addressing dynamic ambulance dispatching and relocation using ADP with VFA, the authors in [84] aimed to minimise the average response time, achieving a 12.89% decrease. The solution method included state-space aggregation and action-space restriction. The work presented in [124] introduced restocking-based rollout policies for the VRP with stochastic demand and duration limits. The model included action-space restriction. Focusing on the DVRP with stochastic travel times and traffic congestion, the authors in [66] proposed a rollout lookahead method that addressed the stochasticity of traffic conditions using the Bellman equation (VFA) to determine the subsequent vehicle’s destination. The approach achieved a 7% improvement over the static solution. Addressing the time-dependent green vehicle routing problem, the authors in [125] incorporated vehicle speed stochasticity to estimate fuel consumption, emissions, and travel costs via VFA.
Introducing a stochastic orienteering problem on a network of queues, the work presented in [79] aimed to maximise the expected rewards accumulated by determining which locations to visit and how long to wait in queues at each location. The authors proposed an ADP approach based on rollout algorithms to solve the problem. Exploring the impact of real-time information on the dynamic dispatching of service vehicles using ADP with VFA, the authors in [97] set the main focus on the best use cases of exogenous and endogenous triggers. The model included state-space aggregation. Addressing the stochastic-dynamic routing problem for bike-sharing systems, the authors in [63] aimed to avoid unsatisfied demand by dynamically relocating bikes during the day. The solution method included state-space aggregation. The proposed dynamic lookahead policy outperformed conventional relocation strategies and lookahead policies with static horizons. Investigating a ride-sharing system utilising an autonomous fleet of electric vehicles, the work presented in [126] proposed a spatio-temporal model that employed a value function to represent demand, considering the impact of current decisions on future demand. The solution approach included state-space aggregation.
Focusing on attended home delivery, the authors in [56] proposed a dynamic incentive mechanism for delivery slot management. The approach utilised an anticipatory decision policy to estimate the marginal fulfilment cost, accounting for future orders. Opposed to ride-sharing, the work presented in [127] used equilibrium inverse RL to imitate passenger-seeking behaviours in a ride-hailing service, employing a value iteration to find the optimal policy. Designing a crowd-sourcing delivery system, the authors in [128] used RL and a sequential decision-making approach for package routing with time constraints and multi-hop delivery. The authors introduced a delivery time estimation module. Real-world data from Shenzhen demonstrated the proposed method’s superiority, achieving a 40% profit rate increase and a 29% increase in delivery rates. Incorporating time estimation in action filtering improved profit and delivery rates by 9% and 8%, respectively. Introducing a learning-based optimisation approach for autonomous ride-sharing platforms, the authors in [129] improved service levels by iteratively improving policies for dispatching and rebalancing. The model considered the spatio-temporal distribution of user service level preferences and the availability of third-party vehicles, making it more robust and adaptive. Assignment-based VFA approaches utilising deep learning techniques, presented in [130,131,132,133], are further discussed in Section 4.5.4.

4.5.2. Policy Function Approximation

Policy Function Approximation (PFA) directly computes the best action in a given state without explicitly estimating the action or state values [8]. Unlike VFA, which focuses on approximating the value of actions within a state, PFA interprets a policy as a probability distribution across the action space. The agent samples actions from this distribution to make decisions guided by policy parameters ( Θ ) that define action selection probabilities in various states.
The fundamental idea behind PFA is to iteratively adjust policy parameters ( Θ ) to enhance the likelihood of selecting actions that result in higher rewards while diminishing the probability of choosing actions leading to lower rewards. PFA simplifies learning by directly updating the policy based on recorded experiences during each learning episode or iteration. As illustrated in Figure 12, the learning loop of PFA initiates and executes following the current policy until a termination condition is met. As the agent interacts with the environment, it records the observed states, actions, and rewards. PFA then assesses these records to identify actions that yield higher cumulative rewards over time, increasing the probabilities of such rewarding actions while decreasing the probabilities of actions leading to lower cumulative rewards.
However, it is essential to note that PFA has its challenges, including convergence issues and policy sub-optimality. Efficient adaptation with PFA often requires careful tuning and exploration [102]. Nonetheless, PFA is a suitable approach for complex problems, especially when the dimensionality of the action space poses challenges that VFA may not readily address [134].
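The sketch below illustrates the PFA learning loop on a toy two-action problem using a softmax policy over parameters Θ and a REINFORCE-style update that shifts probability mass towards actions observed to yield higher rewards. The synthetic reward distributions, learning rate, and episode count are arbitrary assumptions.

```python
import math
import random

def softmax(theta):
    exp = [math.exp(t) for t in theta]
    z = sum(exp)
    return [e / z for e in exp]

def reinforce(episodes=2000, alpha=0.05, seed=0):
    """Toy PFA/REINFORCE: two actions with stochastic rewards (action 1 is better
    on average). The policy parameters theta are nudged towards actions that
    yielded higher observed rewards."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]                                   # policy parameters (one per action)
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices([0, 1], weights=probs)[0]   # sample an action from the policy
        reward = rng.gauss(1.0 if action == 1 else 0.2, 0.5)
        # Policy-gradient step: increase the log-probability of the taken action
        # in proportion to the observed reward (gradient of log-softmax).
        for a in (0, 1):
            grad = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += alpha * reward * grad
    return softmax(theta)

print([round(p, 3) for p in reinforce()])   # probability mass shifts towards action 1
```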
Exploring the potential of combining parcel pickup stations and autonomous vehicles for same-day delivery, the authors in [80] proposed a PFA approach to make decisions coordinating vehicles. The actions were multifold, including deciding whether to dispatch now or delay until new information arises, the type of parcel to assign to a vehicle, and which station to assign the vehicle to. The model was formulated as a route-based MDP. The work presented in [135] utilised a route-based PFA with deep learning techniques, further discussed in Section 4.5.4.
Exploring the combination of vehicles and drones for same-day delivery operations, the authors in [134] aimed to reduce delivery costs and increase the number of customers served. The model was formulated as an assignment-based MDP and solved via a parametric PFA ADP approach. The PFA used a threshold of a vehicle’s travel time from the depot to split the service area into two zones. Customers within the threshold were preferably served by vehicles, while drones served customers outside the threshold. The PFA was determined via offline simulations, making it runtime-efficient and capable of offering immediate responses to customers.

4.5.3. Actor–Critic

The Actor–Critic (AC) architecture combines VFA and PFA, offering a comprehensive problem-solving approach [136]. In AC, two key components work together:
  • Actor: The actor is responsible for selecting actions based on the current policy-making decisions guided by the environment’s responses (PFA). The actor aims to learn and execute an optimal policy, continually adapting to maximise cumulative rewards.
  • Critic: The critic evaluates the actor’s actions by estimating the expected cumulative rewards that result from following the current policy (VFA). The critic’s evaluation provides feedback on the quality and effectiveness of the actor’s actions. Actions that lead to high expected cumulative rewards are considered favourable, while those leading to low expected cumulative rewards are viewed as unfavourable.
Both the actor and the critic components engage in continuous learning:
  • The actor focuses on learning and improving the optimal policy, adjusting its decision-making strategy based on the environment and the feedback from the critic.
  • The critic aims to minimise the error between actual observed and approximated values, refining its estimation of long-term rewards, which, in turn, enhances the quality of feedback provided to the actor.
This collaborative learning process ensures that the actor’s policy iteratively improves, driven by the critic’s evaluations. Figure 13 illustrates how an elemental AC architecture functions, with the actor and critic working together to enhance decision-making and policy adaptation.
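The following minimal tabular sketch shows the actor and critic interacting on a toy chain task: the critic learns state values with a TD(0) update (VFA), while the actor adjusts softmax action preferences using the critic’s TD error as feedback (PFA). The environment and all parameters are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

def actor_critic(episodes=1000, alpha_actor=0.1, alpha_critic=0.2, gamma=0.9, seed=0):
    """Toy actor-critic on a chain 0..4: the critic learns V(s) via TD(0); the
    actor adjusts softmax preferences H(s, x) using the TD error as feedback."""
    rng = random.Random(seed)
    actions = [-1, 1]
    H = defaultdict(float)   # actor: action preferences
    V = defaultdict(float)   # critic: state values

    def policy_probs(s):
        exp = {x: math.exp(H[(s, x)]) for x in actions}
        z = sum(exp.values())
        return {x: e / z for x, e in exp.items()}

    for _ in range(episodes):
        s = 0
        while s != 4:
            probs = policy_probs(s)
            x = rng.choices(actions, weights=[probs[a] for a in actions])[0]
            s_next = min(max(s + x, 0), 4)
            r = 1.0 if s_next == 4 else 0.0
            td_error = r + gamma * V[s_next] - V[s]   # critic's evaluation of the move
            V[s] += alpha_critic * td_error            # critic update (VFA)
            for a in actions:                          # actor update (PFA)
                grad = (1.0 if a == x else 0.0) - probs[a]
                H[(s, a)] += alpha_actor * td_error * grad
            s = s_next
    return V, H

V, H = actor_critic()
print({s: round(V[s], 2) for s in range(5)})
```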
Utilising an AC architecture, the authors in [137] combined deep learning and multi-agent techniques, further discussed in Section 4.5.5. The work presented in [76] combined an AC architecture with deep learning techniques, further discussed in Section 4.5.4.

4.5.4. Deep Learning

Deep Learning (DL) approaches employ neural networks to approximate value or policy functions [108]. Deep RL (DRL) leverages Deep Neural Networks (DNNs) to estimate these functions, benefitting from the ability of DNNs to represent a wide range of functions in high-dimensional spaces with high performance. A notable example is Deep Q-Learning (DQL) [138], which enhances value-based Q-Learning by employing DNNs to predict Q-values for each action in the action space. The DNN is trained to minimise the expected squared error between predicted and target Q-values. Similarly, deep approaches extend ADP methods by incorporating DNNs to approximate value functions or policies. By leveraging DNNs, these approaches learn complex mappings between states, actions, and rewards, enabling agents to make more refined decisions. One of the main drawbacks of deep learning approaches is their reliance on substantial training datasets.
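A condensed sketch of the DQL idea is given below using PyTorch: an online network predicts one Q-value per action, and a single gradient step minimises the squared error against a target built from a separate target network. The network sizes, the eight-dimensional state feature vector, and the synthetic batch are placeholder assumptions; replay buffers, exploration, and target-network synchronisation schedules are omitted for brevity.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 8, 4, 0.95   # e.g., aggregated SDVRP state features

# Online and target Q-networks mapping a state vector to one Q-value per action.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(states, actions, rewards, next_states, dones):
    """One DQL gradient step on a batch of (s, x, r, s') transitions."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + GAMMA * q_next * (1.0 - dones)
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

# Synthetic batch purely to demonstrate the update signature.
batch = 32
loss = dqn_update(
    torch.randn(batch, STATE_DIM),
    torch.randint(0, N_ACTIONS, (batch,)),
    torch.randn(batch),
    torch.randn(batch, STATE_DIM),
    torch.zeros(batch),
)
print(f"batch loss: {loss:.4f}")
```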
The following listings are all modelled as a route-based MDP. Addressing the dynamic vehicle routing problem with time windows, the authors in [119] observed known (deterministic) and stochastic customers and proposed a solution approach called DRLSA (Deep Reinforcement Learning with Simulated Annealing). DRLSA combined DRL with Simulated Annealing to efficiently generate optimised re-routing decisions. Value functions were approximated using neural networks accompanied by state-space aggregation. Addressing the problem of predicting accurate arrival times for restaurant meal deliveries on a delivery platform, the authors in [120] proposed supervised learning methods using Gradient-Boosted Decision Trees (GBDTs) and DNNs to optimise the expected number of served customers and estimate arrival times. The solution method included state-space aggregation and action-space restriction. The work presented in [135] proposed a hierarchical RL approach to efficiently compute solutions for large-scale Dynamic Pickup and Delivery Problems (DPDPs). The method included an action-space restriction. The approach combined two RL modules: an upper-level RL policy using Deep Q-Network (DQN) to segment time windows and transform DPDP into a static PDP, and a lower-level RL policy using Reinforce and Graph Neural Networks (GNNs) to assign orders to vehicles and arrange transportation routes.
Investigating same-day delivery with a heterogeneous fleet of vehicles and drones, the authors in [121] proposed a DQL approach for optimal task assignment to vehicles and drones, considering their speed and capacity. The solution method included an action-space restriction. Investigating a crowd shipping setting using a DRL approach, the work presented in [122] aimed to minimise total cost in a capacitated vehicle fleet, leveraging occasional drivers for deliveries. The solution approach included state-space aggregation and action-space restriction. The method employed a Neural Network (NN) as an estimation for VFA. The novel DRL method handled large instances and incorporated historical data for Monte Carlo simulations. Addressing fair same-day delivery services, the authors in [123] aimed to maximise the minimal regional service rate (fairness) alongside the overall service rate (utility). The problem was formulated as a multi-objective route-based MDP with an action-space restriction. The authors implemented a DQL approach. The proposed method effectively alleviated spatial and temporal unfairness in different customer geographies. The method directly modified the Q-value in RL to include a reward for equal service opportunity across regions, ultimately achieving geographic fairness.
The following listings are all modelled as an assignment-based MDP. Analysing dynamic ride-hailing with electric vehicles, the authors in [130] aimed to maximise profit. The solution approach included DRL with DNNs for Q-value approximations. The authors proposed policies to make optimal decisions for vehicle assignment, charging, and repositioning to meet the dynamic demand effectively. The DRL-based policy outperformed reoptimisation-based approaches on New York City real-world data. Focusing on ride-hailing platforms, the work presented in [131] employed a semi-MDP model and DRL for efficient passenger–driver matching. The goal was to leverage MDP and VFA to optimise the order dispatching process. The work presented in [76] investigated a novel DRL framework for the taxi dispatch problem, particularly for autonomous vehicles in a transportation network. The authors proposed an AC algorithm with DNNs as estimators for both AC components. The approach used policy gradient methods and a value function to update the parameters of the policy function iteratively. To address dynamic pickup and delivery problems, the authors in [132] proposed a Spatial–Temporal Aided Double Deep Graph Network (ST-DDGN). The method involved forecasting delivery demands using spatial–temporal prediction. Presenting a DRL approach for the meal delivery problem, the authors in [133] considered courier repositioning and order rejection to reduce late orders. The solution method included a Double Deep Q-Network (DDQN) for optimising courier assignments.

4.5.5. Multi-Agent Systems

Multi-Agent Systems (MA) approaches focus on scenarios where multiple agents interact with each other and the environment simultaneously [102]. Each agent aims to learn its own optimal or near-optimal policy while considering the actions and strategies of other agents. The agents’ policies may be competitive or cooperative, depending on the specific problem setting. Agents use techniques like decentralised or centralised control to approximate their value function or policy function and optimise their decision making while accounting for interactions and dependencies among agents [102]. Figure 14 and Figure 15 illustrate a basic example of decentralised and centralised control for learning and execution. The multi-agent approach introduces complexity to the system design as one has to account for multiple agents, their interactions, and influences on individual or joint learning and decision making.
The work presented in [137] proposed the Multi-Agent Routing model using Deep Attention Mechanisms (MARDAM). The approach addressed the dynamic capacitated DVRP with stochastic customers by combining the attention mechanism and an actor–critic architecture. MARDAM integrated fleet-state representation modules into the attention mechanism to construct multiple vehicle routes in parallel. The model used an actor–critic approach, where each agent had its own policy (actor) making decisions based on the attention mechanism, while a centralised coordinator agent (critic) collected information on states, actions, and rewards from all agents. The critic evaluated the joint cumulative reward for the entire system, enabling the model to make cooperative decisions and to penalise individual actions that may lead to suboptimal outcomes for the whole routing problem.

4.6. Non-Reactive Stochastic Sampling Solution Methods

Non-reactive model-free stochastic sampling approaches incorporate possible future events in current decision making by generating scenarios via probability distribution sampling [2]. Unlike more complex methods based on Bellman’s Equations, these model-free approaches are more straightforward but require extensive simulations to account for the problem’s stochastic nature. The key idea is to simulate various future scenarios and sample realisations of random variables to make decisions. Each scenario represents a set of potential outcomes. An advantage of scenario-based approaches is that the generated scenarios are static and deterministic; therefore, they can be solved using established conventional VRP heuristics and metaheuristics. A pivotal method in this domain is the Multiple Scenario Approach (MSA), introduced in [139]. The proposed MSA framework maintains a pool of scenarios containing realisations of random variables and corresponding solutions. The solutions include existing and future events. The proposed online method using MSA makes decisions on the subsequent customer visit by employing a decision function at each decision point. The authors evaluated three decision algorithms: Expectation, Regret, and Consensus, with the latter being deemed the best. One example of a consensus function is to sample the complete pool of scenarios and select the next customer based on the number of occurrences of that customer as the next visit in all scenarios. This simplifies the decision-making process, reducing the overall outcome space while retaining detailed information within the chosen paths. The optimised solution is then directly employed and offers a viable approach to solving online SDVRPs. However, the decision horizon in sampling approaches is limited, and increased sample paths and decision points lead to higher computational effort [6]. Due to the straightforwardness of such methods and the benefits of combining them with established heuristics and metaheuristics, stochastic sampling approaches have a long history of solving SDVRPs.
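In the spirit of the MSA consensus idea, the following simplified Python sketch samples several scenarios of future requests, solves each with a cheap nearest-neighbour heuristic, and selects the known request that most frequently appears as the next visit. The sampling model and the heuristic are illustrative stand-ins and do not reproduce the algorithms of [139].

```python
import math
import random
from collections import Counter

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbour_route(start, customers):
    """Cheap static-VRP heuristic used to solve each sampled (deterministic) scenario."""
    route, pos, remaining = [], start, list(customers)
    while remaining:
        nxt = min(remaining, key=lambda c: dist(pos, c))
        route.append(nxt)
        remaining.remove(nxt)
        pos = nxt
    return route

def consensus_next_visit(vehicle_pos, known_requests, num_scenarios=30,
                         future_per_scenario=3, seed=0):
    """Consensus-style decision: count how often each known request appears first
    in the solved scenarios and pick the most frequent one."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_scenarios):
        sampled_future = [(rng.random(), rng.random()) for _ in range(future_per_scenario)]
        route = nearest_neighbour_route(vehicle_pos, known_requests + sampled_future)
        first_known = next(c for c in route if c in known_requests)
        votes[first_known] += 1
    return votes.most_common(1)[0][0]

random.seed(1)
known = [(random.random(), random.random()) for _ in range(4)]
print("Consensus next visit:", consensus_next_visit((0.5, 0.5), known))
```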
The work presented in [140] utilised a dynamic stochastic hedging heuristic solution method that generated sample scenarios and solved them via static methods. The approach sampled solutions and derived a plan by combining standard features among these solutions during the initial planning and the online execution horizon. Considering stochastic travel times via stochastic sampling of future events, the authors in [141] utilised insertion heuristics and local improvement procedures. Investigating waiting and relocation strategies, the authors in [142] derived decisions from stochastic information. The authors decided on when and where to wait, directing vehicles towards accepted or sampled customer locations. The work presented in [143] repositioned idle vehicles based on anticipated future demand. The approach derived short-term horizon demands via Monte Carlo sampling procedures. Upon new request arrivals, the method sampled a short-term horizon, evaluated the solutions, and selected the fittest available solution via a decision process.
Utilising an Adaptive Large Neighbourhood Search (ALNS) heuristic, the authors in [144] considered multiple scenarios for future customer requests. Expanding on the MSA, the authors in [145] proposed a jMSA framework. The framework was flexible, parallel, and event-driven. The method optimised the generated scenarios via Adaptive Variable Neighbourhood Search (AVNS) heuristics. The work presented in [146] extended and adapted the Variable Neighbourhood Search (VNS) for dynamic environments by sampling stochastic scenarios. The works presented in [70,147] utilised the MSA with anticipatory metaheuristics and decomposed the problem into sequential static VRPs. The authors integrated randomly sampled future requests into Tabu Search metaheuristics. Investigating the maritime SDVRP and the role of scheduled departure times for vessels, the authors in [148] utilised Tabu Search (TS) and incorporated future cargo requests via sample scenarios.
Considering SDVRPs with stochastic travel time matrices, the authors in [67] sampled potential outcomes via scenarios. The work presented in [149] built on the MSA and modelled a dynamic pickup and delivery problem as an assignment-based MDP. At each decision epoch, the approach constructed different scenarios based on the current pre-decision state and random samples of the stochastic parameter set. The consensus function then identified when waiting at the depot in anticipation of future requests was beneficial, as well as which requests should be assigned to specific vehicles for service while stalling decisions about other customers. Considering Peer-To-Peer (P2P) transportation platforms for dynamically matching requests to third-party suppliers, the authors in [150] utilised MSA by sampling potential supplier selections and employed a consensus function to derive decisions.
A substantial body of literature has been dedicated to addressing the challenges of SDVRPs. For further reference on SDVRP approaches utilising (meta-)heuristics and stochastic sampling, one can refer to the extensive collection of studies within the heuristic approaches domain [50,51,52,57,64,65,71,72,151,152,153,154,155,156,157,158,159], as well as the metaheuristic domain [48,53,68,69,70,74,77,78,147,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175].

5. SDVRP Datasets

The Stochastic Dynamic Vehicle Routing Problem (SDVRP) is a complex optimisation challenge in urban logistics and transportation management. Addressing this problem requires the development and testing of solution approaches. While a significant body of research focuses on the SDVRP, one notable gap exists: the lack of standardised, widely accepted test instances. Most reviewed articles do not explicitly disclose the utilised dataset, hindering the efficient comparison of solution approaches. This subsection introduces the open-source datasets found in the SDVRP literature, to the best of our knowledge, and highlights the need for generating general benchmark instances.
  • TLC Trip Record Data: The TLC trip record dataset is a comprehensive collection of yellow and green taxi trip records and for-hire vehicle (FHV) trip records in New York City. It includes pick-up and drop-off times, locations, distances, fares, rate types, payment methods, and passenger counts. Covering data from 2009 to the present, it is a valuable resource for urban transportation and logistics analysis. Nevertheless, it is crucial to note that the dataset’s accuracy and completeness are not assured. Researchers should validate the data quality for their specific use case. For a detailed description, refer to [176]. An example of the utilisation of the dataset can be found in the work presented in [177].
  • 2021 Amazon Last Mile Routing Research Challenge Dataset: This dataset, created for the 2021 Amazon Last Mile Routing Research Challenge, is a valuable resource offering a vast real-world collection of routing data with 6112 training routes and 3072 evaluation routes. These routes, anonymously and securely obfuscated, represent Amazon delivery driver activities across five major U.S. metropolitan areas. The dataset encompasses comprehensive features for each route, stop, and package, including IDs, geographical details, temporal details, delivery statuses, time windows, package dimensions, vehicle capacities, service sequences, transit times, route scores, etc. It provides rich insights for route optimisation and logistics analysis. For a detailed description, refer to [178].
  • Instances for The Same-Day Delivery Problem for Online Purchases: These datasets are derived from Solomon’s and Gehring’s works, offering comprehensive Same-Day Delivery Problem data. The datasets consist of full-day or half-day heterogeneous and homogeneous data. Each dataset includes various parameters, including time window type, geography type, arrival rate, realisation type, and instance number. Furthermore, the datasets encompass different characteristics, including request numbers, coordinates, arrival time, and time window details. For a detailed description, refer to [149].
  • EURO Meets NeurIPS 2022 Vehicle Routing Competition: This competition considers the vehicle routing problem with time windows (VRPTW) and a dynamic variant in which new orders arrive during the day. The competition dataset included real-world static instances with an explicit duration matrix providing (non-euclidean) real-world road driving times between customers. The number of customers is between 200 and 1000. Each customer has a coordinate, a demand, a time window, and a service duration. For the dynamic variant of the problem, an environment was provided that sampled the locations, demands, service times, and time windows of requests uniformly from the data of a static VRPTW instance. Using the provided data, one may determine the degree of dynamism and employ a probability distribution to generate instances fit for SDVRPs. For a detailed description, refer to [179].
Despite the usefulness of these datasets, it is clear that there are no universally accepted, standardised instances tailored to the SDVRP domain. Researchers often create custom synthetic instances with specific characteristics to reflect their study’s objectives. The development and adoption of benchmark instances would significantly advance the field of SDVRP research, promoting methodological rigour and enabling the comparison of various solution approaches under equal conditions.

6. Discussion

The amount of information in Stochastic Dynamic Vehicle Routing Problems (SDVRPs) results in a vast number of possible states, making the consideration of each state individually impractical. A common technique in the SDVRP literature for handling complex, high-dimensional spaces is aggregation [85]. This technique extracts similarities among states and compresses them into a cluster of representative features. Aggregation helps surmount the curse of dimensionality but also decreases the granularity of states, effectively impacting the accuracy of approximations. Therefore, for a reliable approximation, it is imperative to maintain a sufficient number of features. The features must allow frequent observations and reliable estimations while emphasising contrasts between states to avoid assigning similar values to diverse states [115]. Recent SDVRP approaches have shifted towards utilising neural networks that can interpolate between known states and operate directly on unaggregated states [8].
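As a concrete illustration of aggregation coupled with a lookup-table value function approximation, the sketch below compresses a detailed state into a coarse feature tuple and averages observed rewards-to-go per aggregated key. The feature choices (time bucket, open-request bucket, capacity ratio) are illustrative assumptions rather than the feature set of any specific reviewed study.

```python
from collections import defaultdict

def aggregate(state, time_bucket=30, request_bucket=5):
    """Compress a detailed state into a coarse feature tuple (illustrative)."""
    return (state["time"] // time_bucket,
            min(state["open_requests"] // request_bucket, 10),
            round(state["free_capacity_ratio"], 1))

class LookupTableVFA:
    """Value function approximation over aggregated states."""
    def __init__(self):
        self.value = defaultdict(float)
        self.visits = defaultdict(int)

    def estimate(self, state):
        return self.value[aggregate(state)]

    def update(self, state, observed_reward_to_go):
        key = aggregate(state)
        self.visits[key] += 1
        n = self.visits[key]
        # incremental averaging of the observed rewards-to-go per key
        self.value[key] += (observed_reward_to_go - self.value[key]) / n

vfa = LookupTableVFA()
state = {"time": 95, "open_requests": 7, "free_capacity_ratio": 0.42}
vfa.update(state, observed_reward_to_go=12.0)
print(vfa.estimate({"time": 100, "open_requests": 8, "free_capacity_ratio": 0.44}))
```

Two distinct states that map to the same aggregated key share one value estimate, which is precisely the granularity-for-scalability trade-off discussed above.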
The explored concept of aggregation in SDVRPs helps capture the problem's ample state space. It is equally essential to address the vast combinatorial action space. The SDVRP action space usually encompasses two vital elements: assigning subsets of customers for service and vehicle routing [112]. Past approaches have addressed this by restricting the action space via enumeration and coarse state-space aggregation, employing value-based methods such as lookup tables; however, this may compromise solution quality [180]. More recent methods have explored deep neural networks to handle the vast action space effectively. Furthermore, a common approach is decomposition [85], which breaks the problem down into top-level and base-level sub-problems. For instance, the top level may involve assigning customers to vehicles, while the base level may involve optimising the service sequences. The top-level sub-problem is examined in detail, while established heuristics solve the base-level sub-problem. This action-space restriction simplifies the complex action space into a series of manageable tasks, allowing for sequential and more straightforward solutions [8]. Moreover, actions can be aggregated, providing a less granular representation of decision-making processes [85].
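A minimal sketch of such a decomposition is given below: the top level greedily assigns newly revealed customers to vehicles, while the base level sequences each route with cheapest insertion. The greedy assignment rule, Euclidean travel costs, and the toy data are simplifying assumptions for illustration, not a specific method from the reviewed literature.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def cheapest_insertion(route, customer, locations):
    """Base level: insert a customer into a route at minimal detour cost."""
    best_pos, best_cost = None, float("inf")
    for i in range(len(route) - 1):
        extra = (dist(locations[route[i]], locations[customer])
                 + dist(locations[customer], locations[route[i + 1]])
                 - dist(locations[route[i]], locations[route[i + 1]]))
        if extra < best_cost:
            best_pos, best_cost = i + 1, extra
    return route[:best_pos] + [customer] + route[best_pos:], best_cost

def assign_and_route(new_customers, routes, locations):
    """Top level: greedily assign each new customer to the vehicle whose
    route absorbs it at the lowest insertion cost (illustrative policy)."""
    for c in new_customers:
        candidates = [cheapest_insertion(r, c, locations) for r in routes]
        best_idx = min(range(len(routes)), key=lambda i: candidates[i][1])
        routes[best_idx] = candidates[best_idx][0]
    return routes

locations = {0: (0, 0), 1: (2, 1), 2: (5, 5), 3: (1, 4), 4: (6, 1)}
routes = [[0, 1, 0], [0, 2, 0]]          # depot is node 0
print(assign_and_route([3, 4], routes, locations))
```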
The classification table of reactive (ADP and RL) approaches in SDVRP research (Table 2) distinguishes two types of MDP (Markov Decision Process) models: assignment-based MDPs (A) and route-based MDPs (R). Each approach targets specific aspects of the problem, focusing on efficient acceptance–rejection and customer–vehicle assignment with or without route information. At the price of increased complexity, the latter provides a comprehensive and versatile model of the SDVRP that incorporates routing information (e.g., routing sequences) into the state space. This inclusion is practical when the solution approach relies heavily on planned-route data to guide decisions. To advance the practicability of such a model, future research should account for its increased granularity and devise appropriate solution methods.
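The difference between the two model types can be made concrete with minimal state definitions, sketched below with invented field names (an illustrative data layout, not a formal model from the cited works): the route-based state extends the assignment-based state with the tentative route plans that the policy can exploit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Coord = Tuple[float, float]

@dataclass
class AssignmentState:
    """Assignment-based MDP state: no explicit routing information."""
    time: float
    vehicle_positions: Dict[int, Coord]
    open_requests: Dict[int, Coord]
    assignments: Dict[int, int] = field(default_factory=dict)  # request -> vehicle

@dataclass
class RouteState(AssignmentState):
    """Route-based MDP state: additionally stores the planned route
    (sequence of request ids) of every vehicle."""
    planned_routes: Dict[int, List[int]] = field(default_factory=dict)

s = RouteState(time=120.0,
               vehicle_positions={1: (0.0, 0.0)},
               open_requests={7: (3.0, 4.0)},
               assignments={7: 1},
               planned_routes={1: [7]})
print(s.planned_routes)
```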
Table 2 enriches the classification originally presented in [8] by further categorising the literature based on the underlying architecture. Notably, the augmented table reveals a significant absence of specific categories, namely PFA, AC, and MA, pointing to opportunities for future research.
Future research in SDVRPs can follow several promising directions:
  • Policy Function Approximation (PFA): PFAs involve approximating the policy directly without explicitly computing value functions. While VFA methods are prevalent in the table, PFA approaches are notably missing. PFA can offer advantages in handling high-dimensional action spaces and may lead to more efficient and scalable solutions. Future research could explore PFA methods to address the combinatorial nature of the SDVRP action space. This approach could prove beneficial, especially in route-based SDVRPs governed by state-space aggregations and action-space restrictions.
  • Actor–Critic (AC): AC methods combine policy-based and value-based techniques. Although PFAs and VFAs are individually present in the table, AC methods remain scarce. Investigating AC methods in SDVRP research can improve policy evaluation and yield more stable learning processes, especially in scenarios with complex state and action spaces (see the sketch after this list).
  • Multi-Agent Systems (MA): The table primarily includes studies on single-agent SDVRP scenarios. However, real-world settings frequently involve multiple vehicles, leading to a multi-agent decision-making problem. MA approaches facilitate the decentralised evaluation of individual agents’ actions while considering the joint action space. Integrating MA techniques into SDVRP research could lead to scalable solutions that accommodate changes in fleet size and handle real-world instances efficiently.
  • Deep Learning (DL): As evaluating the complete state and action spaces of real-world instances is infeasible due to the curse of dimensionality, researchers have sacrificed granularity for scalability by opting for aggregation and decomposition techniques (presented as State-Space Aggregation (SSA) and Action-Space Restriction (ASR) in Table 2). Recent approaches explore solutions using deep learning, employing DNNs to handle large spaces. DNNs are effective models that can discover data patterns and relationships, making them suitable for addressing the high-dimensional and combinatorial nature of SDVRP decision making. The vital challenge is efficiently transferring DNN policies from training to practical SDVRP instances.
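To make the actor–critic direction tangible, the sketch below pairs a tabular softmax policy (the actor) with a state-value estimate (the critic) whose temporal-difference error drives the policy update. The toy acceptance environment, its rewards, and all parameter values are invented for illustration and do not correspond to any reviewed SDVRP method.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyAcceptanceEnv:
    """Toy dispatching episode: at each step a request arrives and the agent
    accepts (1) or rejects (0) it. Accepting pays +1 while free slots remain,
    but overbooking beyond the capacity costs -2. Purely illustrative."""
    def __init__(self, capacity=3, horizon=6):
        self.capacity, self.horizon = capacity, horizon

    def reset(self):
        self.load, self.t = 0, 0
        return self.load

    def step(self, action):
        reward = 0.0
        if action == 1:
            reward = 1.0 if self.load < self.capacity else -2.0
            self.load = min(self.load + 1, self.capacity)
        self.t += 1
        return self.load, reward, self.t >= self.horizon

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # actor parameters (policy logits)
v = np.zeros(n_states)                    # critic (state values)
alpha_actor, alpha_critic, gamma = 0.1, 0.2, 0.99
env = ToyAcceptanceEnv()

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        probs = softmax(theta[s])
        a = rng.choice(n_actions, p=probs)
        s_next, r, done = env.step(a)
        # the critic's TD error acts as the advantage signal for the actor
        td_error = r + (0.0 if done else gamma * v[s_next]) - v[s]
        v[s] += alpha_critic * td_error
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        theta[s] += alpha_actor * td_error * grad_log_pi
        s = s_next

# acceptance probability per load level: high while capacity remains, low at the limit
print([softmax(theta[s])[1].round(2) for s in range(n_states)])
```

In a realistic SDVRP, the tabular actor and critic would be replaced by (deep) function approximators over aggregated or raw states, but the update structure stays the same.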
Investigating methods that combine policy-based and value-based approaches, exploring multi-agent settings, and refining existing techniques can enhance efficiency and effectiveness, leading to more sustainable and customer-centric solutions. In doing so, one must bear in mind the time required to generate, evaluate, and enact decisions in large-scale, real-world instances. Efficient solution approaches must fulfil the prompt, real-time service expectations of contemporary customers.
Future research in non-reactive stochastic sampling approaches may explore the following directions:
  • Efficient Sampling Techniques: As stochastic sampling approaches rely heavily on simulations, future research can investigate more efficient sampling techniques that reduce computational effort while maintaining solution accuracy. Techniques that dynamically exclude unlikely or suboptimal states could enhance the efficiency of generating scenarios (a basic scenario-sampling loop is sketched after this list).
  • Extended Decision Horizon: A standard pattern among sampling approaches is to limit the sampling horizon to the near future, usually one step ahead. Limiting the horizon allows the method to allocate sufficient resources to optimising the sampled scenarios. Although this suits ad hoc approaches, it can come at the expense of long-term rewards. Expanding the sampling horizon to consider longer-term consequences could lead to more informed and beneficial routing decisions.
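The sketch below outlines the basic scenario-sampling loop that both directions would refine: candidate decisions are evaluated against a common batch of sampled futures, and the decision with the best average performance is enacted. The waiting-position decision, the uniform request distribution, and the helper names are illustrative assumptions rather than a specific published sampling heuristic.

```python
import random
import statistics

random.seed(7)

def sample_future_requests(n_expected, rng=random):
    """Draw one scenario of future requests (toy model: a fixed request
    count with coordinates uniform on a 10 x 10 service area)."""
    return [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(n_expected)]

def evaluate(decision, scenario):
    """Toy evaluation: a decision is the point where the vehicle waits;
    its cost is the total distance to the scenario's future requests."""
    wx, wy = decision
    return sum(((wx - x) ** 2 + (wy - y) ** 2) ** 0.5 for x, y in scenario)

def choose_by_sampling(candidate_decisions, n_scenarios=50, n_expected=5):
    """Evaluate every candidate against the same batch of sampled scenarios
    and pick the one with the lowest average cost (consensus-style)."""
    scenarios = [sample_future_requests(n_expected) for _ in range(n_scenarios)]
    avg_cost = {
        d: statistics.mean(evaluate(d, s) for s in scenarios)
        for d in candidate_decisions
    }
    return min(avg_cost, key=avg_cost.get), avg_cost

candidates = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]   # possible waiting positions
best, costs = choose_by_sampling(candidates)
print(best, {d: round(c, 1) for d, c in costs.items()})
```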
The presence of dynamism coupled with uncertainty creates an environment in which the optimal number of vehicles needed to fulfil requests and maintain service quality throughout the operating horizon is not explicitly known. Minimising the number of service vehicles could make the system more vulnerable to disruptions caused by unexpected events. To maintain operational flexibility and robustness, service providers may apply risk-mitigation strategies that deploy more service vehicles than necessary. Such tactics come with increased costs, which are further amplified when the extra vehicles remain predominantly idle. Still, owing to the inherent complexity of such minimisation considerations and a research focus that favours customer satisfaction, quality of service, and responsiveness, the SDVRP literature contains no direct research on the subject. As fleet minimisation could yield cost reductions, environmental benefits, and improved urban logistics efficiency, future research could explore adaptive fleet-sizing strategies that balance cost optimisation with service quality and produce risk-mitigating, resource-efficient solutions. This domain could have applications in emerging urban transportation models such as Mobility on Demand (MoD). MoD envisions shared mobility services and could benefit from optimal fleet sizing to offer cost-effective and eco-friendly transport options.
Due to the absence of standard benchmark instances, researchers often generate synthetic data, introducing variability that makes comparing and generalising results difficult. This data issue underscores the need for standardised test instances to facilitate better representation and analysis of SDVRP methods. With the absence of such benchmarks coupled with the scaling problem of SDVRPs, most researchers opt to define synthetic instances with small- to medium-scale vehicle fleets and customer pools [86].

7. Conclusions

The wide availability of on-demand services, products, and platforms has facilitated a shift in contemporary consumer behaviour. Given the abundance of service providers, consumers can promptly switch between products in pursuit of convenience and instant gratification, demanding fast and reliable service. Furthermore, the advent of crowd-sourced applications, accompanied by third-party occasional drivers and similar platforms, has introduced further challenges and complexities into SDVRPs. To maintain competitiveness, service providers must therefore account for the range of consumer-set quality standards. To meet these demanding requirements, providers must prioritise adapting and continuously refining proactive and anticipatory approaches.
In pursuit of identifying key trends in SDVRP solution approaches, this review observed a range of methods combining traditional heuristics and metaheuristics with cutting-edge reactive and non-reactive anticipatory techniques. Vital challenges among the observed approaches manifest in the ample SDVRP state, action, and stochastic information spaces, i.e., the curses of dimensionality. Given the high volumes of complex exogenous and endogenous information present in SDVRPs, it is crucial to navigate the underlying learning processes efficiently to accommodate scaling and volatile environments. Sacrificing granularity for scalability, researchers have opted for aggregation and decomposition techniques to overcome these problems. Recent approaches explore solutions using deep learning, as it can alleviate the curses of dimensionality while preserving the relative granularity of information. However, in the scope of this research, we observed that comprehensively addressing real-world SDVRPs encounters a set of challenges, emphasising a substantial gap in the research field that warrants further exploration. Alongside refining techniques for deriving innovative data patterns and relationships, it is vital to acknowledge the necessity of efficient explorative data-sampling techniques. It remains an integral challenge to dynamically and efficiently navigate the stochastic process of generating future scenarios and to incorporate the acquired knowledge into cutting-edge models efficiently.
In the scope of this research, it was found that the following approaches have not been sufficiently studied in the SDVRP domain: methods that combine policy-based and value-based techniques and multi-agent settings. Research into these avenues may enhance efficiency and effectiveness, leading to more sustainable and customer-centric solutions. Furthermore, due to the absence of standard benchmark instances, researchers often generate synthetic data, introducing variability that makes comparing and generalising results difficult. This data issue underscores the need for standardised test instances to facilitate better representation and analysis of SDVRP methods. With the absence of such benchmarks coupled with the scaling problem of SDVRPs, most researchers opt to define synthetic instances with small- to medium-scale vehicle fleets and customer pools.

Author Contributions

Conceptualisation, N.M. and T.E.; methodology, N.M.; formal analysis, N.M.; investigation, N.M.; resources, T.C., T.E. and N.M.; data curation, N.M.; writing—original draft preparation, N.M.; writing—review and editing, N.M., T.E., T.C. and M.Đ.; visualisation, N.M.; supervision, T.E., T.C. and M.Đ.; project administration, T.E. and T.C.; funding acquisition, T.E. and T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been partially supported by the project Routing of electric vehicles on a time-dependent road network, funded by the Postdoctoral Research Support Program at the Faculty of Transport and Traffic Sciences, University of Zagreb.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

During the preparation of this work, the authors used ChatGPT in order to improve readability and language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AC      Actor–Critic
ADP     Approximate Dynamic Programming
ANN     Artificial Neural Network
ASP     Algorithm Selection Problem
ASR     Action-Space Restriction
DARP    Dial-A-Ride Problem
DD      Dynamic and Deterministic
DDVRP   Deterministic Dynamic Vehicle Routing Problem
DL      Deep Learning
DNN     Deep Neural Network
DOD     Degree Of Dynamism
DP      Dynamic Programming
DQL     Deep Q-Learning
DQN     Deep Q-Network
DRL     Deep Reinforcement Learning
DS      Dynamic and Stochastic
DTSP    Dynamic Travelling Salesman Problem
DVRP    Dynamic Vehicle Routing Problem
EDOD    Effective Degree Of Dynamism
LS      Local Search
MA      Multi-Agent Systems
MDP     Markov Decision Process
MoD     Mobility on Demand
MSA     Multiple Scenario Approach
PFA     Policy Function Approximation
RH      Rolling Horizon
RL      Reinforcement Learning
RO      ReOptimisation
SARSA   State-Action-Reward-State-Action
SD      Static and Deterministic
SDVRP   Stochastic Dynamic Vehicle Routing Problem
SS      Static and Stochastic
SSA     State-Space Aggregation
TD      Temporal Difference
TSP     Travelling Salesman Problem
VFA     Value Function Approximation
VRP     Vehicle Routing Problem

References

  1. Ehmke, J.F. Integration of Information and Optimization Models for Routing in City Logistics; Number 978-1-4614-3628-7 in International Series in Operations Research and Management Science; Springer: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
  2. Pillac, V.; Gendreau, M.; Guéret, C.; Medaglia, A. A review of dynamic vehicle routing problems. Eur. J. Oper. Res. 2013, 225, 1–11. [Google Scholar] [CrossRef]
  3. Ritzinger, U.; Puchinger, J.; Hartl, R.F. A survey on dynamic and stochastic vehicle routing problems. Int. J. Prod. Res. 2016, 54, 215–231. [Google Scholar] [CrossRef]
  4. Psaraftis, H.N.; Wen, M.; Kontovas, C.A. Dynamic vehicle routing problems: Three decades and counting. Networks 2016, 67, 3–31. [Google Scholar] [CrossRef]
  5. Rios, B.; Xavier, E.; Miyazawa, F.; Amorim, P.; Ferian Curcio, E.; Santos, M. Recent dynamic vehicle routing problems: A survey. Comput. Ind. Eng. 2021, 160, 107604. [Google Scholar] [CrossRef]
  6. Soeffker, N.; Ulmer, M.W.; Mattfeld, D.C. Stochastic dynamic vehicle routing in the light of prescriptive analytics: A review. Eur. J. Oper. Res. 2022, 298, 801–820. [Google Scholar] [CrossRef]
  7. Zhang, J.; Woensel, T.V. Dynamic vehicle routing with random requests: A literature review. Int. J. Prod. Econ. 2023, 256, 108751. [Google Scholar] [CrossRef]
  8. Hildebrandt, F.D.; Thomas, B.W.; Ulmer, M.W. Opportunities for reinforcement learning in stochastic dynamic vehicle routing. Comput. Oper. Res. 2023, 150, 106071. [Google Scholar] [CrossRef]
  9. Ghiani, G.; Guerriero, F.; Laporte, G.; Musmanno, R. Real-time vehicle routing: Solution concepts, algorithms and parallel computing strategies. Eur. J. Oper. Res. 2003, 151, 1–11. [Google Scholar] [CrossRef]
  10. Laporte, G. The vehicle routing problem: An overview of exact and approximate algorithms. Eur. J. Oper. Res. 1992, 59, 345–358. [Google Scholar] [CrossRef]
  11. Psaraftis, H.N. A Dynamic Programming Solution to the Single Vehicle Many-to-Many Immediate Request Dial-a-Ride Problem. Transp. Sci. 1980, 14, 130–154. [Google Scholar] [CrossRef]
  12. Dantzig, G.B.; Ramser, J.H. The Truck Dispatching Problem. Manag. Sci. 1959, 6, 80–91. [Google Scholar] [CrossRef]
  13. Eksioglu, B.; Vural, A.V.; Reisman, A. The vehicle routing problem: A taxonomic review. Comput. Ind. Eng. 2009, 57, 1472–1483. [Google Scholar] [CrossRef]
  14. Fisher, M.L. Optimal Solution of Vehicle Routing Problems Using Minimum K-Trees. Oper. Res. 1994, 42, 626–642. [Google Scholar] [CrossRef]
  15. Ropke, M.I.S.; Cordeau, J.F.; Vigo, D. Branch-and-cut-and price for the capacitated vehicle routing problem with two-dimensional loading constraints. Proc. ROUTE 2007. [Google Scholar] [CrossRef]
  16. Eilon, S.; Watson-Gandy, C.D.T.; Christofides, N.; de Neufville, R. Distribution Management-Mathematical Modelling and Practical Analysis. IEEE Trans. Syst. Man, Cybern. 1974, 21, 589. [Google Scholar] [CrossRef]
  17. Gheysens, F.; Golden, B.; Assad, A. A comparison of techniques for solving the fleet size and mix vehicle routing problem. Oper.-Res.-Spektrum 1984, 6, 207–216. [Google Scholar] [CrossRef]
  18. Bräysy, O.; Gendreau, M. Vehicle routing problem with time windows, Part I: Route construction and local search algorithms. Transp. Sci. 2005, 39, 104–118. [Google Scholar] [CrossRef]
  19. Clarke, G.; Wright, J.W. Scheduling of Vehicles from a Central Depot to a Number of Delivery Points. Oper. Res. 1964, 12, 568–581. [Google Scholar] [CrossRef]
  20. Rosenkrantz, D.J.; Stearns, R.E.; Lewis, P.M., II. An Analysis of Several Heuristics for the Traveling Salesman Problem. SIAM J. Comput. 1977, 6, 563–581. [Google Scholar] [CrossRef]
  21. Vidal, T.; Crainic, T.G.; Gendreau, M.; Prins, C. Heuristics for multi-attribute vehicle routing problems: A survey and synthesis. Eur. J. Oper. Res. 2013, 231, 1–21. [Google Scholar] [CrossRef]
  22. Erdelić, T.; Carić, T. A Survey on the Electric Vehicle Routing Problem: Variants and Solution Approaches. J. Adv. Transp. 2019, 2019, 5075671. [Google Scholar] [CrossRef]
  23. Savelsbergh, M. The vehicle routing problem with time windows: Minimizing route duration. ORSA J. Comput. 1992, 4, 146–154. [Google Scholar] [CrossRef]
  24. Or, I. Traveling Salesman Type Combinatorial Problems and their Relation to the Logistics of Regional Blood Banking. Ph.D. Thesis, Northwestern University, Evanston, IL, USA, 1976. [Google Scholar]
  25. Lin, S. Computer solutions of the traveling salesman problem. Bell Syst. Tech. J. 1965, 44, 2245–2269. [Google Scholar] [CrossRef]
  26. Fosin, J.; Carić, T.; Ivanjko, E. Vehicle Routing Optimization Using Multiple Local Search Improvements. Automatika 2014, 55, 124–132. [Google Scholar] [CrossRef]
  27. Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; The MIT Press: Cambridge, MA, USA, 1992. [Google Scholar] [CrossRef]
  28. Resende, M.; Ribeiro, C.; Glover, F.; Marti, R. Scatter Search and Path-Relinking: Fundamentals, Advances, and Applications. In Handbook of Metaheuristics; Springer: Cham, Switzerland, 2010; pp. 87–107. [Google Scholar] [CrossRef]
  29. Dorigo, M.; Stützle, T. Ant Colony Optimization; The MIT Press: Cambridge, MA, USA, 2004. [Google Scholar] [CrossRef]
  30. Marinakis, Y.; Marinaki, M. Bumble bees mating optimization algorithm for the vehicle routing problem. In Handbook of Swarm Intelligence. Adaptation, Learning, and Optimization; Springer: Berlin/Heidelberg, Germany, 2011; Volume 8, pp. 347–369. [Google Scholar]
  31. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar] [CrossRef]
  32. Marinakis, Y.; Marinaki, M. A hybrid genetic—Particle Swarm Optimization Algorithm for the vehicle routing problem. Expert Syst. Appl. 2010, 37, 1446–1455. [Google Scholar] [CrossRef]
  33. Gendreau, M.; Potvin, J.Y. (Eds.) Handbook of Metaheuristics, 2nd ed.; Springer: New York, NY, USA, 2010. [Google Scholar]
  34. Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P. Optimization by Simulated Annealing. Science 1983, 220, 671–680. [Google Scholar] [CrossRef] [PubMed]
  35. Černý, V. Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm. J. Optim. Theory Appl. 1985, 45, 41–51. [Google Scholar] [CrossRef]
  36. Glover, F. Tabu search—Part I. ORSA J. Comput. 1989, 1, 190–206. [Google Scholar] [CrossRef]
  37. Glover, F. Tabu search—Part II. ORSA J. Comput. 1990, 2, 4–32. [Google Scholar] [CrossRef]
  38. Mladenović, N.; Hansen, P. Variable neighborhood search. Comput. Oper. Res. 1997, 24, 1097–1100. [Google Scholar] [CrossRef]
  39. Lourenço, H.R.; Martin, O.C.; Stützle, T. Iterated Local Search: Framework and Applications. In Handbook of Metaheuristics; Gendreau, M., Potvin, J.Y., Eds.; Springer: Boston, MA, USA, 2010; pp. 363–397. [Google Scholar] [CrossRef]
  40. Ropke, S.; Pisinger, D. An Adaptive Large Neighborhood Search Heuristic for the Pickup and Delivery Problem with Time Windows. Transp. Sci. 2006, 40, 455–472. [Google Scholar] [CrossRef]
  41. Pisinger, D.; Ropke, S. A general heuristic for vehicle routing problems. Comput. Oper. Res. 2007, 34, 2403–2435. [Google Scholar] [CrossRef]
  42. Mayer, T.; Uhlig, T.; Rose, O. Simulation-based Autonomous Algorithm Selection for Dynamic Vehicle Routing Problems with the Help of Supervised Learning Methods. In Proceedings of the 2018 Winter Simulation Conference (WSC), Gothenburg, Sweden, 9–12 December 2018. [Google Scholar] [CrossRef]
  43. Jakobović, D.; Đurasević, M.; Brkić, K.; Fosin, J.; Carić, T.; Davidović, D. Evolving Dispatching Rules for Dynamic Vehicle Routing with Genetic Programming. Algorithms 2023, 16, 285. [Google Scholar] [CrossRef]
  44. Jacobsen-Grocott, J.; Mei, Y.; Chen, G.; Zhang, M. Evolving heuristics for Dynamic Vehicle Routing with Time Windows using genetic programming. In Proceedings of the 2017 IEEE Congress on Evolutionary Computation (CEC), Donostia, Spain, 5–8 June 2017; pp. 1948–1955. [Google Scholar] [CrossRef]
  45. Gala, F.J.G.; Ðurasević, M.; Jakobović, D. Genetic Programming for Electric Vehicle Routing Problem with Soft Time Windows. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO’22), Boston, MA, USA, 9–13 July 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 542–545. [Google Scholar] [CrossRef]
  46. Gil-Gala, F.J.; Afsar, S.; Durasevic, M.; Palacios, J.J.; Afsar, M. Genetic Programming for the Vehicle Routing Problem with Zone-Based Pricing. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’23), Lisbon, Portugal, 15–19 July 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1118–1126. [Google Scholar] [CrossRef]
  47. Ichoua, S.; Gendreau, M.; Potvin, J.Y. Planned Route Optimization For Real-Time Vehicle Routing. In Dynamic Fleet Management: Concepts, Systems, Algorithms & Case Studies; Zeimpekis, V., Tarantilis, C.D., Giaglis, G.M., Minis, I., Eds.; Springer: Boston, MA, USA, 2007; pp. 1–18. [Google Scholar] [CrossRef]
  48. Larsen, A.; Madsen, O.; Solomon, M. Partially dynamic vehicle routing—Models and algorithms. J. Oper. Res. Soc. 2002, 53, 637–646. [Google Scholar] [CrossRef]
  49. Lund, K.; Madsen, O.; Rygaard, J. Vehicle Routing Problems with Varying Degrees of Dynamism; IMM Technical Report 1/96; Institute of Mathematical Modelling, Technical University of Denmark: Copenhagen, Denmark, 1996. [Google Scholar]
  50. Bertsimas, D.J.; van Ryzin, G. A Stochastic and Dynamic Vehicle Routing Problem in the Euclidean Plane. Oper. Res. 1991, 39, 601–615. [Google Scholar] [CrossRef]
  51. Bertsimas, D.J.; van Ryzin, G. Stochastic and Dynamic Vehicle Routing in the Euclidean Plane with Multiple Capacitated Vehicles. Oper. Res. 1993, 41, 60–76. [Google Scholar] [CrossRef]
  52. Bertsimas, D.J.; van Ryzin, G. Stochastic and Dynamic Vehicle Routing with General Demand and Interarrival Time Distributions. Adv. Appl. Probab. 1993, 25, 947–978. [Google Scholar] [CrossRef]
  53. Binart, S.; Dejax, P.; Gendreau, M.; Semet, F. A 2-stage method for a field service routing problem with stochastic travel and service times. Comput. Oper. Res. 2016, 65, 64–75. [Google Scholar] [CrossRef]
  54. Gendreau, M.; Guertin, F.; Potvin, J.Y.; Séguin, R. Neighborhood search heuristics for a dynamic vehicle dispatching problem with pick-ups and deliveries. Transp. Res. Part C Emerg. Technol. 2006, 14, 157–174. [Google Scholar] [CrossRef]
  55. Buhrkal, K.; Larsen, A.; Ropke, S. The Waste Collection Vehicle Routing Problem with Time Windows in a City Logistics Context. Procedia Soc. Behav. Sci. 2012, 39, 241–254. [Google Scholar] [CrossRef]
  56. Vinsensius, A.; Wang, Y.; Chew, E.P.; Lee, L.H. Dynamic Incentive Mechanism for Delivery Slot Management in E-Commerce Attended Home Delivery. Transp. Sci. 2020, 54, 565–853. [Google Scholar] [CrossRef]
  57. Riley, C.; van Hentenryck, P.; Yuan, E. Real-Time Dispatching of Large-Scale Ride-Sharing Systems: Integrating Optimization, Machine Learning, and Model Predictive Control. arXiv 2020, arXiv:2003.10942v1. [Google Scholar]
  58. Toth, P.; Vigo, D. (Eds.) The Vehicle Routing Problem; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2001. [Google Scholar]
  59. Afsar, H.M.; Afsar, S.; Palacios, J.J. Vehicle routing problem with zone-based pricing. Transp. Res. Part E Logist. Transp. Rev. 2021, 152, 102383. [Google Scholar] [CrossRef]
  60. Ulmer, M.W.; Mattfeld, D.C.; Hennig, M.; Goodson, J.C. A Rollout Algorithm for Vehicle Routing with Stochastic Customer Requests. In Logistics Management; Mattfeld, D., Spengler, T., Brinkmann, J., Grunewald, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 217–227. [Google Scholar]
  61. Wen, M.; Cordeau, J.F.; Laporte, G.; Larsen, J. The dynamic multi-period vehicle routing problem. Comput. Oper. Res. 2010, 37, 1615–1623. [Google Scholar] [CrossRef]
  62. Alinaghian, M.; Aghaei, M.; Sabbagh, M. A Mathematical Model for Location of Temporary Relief Centers and Dynamic Routing of Aerial Rescue Vehicles. Comput. Ind. Eng. 2019, 131, 227–241. [Google Scholar] [CrossRef]
  63. Brinkmann, J.; Ulmer, M.W.; Mattfeld, D.C. Dynamic Lookahead Policies for Stochastic-Dynamic Inventory Routing in Bike Sharing Systems. Comput. Oper. Res. 2019, 106, 260–279. [Google Scholar] [CrossRef]
  64. Bopardikar, S.; Srivastava, V. Dynamic Vehicle Routing in Presence of Random Recalls. IEEE Control. Syst. Lett. 2020, 4, 37–42. [Google Scholar] [CrossRef]
  65. Klapp, M.A.; Erera, A.L.; Toriello, A. The Dynamic Dispatch Waves Problem for same-day delivery. Eur. J. Oper. Res. 2018, 271, 519–534. [Google Scholar] [CrossRef]
  66. Kim, G.; Ong, Y.S.; Cheong, T.; Tan, P.S. Solving the Dynamic Vehicle Routing Problem Under Traffic Congestion. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2367–2380. [Google Scholar] [CrossRef]
  67. Köster, F.; Ulmer, M.W.; Mattfeld, D.C.; Hasle, G. Anticipating emission-sensitive traffic management strategies for dynamic delivery routing. Transp. Res. Part D. Transp. Environ. 2018, 62, 345–361. [Google Scholar] [CrossRef]
  68. Sabar, N.R.; Bhaskar, A.; Chung, E.; Turky, A.; Song, A. A self-adaptive evolutionary algorithm for dynamic vehicle routing problems with traffic congestion. Swarm Evol. Comput. 2019, 44, 1018–1027. [Google Scholar] [CrossRef]
  69. Sheridan, P.K.; Gluck, E.; Guan, Q.; Pickles, T.; Balcıoğlu, B.; Benhabib, B. The dynamic nearest neighbor policy for the multi-vehicle pick-up and delivery problem. Transp. Res. A Part Policy Pract. 2013, 49, 178–194. [Google Scholar] [CrossRef]
  70. Ferrucci, F.; Bock, S. A general approach for controlling vehicle en-route diversions in dynamic vehicle routing problems. Transp. Res. Part B Methodol. 2015, 77, 76–87. [Google Scholar] [CrossRef]
  71. Sayarshad, H.R.; Chow, J.Y. A scalable non-myopic dynamic dial-a-ride and pricing problem. Transp. Res. Part B. Methodol. 2015, 81, 539–554. [Google Scholar] [CrossRef]
  72. Swihart, M.R.; Papastavrou, J.D. A stochastic and dynamic model for the single-vehicle pick-up and delivery problem. Eur. J. Oper. Res. 1999, 114, 447–464. [Google Scholar] [CrossRef]
  73. Ulmer, M.W. Dynamic Pricing and Routing for Same-Day Delivery. Transp. Sci. 2019, 54, 855–1152. [Google Scholar] [CrossRef]
  74. Steever, Z.; Karwan, M.; Murray, C. Dynamic Courier Routing for a Food Delivery Service. Comput. Oper. Res. 2019, 107, 173–188. [Google Scholar] [CrossRef]
  75. Alabbasi, A.; Ghosh, A.; Aggarwal, V. DeepPool: Distributed Model-Free Algorithm for Ride-Sharing Using Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2019, 20, 4714–4727. [Google Scholar] [CrossRef]
  76. Mao, C.; Liu, Y.; Shen, M. Dispatch of autonomous vehicles for taxi services: A deep reinforcement learning approach. Transp. Res. Part C Emerg. Technol. 2020, 115, 102626. [Google Scholar] [CrossRef]
  77. Okulewicz, M.; Mańdziuk, J. A metaheuristic approach to solve Dynamic Vehicle Routing Problem in continuous search space. Swarm Evol. Comput. 2019, 48, 44–61. [Google Scholar] [CrossRef]
  78. Yu, G.; Yang, Y. Dynamic routing with real-time traffic information. Oper. Res. 2019, 19, 1033–1058. [Google Scholar] [CrossRef]
  79. Zhang, S.; Ohlmann, J.; Thomas, B.W. Dynamic Orienteering on a Network of Queues. Transp. Sci. 2017, 52, 497–737. [Google Scholar] [CrossRef]
  80. Ulmer, M.W.; Streng, S. Same-Day Delivery with Pickup Stations and Autonomous Vehicles. Comput. Oper. Res. 2019, 108, 1–19. [Google Scholar] [CrossRef]
  81. Santos, D.; Xavier, E. Taxi and Ride Sharing: A Dynamic Dial-a-Ride Problem with Money as an Incentive. Expert Syst. Appl. 2015, 42, 6728–6737. [Google Scholar] [CrossRef]
  82. Zhou, M.; Jin, J.; Zhang, W.; Qin, Z.T.; Jiao, Y.; Wang, C.; Wu, G.; Yu, Y.; Ye, J. Multi-Agent Reinforcement Learning for Order-dispatching via Order-Vehicle Distribution Matching. In Proceedings of the CIKM ’19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 2645–2653. [Google Scholar] [CrossRef]
  83. Maxwell, M.; Restrepo, M.; Henderson, S.; Topaloglu, H. Approximate Dynamic Programming for Ambulance Redeployment. INFORMS J. Comput. 2010, 22, 266–281. [Google Scholar] [CrossRef]
  84. Schmid, V. Solving the dynamic ambulance relocation and dispatching problem using approximate dynamic programming. Eur. J. Oper. Res. 2012, 219, 611–621. [Google Scholar] [CrossRef] [PubMed]
  85. Ulmer, M.W. Approximate Dynamic Programming for Dynamic Vehicle Routing; Number 978-1-4614-3628-7 in International Series in Operations Research and Management Science; Springer: Cham, Switzerland, 2017. [Google Scholar] [CrossRef]
  86. Zhang, J.; Luo, K.; Florio, A.M.; Van Woensel, T. Solving large-scale dynamic vehicle routing problems with stochastic requests. Eur. J. Oper. Res. 2023, 306, 596–614. [Google Scholar] [CrossRef]
  87. Pureza, V.; Laporte, G. Waiting and Buffering Strategies for the Dynamic Pickup and Delivery Problem with Time Windows. INFOR Inf. Syst. Oper. Res. 2008, 46, 165–176. [Google Scholar] [CrossRef]
  88. Ulmer, M.W.; Goodson, J.C.; Mattfeld, D.C.; Thomas, B.W. On modeling stochastic dynamic vehicle routing problems. EURO J. Transp. Logist. 2020, 9, 100008. [Google Scholar] [CrossRef]
  89. Christiansen, M.; Fagerholt, K.; Rachaniotis, N.; Stålhane, M. Operational planning of routes and schedules for a fleet of fuel supply vessels. Transp. Res. Part E Logist. Transp. Rev. 2017, 105, 163–175. [Google Scholar] [CrossRef]
  90. Schyns, M. An Ant Colony System for Responsive Dynamic Vehicle Routing. Eur. J. Oper. Res. 2015, 245, 704–718. [Google Scholar] [CrossRef]
  91. Ferrucci, F.; Bock, S. Real-time control of express pickup and delivery processes in a dynamic environment. Transp. Res. Part B Methodol. 2014, 63, 1–14. [Google Scholar] [CrossRef]
  92. Ichoua, S.; Gendreau, M.; Potvin, J.Y. Diversion Issues in Real-Time Vehicle Dispatching. Transp. Sci. 2000, 34, 426–438. [Google Scholar] [CrossRef]
  93. Gendreau, M.; Guertin, F.; Potvin, J.Y.; Taillard, E. Parallel Tabu Search for Real-Time Vehicle Routing and Dispatching. Transp. Sci. 1999, 33, 381–390. [Google Scholar] [CrossRef]
  94. Mitrović-Minić, S.; Krishnamurti, R.; Laporte, G. Double-horizon based heuristics for the dynamic pickup and delivery problem with time windows. Transp. Res. Part B Methodol. 2004, 38, 669–685. [Google Scholar] [CrossRef]
  95. Hanshar, F.; Ombuki-Berman, B. Dynamic vehicle routing using genetic algorithms. Appl. Intell. 2007, 27, 89–99. [Google Scholar] [CrossRef]
  96. Dunnett, S.; Leigh, J.; Jackson, L. Optimising police dispatch for incident response in real time. J. Oper. Res. Soc. 2018, 70, 269–279. [Google Scholar] [CrossRef]
  97. Ulmer, M.W.; Heilig, L.; Voss, S. On the Value and Challenge of Real-Time Information in Dynamic Dispatching of Service Vehicles. Bus. Inf. Syst. Eng. 2017, 59, 161–171. [Google Scholar] [CrossRef]
  98. Bertsekas, D.P. Dynamic Programming and Optimal Control, 3rd ed.; Athena Scientific: Belmont, MA, USA, 2005; Volume 1. [Google Scholar]
  99. Branke, J.; Middendorf, M.; Noeth, G.; Dessouky, M. Waiting Strategies for Dynamic Vehicle Routing. Transp. Sci. 2005, 39, 298–312. [Google Scholar] [CrossRef]
  100. Gendreau, M.; Laporte, G.; Semet, F. A dynamic model and parallel tabu search heuristic for real-time ambulance relocation. Parallel Comput. 2001, 27, 1641–1653. [Google Scholar] [CrossRef]
  101. Bellman, R. A Markovian Decision Process. J. Math. Mech. 1957, 6, 679–684. [Google Scholar] [CrossRef]
  102. Powell, W.B. Reinforcement Learning and Stochastic Optimization; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2022. [Google Scholar]
  103. Powell, W.B. Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
  104. Littman, M. Markov Decision Processes. In International Encyclopedia of the Social & Behavioral Sciences; Smelser, N.J., Baltes, P.B., Eds.; Pergamon: Oxford, UK, 2001; pp. 9240–9242. [Google Scholar] [CrossRef]
  105. Bertsekas, D.P. Reinforcement Learning and Optimal Control; Athena Scientific Optimization and Computation Series; Athena Scientific: Nashua NH, USA, 2019. [Google Scholar]
  106. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  107. Ou, X.; Chang, Q.; Chakraborty, N. A Method Integrating Q-Learning with Approximate Dynamic Programming for Gantry Work Cell Scheduling. IEEE Trans. Autom. Sci. Eng. 2021, 18, 85–93. [Google Scholar] [CrossRef]
  108. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book: Cambridge, MA, USA, 2018. [Google Scholar]
  109. Brunton, S.L.; Kutz, J.N. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, 1st ed.; Cambridge University Press: Cambridge, MA, USA, 2019. [Google Scholar]
  110. Powell, W.B. What You Should Know About Approximate Dynamic Programming. Nav. Res. Logist. 2009, 56, 239–249. [Google Scholar] [CrossRef]
  111. Ulmer, M.W.; Mattfeld, D.; Köster, F. Budgeting Time for Dynamic Vehicle Routing with Stochastic Customer Requests. Transp. Sci. 2015, 52, 1–227. [Google Scholar] [CrossRef]
  112. Ulmer, M.W.; Thomas, B.W.; Mattfeld, D. Preemptive Depot Returns for Dynamic Same-Day Delivery. EURO J. Transp. Logist. 2018, 8, 327–361. [Google Scholar] [CrossRef]
  113. Ulmer, M.W.; Thomas, B.W. Enough Waiting for the Cable Guy—Estimating Arrival Times for Service Vehicle Routing. Transp. Sci. 2018, 53, 623–916. [Google Scholar]
  114. Ulmer, M.W.; Goodson, J.; Mattfeld, D.; Hennig, M. Offline-Online Approximate Dynamic Programming for Dynamic Vehicle Routing with Stochastic Requests. Transp. Sci. 2018, 53, 1–318. [Google Scholar] [CrossRef]
  115. Ulmer, M.W.; Soeffker, N.; Mattfeld, D. Value Function Approximation for Dynamic Multi-Period Vehicle Routing. Eur. J. Oper. Res. 2018, 269, 883–899. [Google Scholar] [CrossRef]
  116. Ulmer, M.W. Horizontal combinations of online and offline approximate dynamic programming for stochastic dynamic vehicle routing. Cent. Eur. J. Oper. Res. 2020, 28, 279–308. [Google Scholar] [CrossRef]
  117. Ulmer, M.W.; Thomas, B.W. Meso-parametric value function approximation for dynamic customer acceptances in delivery routing. Eur. J. Oper. Res. 2020, 285, 183–195. [Google Scholar] [CrossRef]
  118. Basso, R.; Kulcsár, B.; Sanchez-Diaz, I.; Qu, X. Dynamic stochastic electric vehicle routing with safe reinforcement learning. Transp. Res. Part Logist. Transp. Rev. 2022, 157, 102496. [Google Scholar] [CrossRef]
  119. Joe, W.; Lau, H.C. Deep Reinforcement Learning Approach to Solve Dynamic Vehicle Routing Problem with Stochastic Customers. In Proceedings of the International Conference on Automated Planning and Scheduling, Nancy, France, 26–30 October 2020; Volume 30, pp. 394–402. [Google Scholar] [CrossRef]
  120. Hildebrandt, F.D.; Ulmer, M.W. Supervised Learning for Arrival Time Estimations in Restaurant Meal Delivery. Transp. Sci. 2021, 56, 799–1110. [Google Scholar] [CrossRef]
  121. Chen, X.; Ulmer, M.W.; Thomas, B.W. Deep Q-learning for same-day delivery with vehicles and drones. Eur. J. Oper. Res. 2022, 298, 939–952. [Google Scholar] [CrossRef]
  122. Silva, M.; Pedroso, J.P.; Viana, A. Deep reinforcement learning for stochastic last-mile delivery with crowdshipping. EURO J. Transp. Logist. 2023, 12, 100105. [Google Scholar] [CrossRef]
  123. Chen, X.; Wang, T.; Thomas, B.W.; Ulmer, M.W. Same-day delivery with fair customer service. Eur. J. Oper. Res. 2023, 308, 738–751. [Google Scholar] [CrossRef]
  124. Goodson, J.; Thomas, B.W.; Ohlmann, J. Restocking-Based Rollout Policies for the Vehicle Routing Problem with Stochastic Demand and Duration Limits. Transp. Sci. 2015, 50, 150818112639001. [Google Scholar] [CrossRef]
  125. Çimen, M.; Soysal, M. Time-dependent green vehicle routing problem with stochastic vehicle speeds: An approximate dynamic programming algorithm. Transp. Res. Part D Transp. Environ. 2017, 54, 82–98. [Google Scholar] [CrossRef]
  126. Al-Kanj, L.; Nascimento, J.; Powell, W.B. Approximate dynamic programming for planning a ride-hailing system using autonomous fleets of electric vehicles. Eur. J. Oper. Res. 2020, 284, 1088–1106. [Google Scholar] [CrossRef]
  127. Oda, T. Equilibrium Inverse Reinforcement Learning for Ride-Hailing Vehicle Network. In Proceedings of the Web Conference 2021 (WWW’21), Ljubljana, Slovenia, 19–23 April 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 2281–2290. [Google Scholar] [CrossRef]
  128. Ding, Y.; Guo, B.; Zheng, L.; Lu, M.; Zhang, D.; Wang, S.; Son, S.H.; He, T. A City-Wide Crowdsourcing Delivery System with Reinforcement Learning. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 97. [Google Scholar] [CrossRef]
  129. Beirigo, B.; Schulte, F.; Negenborn, R. A Learning-Based Optimization Approach for Autonomous Ridesharing Platforms with Service Level Contracts and On-Demand Hiring of Idle Vehicles. Transp. Sci. 2022, 56, 567–798. [Google Scholar] [CrossRef]
  130. Kullman, N.; Cousineau, M.; Goodson, J.; Mendoza, J. Dynamic Ride-Hailing with Electric Vehicles. Transp. Sci. 2020, 56, 567–798. [Google Scholar] [CrossRef]
  131. Qin, Z.T.; Tang, X.; Jiao, Y.; Zhang, F.; Xu, Z.; Zhu, H.; Ye, J. Ride-Hailing Order Dispatching at DiDi via Reinforcement Learning. Interface 2020, 50, 272–286. [Google Scholar] [CrossRef]
  132. Li, X.; Luo, W.; Yuan, M.; Wang, J.; Lu, J.; Wang, J.; Lu, J.; Zeng, J. Learning to Optimize Industry-Scale Dynamic Pickup and Delivery Problems. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; IEEE Computer Society: Los Alamitos, CA, USA, 2021; pp. 2511–2522. [Google Scholar] [CrossRef]
  133. Jahanshahi, H.; Bozanta, A.; Cevik, M.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Kosucu, B.; Başar, A. A deep reinforcement learning approach for the meal delivery problem. Knowl.-Based Syst. 2022, 243, 108489. [Google Scholar] [CrossRef]
  134. Ulmer, M.W.; Thomas, B.W. Same-Day Delivery with a Heterogeneous Fleet of Drones and Vehicles. Networks 2018, 72, 475–505. [Google Scholar] [CrossRef]
  135. Ma, Y.; Hao, X.; Hao, J.; Lu, J.; Liu, X.; Tong, X.; Yuan, M.; Li, Z.; Tang, J.; Meng, Z. A Hierarchical Reinforcement Learning Based Optimization Framework for Large-scale Dynamic Pickup and Delivery Problems. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; NIPS: San Diego, CA, USA, 2021. [Google Scholar]
  136. Konda, V.R.; Tsitsiklis, J.N. On actor-critic algorithms. SIAM J. Control. Optim. 2003, 42, 1143–1166. [Google Scholar] [CrossRef]
  137. Bono, G.; Dibangoye, J.S.; Simonin, O.; Matignon, L.; Pereyron, F. Solving Multi-Agent Routing Problems Using Deep Attention Mechanisms. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7804–7813. [Google Scholar] [CrossRef]
  138. Yang, Z.; Xie, Y.; Wang, Z. A Theoretical Analysis of Deep Q-Learning. arXiv 2019, arXiv:1901.00137. [Google Scholar]
  139. Bent, R.W.; van Hentenryck, P. Scenario-Based Planning for Partially Dynamic Vehicle Routing with Stochastic Customers. Oper. Res. 2004, 52, 977–987. [Google Scholar] [CrossRef]
  140. Hvattum, L.M.; Løkketangen, A.; Laporte, G. Solving a Dynamic and Stochastic Vehicle Routing Problem with a Sample Scenario Hedging Heuristic. Transp. Sci. 2006, 40, 421–438. [Google Scholar] [CrossRef]
  141. Potvin, J.Y.; Xu, Y.; Benyahia, I. Vehicle routing and scheduling with dynamic travel times. Comput. Oper. Res. 2006, 33, 1129–1137. [Google Scholar] [CrossRef]
  142. Bent, R.W.; van Hentenryck, P. Waiting and Relocation Strategies in Online Stochastic Vehicle Routing. In Proceedings of the IJCAI’07: Proceedings of the 20th international Joint Conference on Artifical intelligence, Hyderabad, India, 6–12 January 2007; pp. 1816–1821. [Google Scholar]
  143. Ghiani, G.; Manni, E.; Quaranta, A.; Triki, C. Anticipatory algorithms for same-day courier dispatching. Transp. Res. Part E Logist. Transp. Rev. 2009, 45, 96–106. [Google Scholar] [CrossRef]
  144. Azi, N.; Gendreau, M.; Potvin, J.Y. A dynamic vehicle routing problem with multiple delivery routes. Ann. Oper. Res. 2010, 199, 103–112. [Google Scholar] [CrossRef]
  145. Pillac, V.; Guéret, C.; Medaglia, A. An event-driven optimization framework for dynamic vehicle routing. Decis. Support Syst. 2012, 54, 414–423. [Google Scholar] [CrossRef]
  146. Sarasola, B.; Doerner, K.; Schmid, V.; Alba, E. Variable neighborhood search for the stochastic and dynamic vehicle routing problem. Ann. Oper. Res. 2015, 236, 425–461. [Google Scholar] [CrossRef]
  147. Ferrucci, F.; Bock, S. Pro-active real-time routing in applications with multiple request patterns. Eur. J. Oper. Res. 2016, 253, 356–371. [Google Scholar] [CrossRef]
  148. Tirado, G.; Hvattum, L.M. Determining departure times in dynamic and stochastic maritime routing and scheduling problems. Flex. Serv. Manuf. J. 2017, 29, 553–571. [Google Scholar] [CrossRef]
  149. Voccia, S.A.; Campbell, A.M.; Thomas, B.W. The Same-Day Delivery Problem for Online Purchases. Transp. Sci. 2019, 53, 167–184. [Google Scholar] [CrossRef]
  150. Ausseil, R.; Pazour, J.; Ulmer, M.W. Supplier Menus for Dynamic Matching in Peer-to-Peer Transportation Platforms. Transp. Sci. 2021, 56, 1111–1408. [Google Scholar] [CrossRef]
  151. Papastavrou, J.D. A stochastic and dynamic routing policy using branching processes with state dependent immigration. Eur. J. Oper. Res. 1996, 95, 167–177. [Google Scholar] [CrossRef]
  152. Yang, J.; Jaillet, P.; Mahmassani, H. Real-Time Multivehicle Truckload Pickup and Delivery Problems. Transp. Sci. 2004, 38, 135–148. [Google Scholar] [CrossRef]
  153. Hvattum, L.M.; Løkketangen, A.; Laporte, G. A branch-and-regret heuristic for stochastic and dynamic vehicle routing problems. Networks 2007, 49, 330–340. [Google Scholar] [CrossRef]
  154. Xiang, Z.; Chu, C.; Chen, H. The study of a dynamic dial-a-ride problem under time dependent stochastic environments. Eur. J. Oper. Res. 2008, 185, 534–551. [Google Scholar] [CrossRef]
  155. Pavone, M.; Bisnik, N.; Frazzoli, E.; Isler, V. A Stochastic and Dynamic Vehicle Routing Problem with Time Windows and Customer Impatience. Mob. Netw. Appl. 2009, 14, 350–364. [Google Scholar] [CrossRef]
  156. Lorini, S.; Potvin, J.Y.; Zufferey, N. Online vehicle routing and scheduling with dynamic travel times. Comput. Oper. Res. 2011, 38, 1086–1090. [Google Scholar] [CrossRef]
  157. Vodopivec, N.; Miller-Hooks, E. An optimal stopping approach to managing travel-time uncertainty for time-sensitive customer pickup. Transp. Res. Part B Methodol. 2017, 102, 22–37. [Google Scholar] [CrossRef]
  158. Billing, C.; Jaehn, F.; Wensing, T. A multiperiod auto-carrier transportation problem with probabilistic future demands. J. Bus. Econ. 2018, 88, 1009–1028. [Google Scholar] [CrossRef]
  159. Hyland, M.; Mahmassani, H. Dynamic autonomous vehicle fleet operations: Optimization-based strategies to assign AVs to immediate traveler demand requests. Transp. Res. Part C Emerg. Technol. 2018, 92, 278–297. [Google Scholar] [CrossRef]
  160. Gendreau, M.; Laporte, G.; Séguin, R. A Tabu Search Heuristic for the Vehicle Routing Problem with Stochastic Demands and Customers. Oper. Res. 1996, 44, 469–477. [Google Scholar] [CrossRef]
  161. Secomandi, N. Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands. Comput. Oper. Res. 2000, 27, 1201–1225. [Google Scholar] [CrossRef]
  162. Secomandi, N. A Rollout Policy for the Vehicle Routing Problem with Stochastic Demands. Oper. Res. 2001, 49, 796–802. [Google Scholar] [CrossRef]
  163. Haghani, A.; Jung, S. A dynamic vehicle routing problem with time-dependent travel times. Comput. Oper. Res. 2005, 32, 2959–2986. [Google Scholar] [CrossRef]
  164. Ichoua, S.; Gendreau, M.; Potvin, J.Y. Exploiting Knowledge About Future Demands for Real-Time Vehicle Dispatching. Transp. Sci. 2006, 40, 211–225. [Google Scholar] [CrossRef]
  165. Pavone, M.; Frazzoli, E.; Bullo, F. Decentralized algorithms for stochastic and dynamic vehicle routing with general demand distribution. In Proceedings of the 2007 46th IEEE Conference on Decision and Control, New Orleans, LA, USA, 12–14 December 2007; pp. 4869–4874. [Google Scholar] [CrossRef]
  166. Novoa, C.; Storer, R. An approximate dynamic programming approach for the vehicle routing problem with stochastic demands. Eur. J. Oper. Res. 2009, 196, 509–515. [Google Scholar] [CrossRef]
  167. Pavone, M.; Frazzoli, E.; Bullo, F. Adaptive and Distributed Algorithms for Vehicle Routing in a Stochastic and Dynamic Environment. IEEE Trans. Autom. Control 2011, 56, 1259–1274. [Google Scholar] [CrossRef]
  168. Ferrucci, F.; Bock, S.; Gendreau, M. A pro-active real-time control approach for dynamic vehicle routing problems dealing with the delivery of urgent goods. Eur. J. Oper. Res. 2013, 225, 130–141. [Google Scholar] [CrossRef]
  169. Schilde, M.; Doerner, K.; Hartl, R. Integrating stochastic time-dependent travel speed in solution methods for the dynamic dial-a-ride problem. Eur. J. Oper. Res. 2014, 238, 18–30. [Google Scholar] [CrossRef]
  170. Albareda-Sambola, M.; Fernandez, E.; Laporte, G. The dynamic multiperiod vehicle routing problem with probabilistic information. Comput. Oper. Res. 2014, 48, 31–39. [Google Scholar] [CrossRef]
  171. Archetti, C.; Savelsbergh, M.; Speranza, M.G. The Vehicle Routing Problem with Occasional Drivers. Eur. J. Oper. Res. 2016, 254, 472–480. [Google Scholar] [CrossRef]
  172. Tirado, G.; Hvattum, L.M. Improved solutions to dynamic and stochastic maritime pick-up and delivery problems using local search. Ann. Oper. Res. 2017, 253, 825–843. [Google Scholar] [CrossRef]
  173. Zou, H.; Dessouky, M. A look-ahead partial routing framework for the stochastic and dynamic vehicle routing problem. J. Veh. Routing Algorithms 2018, 1, 73–88. [Google Scholar] [CrossRef]
  174. Klein, R.; Mackert, J.; Neugebauer, M.; Steinhardt, C. A model-based approximation of opportunity cost for dynamic pricing in attended home delivery. OR Spectrum Quant. Approaches Manag. 2018, 40, 969–996. [Google Scholar] [CrossRef]
  175. Wang, F.; Liao, F.; Li, Y.; Yan, X.; Chen, X. An ensemble learning based multi-objective evolutionary algorithm for the dynamic vehicle routing problem with time windows. Comput. Ind. Eng. 2021, 154, 107131. [Google Scholar] [CrossRef]
  176. NYC Taxi and Limousine Commission (TLC) Trip Record Data. Available online: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page (accessed on 30 November 2023).
  177. Bertsimas, D.; Jaillet, P.; Martin, S. Online Vehicle Routing: The Edge of Optimization in Large-Scale Applications. Oper. Res. 2019, 67, 1–294. [Google Scholar] [CrossRef]
  178. Merchán, D.; Arora, J.; Pachon, J.; Konduri, K.; Winkenbach, M.; Parks, S.; Noszek, J. 2021 Amazon Last Mile Routing Research Challenge: Data Set. Transp. Sci. 2022; online ahead of print. [Google Scholar] [CrossRef]
  179. EURO Meets NeurIPS 2022 Vehicle Routing Competition. Available online: https://euro-neurips-vrp-2022.challenges.ortec.com/ (accessed on 30 November 2023).
  180. Hildebrandt, F.D.; Thomas, B.W.; Ulmer, M.W. Where the Action is: Let’s make Reinforcement Learning for Stochastic Dynamic Vehicle Routing Problems work! arXiv 2021, arXiv:2103.00507. [Google Scholar]
Figure 1. Route planning of two vehicles.
Figure 2. Example of static and deterministic vehicle routing.
Figure 3. Example of static and stochastic vehicle routing.
Figure 4. Example of dynamic and deterministic vehicle routing.
Figure 5. Example of dynamic vehicle routing with stochastic requests.
Figure 6. Taxonomy overview. Reprinted with permission from Ref. [5]. 2023, Nikola Mardešić.
Figure 7. Lookahead policy example.
Figure 8. Canonical MDP elements.
Figure 9. SDVRP MDP block diagram example, adapted from [6] (https://creativecommons.org/licenses/by/4.0/ CC BY 4.0) (accessed on 4 December 2023).
Figure 10. Model-based example of the MDP.
Figure 11. SDVRP MDP model variants. Reprinted with permission from Ref. [88]. 2023, Nikola Mardešić. (a) Assignment-based SDVRP MDP; (b) Route-based SDVRP MDP.
Figure 12. PFA learning episode.
Figure 13. Basic AC architecture.
Figure 14. Decentralised MA.
Figure 15. Centralised MA.
Table 1. VRP classification. Reprinted with permission from Ref. [2]. 2023, Nikola Mardešić.

| Information Evolution \ Information Quality | Deterministic Input | Stochastic Input |
| --- | --- | --- |
| Input known before route planning (route remains constant) | Static and Deterministic | Static and Stochastic |
| Input changes during route execution (route adapts to changing conditions) | Dynamic and Deterministic | Dynamic and Stochastic |
Table 2. ADP and RL SDVRP classification.

| Reference | MDP |
| --- | --- |
| [83] Maxwell et al. (2010) | A |
| [84] Schmid (2012) | A |
| [124] Goodson et al. (2015) | A |
| [66] Kim et al. (2016) | A |
| [125] Çimen & Soysal (2017) | A |
| [79] Zhang et al. (2017) | A |
| [97] Ulmer et al. (2017) | A |
| [134] Ulmer & Thomas (2018) | A |
| [63] Brinkmann et al. (2019) | A |
| [126] Al-Kanj et al. (2020) | A |
| [56] Vinsensius et al. (2020) | A |
| [130] Kullman et al. (2020) | A |
| [131] Qin et al. (2020) | A |
| [76] Mao et al. (2020) | A |
| [127] Oda (2021) | A |
| [128] Ding et al. (2021) | A |
| [132] Li et al. (2021) | A |
| [137] Bono et al. (2021) | A |
| [129] Beirigo et al. (2022) | A |
| [133] Jahanshahi et al. (2022) | A |
| [111] Ulmer et al. (2015) | R |
| [112] Ulmer et al. (2018) | R |
| [113] Ulmer & Thomas (2018) | R |
| [114] Ulmer et al. (2018) | R |
| [115] Ulmer et al. (2018) | R |
| [73] Ulmer (2019) | R |
| [80] Ulmer & Streng (2019) | R |
| [116] Ulmer (2020) | R |
| [117] Ulmer & Thomas (2020) | R |
| [119] Joe & Lau (2020) | R |
| [135] Ma et al. (2021) | R |
| [120] Hildebrandt & Ulmer (2021) | R |
| [121] Chen et al. (2022) | R |
| [118] Basso et al. (2022) | R |
| [122] Silva et al. (2023) | R |
| [123] Chen et al. (2023) | R |
| [86] Zhang et al. (2023) | R |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
