Article

On the Impact of the Rules on Autonomous Drive Learning

Department of Engineering and Architecture, University of Trieste, 34127 Trieste, Italy
*
Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(7), 2394; https://doi.org/10.3390/app10072394
Submission received: 17 February 2020 / Revised: 19 March 2020 / Accepted: 25 March 2020 / Published: 1 April 2020
(This article belongs to the Special Issue Intelligent Transportation Systems: Beyond Intelligent Vehicles)

Abstract

Autonomous vehicles raise many ethical and moral issues that are not easy to deal with and that, if not addressed correctly, might be an obstacle to the advent of such a technological revolution. These issues are critical because autonomous vehicles will interact with human road users in new ways and current traffic rules might not be suitable for the resulting environment. We consider the problem of learning optimal behavior for autonomous vehicles using Reinforcement Learning in a simple road graph environment. In particular, we investigate the impact of traffic rules on the learned behaviors and consider a scenario where drivers are punished when they are not compliant with the rules, i.e., a scenario in which violation of traffic rules cannot be fully prevented. We performed an extensive experimental campaign, in a simulated environment, in which drivers were trained with and without rules, and assessed the learned behaviors in terms of efficiency and safety. The results show that drivers trained with rules enforcement are willing to reduce their efficiency in exchange for being compliant with the rules, thus leading to higher overall safety.

1. Introduction

In recent years, autonomous vehicles have attracted a lot of interest from both industrial and research groups [1,2]. The reasons for this growth are the technological advancement in the automotive field, the availability of faster computing units, and the increasing diffusion of the so-called Internet of Things. Autonomous vehicles collect a huge amount of data from the vehicle and from the outside environment, and are capable of processing these data in real time to assist decision-making on the road. The amount of collected information and the need for real-time computing make the design of driving algorithms a complex task to carry out with traditional techniques. Moreover, the sources of information may be noisy or may provide ambiguous information that could negatively affect the outcome of the driving algorithm. The combination of these factors makes it very hard, if not unfeasible, to define the driver behavior by developing a set of hand-crafted rules. On the other hand, the huge amount of available data can be leveraged by suitable machine learning techniques. The rise of deep learning in the last decade has proven its power in many fields, including the development of self-driving cars, and has enabled machines that take actions based on images collected by a front camera as the only source of information [3], or even using a biologically inspired event-driven camera [4].
The use of simulations and synthetic data [5] for training has made it possible to assess neural network capabilities in many realistic environments with different degrees of complexity. Many driving simulators have been designed, from low-level ones that allow the drivers to control the hand brake of their car [6], to higher-level ones, in which the drivers control their car acceleration and lane changes [7]. Some simulators model the traffic in an urban road network [8], others model intersection access [9,10,11,12] or roundabout insertion [13].
In a near-future scenario, the first autonomous vehicles on the roads will have to make decisions in a mixed traffic environment. Autonomous vehicles will have to cope with radically different road agents: machine-powered agents capable of processing information far more quickly than human drivers, and human drivers who may occasionally take unexpected actions. There will hardly be a single authority controlling each car in a centralized fashion, and thus every autonomous vehicle will have to take decisions on its own, treating all the other road agents as part of the environment. It may very well be the case that current traffic rules do not fit a scenario with self-driving cars.
In this work, we investigate to what extent traffic rules affect the drivers' optimization process. The problem of finding the optimal driving behavior subject to some traffic rules is highly relevant because it provides a way to define allowed behaviors for autonomous drivers, possibly without the need to manually craft those behaviors. A first approach to this problem consists of defining hard constraints on driver behavior and replacing forbidden actions with fallback ones [14]. Such an approach leads to drivers that are not explicitly aware of the rules: if those hard constraints were removed, driver behavior could change in unpredictable ways. Another approach consists of punishing behaviors that are not compliant with the rules, thus discouraging drivers from exhibiting those behaviors again. In this work, we investigate this second approach based on punishing undesired behaviors. In this scenario, drivers have to learn the optimal behavior that balances a trade-off between being compliant with the rules and driving fast while avoiding collisions. A scenario in which drivers have the chance of breaking the rules is particularly relevant because it could address the complex ethical issues regarding self-driving cars in a more flexible way (those issues are fully orthogonal to our work, however).
We perform the optimization of the self-driving controllers using Reinforcement Learning (RL), a powerful framework for finding the optimal policy for a given task according to a trial-and-error paradigm. In this framework, we consider the possibility of enforcing traffic rules directly in the optimization process, as part of the reward function. Experimental results show that unwanted behaviors can indeed be reduced with such an approach.

2. Related Works

The rise of Reinforcement Learning (RL) [15] as an optimization framework for learning artificial agents, and the outstanding results of its combination with neural networks [16], have recently broken new ground, making RL a promising technique for the automation of driving tasks. Deep learning advances have shown that a neural network is highly effective in automatically extracting relevant features from raw data [17], as well as in allowing an autonomous vehicle to take decisions based on information provided by a camera [3,4]. However, these approaches may not capture the complexity of planning decisions or predicting other drivers' behavior, and their underlying supervised learning approach could be unable to cope with multiple complex sub-problems at once, including sub-problems not relevant to the driving task itself [18]. There are thus many reasons to consider an RL self-driving framework, which can tackle driving problems by interacting with an environment and learning from experience [18].
An example of an autonomous driving task implementation based on Inverse Reinforcement Learning (IRL) was proposed by Sharifzadeh et al. [5]. The authors claimed that, in a task with a state space as large as driving, IRL can be effective in extracting the reward signal, using driving data from expert demonstrations. End-to-end low-level control with an RL driver was done by Jaritz et al. [6] in a simulated environment based on the racing game TORCS, in which the driver has to learn full control of its car, that is, steering, brake, gas, and even the hand brake to enable drifting. Autonomous driving is a challenging task for RL because it needs to ensure functional safety and every driver has to deal with the potentially unpredictable behavior of others [14]. One of the most interesting aspects of autonomous driving is learning how to efficiently cross an intersection, which requires providing suitable information on the intersection to the RL drivers [9], as well as correctly negotiating the access with other non-learning drivers and observing their trajectories [10,11]. Safely accessing an intersection is a challenging task for RL drivers, due to the nature of the intersection itself, which may be occluded, so that possible obstacles might not be clearly visible [12]. Another interesting aspect for RL drivers is learning to overtake other cars, which can be a particularly challenging task depending on the shape of the road section in which the cars are placed [19], but also on the vehicle size, as in [20], where an RL driver learns to control a truck-trailer vehicle on a highway with other regular cars. The authors of [21,22] provided extensive classifications of the state-of-the-art AI techniques employed in autonomous driving, together with the degrees of automation that are possible for self-driving cars.
Despite the engineering advancements in designing self-driving cars, the lack of a legal framework for these vehicles might slow down their adoption [23]. There are also important ethical and social considerations. It has been proposed to address the corresponding issues as an engineering problem, by translating them into algorithms to be handled by the embedded software of a self-driving car [24]. In this way, the solution of a moral dilemma would be computed from a given set of rules or other mechanisms, although the exact practical details and, most importantly, their implications remain unclear. The problem of autonomous vehicle regulation is particularly relevant in mixed-traffic scenarios, as stated by Nyholm and Smids [25] and Kirkpatrick [26], since human drivers may behave in ways that are unpredictable to the machines. This problem could be mitigated by providing human drivers with more technological devices that help them drive more similarly to robotic drivers, but mixed traffic ethics certainly introduces much deeper and more difficult problems [25].
A formalization of traffic rules for autonomous vehicles was provided by Rizaldi and Althoff [27], according to which a vehicle is not responsible for a collision if it satisfies all the rules at the moment of the collision. Another driving automation approach based on mixed traffic rules is proposed in [28], where the rules are inspired by current traffic regulation. Traffic rule synthesis could even be automated, as proposed in [29], where a set of rules is evolved to ensure traffic efficiency and safety. The authors considered rules expressed by means of a language generated from a Backus–Naur Form grammar [30], but other ways to express spatiotemporal properties have been proposed [31,32]. Given the rules, the task of automatically finding the control strategy for robotic systems subject to safety rules is considered in [33], where the agents have to solve the task while minimizing the number of violated rules. AI safety can also be pursued by letting humans intervene on agents to prevent unsafe situations and then training an algorithm to imitate the human intervention [34], thus reducing the amount of human labor required. A different strategy is followed in [35], where the authors defined a custom set of traffic rules based on the environment, the driver, and the road graph. With these rules, an RL driver learns to safely make lane-changing decisions, where the driver's decision making is combined with a formal safety verification of the rules, to ensure that only safe actions are taken by the driver. A similar approach is considered in [7], where the authors replaced the formal safety verification with a learnable safety belief module, as part of the driver's policy.

3. Model

We consider a simple road traffic scenario in the form of a directed graph where the road sections are edges and the intersections are vertices. Each road element is defined by a continuous linear space along its length and an integer number of lanes. In this scenario, a fixed number of cars move on the road graph according to their drivers' decisions for a given number of discrete time steps.

3.1. Road Graph

A road graph is a directed graph G = (S, I) in which the edges S represent road sections and the vertices I represent road intersections. Each road element p ∈ G is connected to the next elements n(p) ⊆ G, with n(p) ≠ ∅. Edges are straight one-way roads with one or more lanes. For each edge p, it holds that n(p) ⊆ I. Vertices can be either turns or crossroads, have exactly one lane, and are used to connect road sections. For each vertex p, it holds that n(p) ⊆ S and |n(p)| = 1. Every road element p ∈ G is defined by its length l(p) ∈ ℝ⁺ and its number of lanes w(p) ∈ ℕ, w(p) > 0. We do not take into account traffic lights or roundabouts in this scenario.
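For concreteness, the following Python sketch shows one possible encoding of such a road graph; the class and field names are ours and not part of the model definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RoadElement:
    """A road element p in G: either a section (edge) or an intersection (vertex)."""
    name: str
    length: float              # l(p) > 0
    lanes: int                 # w(p) >= 1; intersections always have exactly 1 lane
    is_section: bool           # True if p is in S, False if p is in I
    next: List["RoadElement"] = field(default_factory=list)  # n(p), never empty

# A two-lane section ending in a single-lane turn, which leads to another section
# (|n(p)| = 1 for vertices, as required by the model).
turn = RoadElement("turn_1", length=10.0, lanes=1, is_section=False)
section_a = RoadElement("section_a", length=100.0, lanes=2, is_section=True, next=[turn])
section_b = RoadElement("section_b", length=100.0, lanes=2, is_section=True)
turn.next = [section_b]
```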

3.2. Cars

A car simulates a real vehicle that moves on the road graph G: its position can be determined at any time of the simulation in terms of the currently occupied road element and the current lane. The car movement is determined by two speeds: the linear speed along the road element and the lane-changing speed across the lanes of the same element. At each time step, the car state is defined by the tuple (p, x, y, v_x, v_y, s), where p ∈ S ∪ I is the current road element, x ∈ [0, l(p)] is the position on the road element, y ∈ {1, …, w(p)} is the current lane, v_x ∈ [0, v_max] is the linear speed, v_y ∈ {−1, 0, 1} is the lane-changing speed, and s ∈ {alive, dead} is the status (the time reference is omitted for brevity). All cars have the same length l_car and the same maximum speed v_max.
At the beginning of a simulation, all cars are placed uniformly among the road sections, on all the lanes, ensuring that a minimum distance exists between cars i, j on the same road element (p_i = p_j), i.e., |x_i − x_j| > x_gap. The initial speeds of all cars are v_x = v_y = 0 and the status is s = alive.
At the following time steps, if the status of a car is s = dead, its position is not updated. Otherwise, if the status is s = alive, the position of the car is updated as follows. Let (a_x(t), a_y(t)) ∈ {−1, 0, 1} × {−1, 0, 1} be the driver action, composed, respectively, of the accelerating action a_x(t) and the lane-changing action a_y(t) (see details below). The linear speed and the lane-changing speed at time t + 1 are updated according to the driver action (a_x(t), a_y(t)) at time t as:
$$v_x(t+1) = \min\left(v_{\max},\ \max\left(v_x(t) + a_x(t)\,a_{\max}\,\Delta t,\ 0\right)\right)$$
$$v_y(t+1) = a_y(t)$$
where a_max is the intensity of the instantaneous acceleration and Δt is the duration of the discrete time step. The car linear position on the road graph at time t + 1 is updated as:
$$x(t+1) = \begin{cases} x(t) + v_x(t+1)\,\Delta t & \text{if } v_x(t+1)\,\Delta t \le x_{\text{stop}}(t) \\ v_x(t+1)\,\Delta t - x_{\text{stop}}(t) & \text{otherwise} \end{cases}$$
where x_stop is the remaining distance to the next road element, computed as:
$$x_{\text{stop}}(t+1) = l(p(t+1)) - x(t+1)$$
The car lane position at time t + 1 is updated as:
$$y(t+1) = \min\left(w(p(t+1)),\ \max\left(y(t) + v_y(t+1),\ 1\right)\right)$$
The road element at time t + 1 is computed as:
$$p(t+1) = \begin{cases} p(t) & \text{if } v_x(t)\,\Delta t \le x_{\text{stop}}(t) \\ \sim U(n(p(t))) & \text{otherwise} \end{cases}$$
where U(n(p(t))) denotes the uniform distribution over the next road elements of p(t). In other words, when a car exits its current road element, it enters one of the next elements chosen uniformly at random from n(p(t)).
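A minimal Python sketch of the above update equations, assuming a car represented as a dictionary and road elements as objects exposing length, lanes, and next (all names are hypothetical):

```python
import random

def step_car(car, a_x, a_y, a_max, v_max, dt):
    """One kinematic update of a car, following the equations above (illustrative sketch)."""
    p, x, y = car["p"], car["x"], car["y"]
    x_stop = p.length - x                      # distance to the end of the current element
    # Speed updates
    v_x = min(v_max, max(car["v_x"] + a_x * a_max * dt, 0.0))
    v_y = a_y
    # Linear position and road element updates (here we use v_x(t+1) in both conditions)
    if v_x * dt <= x_stop:
        x_new, p_new = x + v_x * dt, p
    else:
        p_new = random.choice(p.next)          # uniform choice among the next elements n(p)
        x_new = v_x * dt - x_stop
    # Lane update, clamped to the lanes of the (possibly new) element
    y_new = min(p_new.lanes, max(y + v_y, 1))
    car.update(p=p_new, x=x_new, y=y_new, v_x=v_x, v_y=v_y)
    return car
```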
Two cars collide if the distance between them is smaller than the car length l_car. In particular, for any two cars (p, x, y, v_x, v_y, s) and (p′, x′, y′, v′_x, v′_y, s′), the status at time t + 1 is updated as (we omit the time reference for readability):
$$s = \begin{cases} \text{dead} & \text{if } \left(p = p' \wedge |x - x'| < l_{\text{car}}\right) \vee \left(p' \in n(p) \wedge x_{\text{stop}} + x' < l_{\text{car}}\right) \\ \text{alive} & \text{otherwise} \end{cases}$$
When a collision occurs, we simulate an impact by giving the leading car a positive acceleration of intensity a_coll, and the following car a negative acceleration of intensity a_coll, for the next t_coll time steps. Collided cars are kept in the simulation for the next t_dead > t_coll time steps, thus acting as obstacles for the alive ones.
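The collision condition above can be sketched as follows (same hypothetical car and road-element representations as in the previous sketch):

```python
def collides(car, other, l_car):
    """True if `car` collides with `other` according to the status-update condition above (sketch)."""
    same_element = car["p"] is other["p"] and abs(car["x"] - other["x"]) < l_car
    x_stop = car["p"].length - car["x"]
    next_element = other["p"] in car["p"].next and x_stop + other["x"] < l_car
    return same_element or next_element
```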

3.3. Drivers

A driver is an algorithm associated with a car. Each driver is able to sense part of its car state and information from the road environment, and takes driving actions that affect its car state. Every driver's ability to see obstacles on the road graph is limited to the distance of view d_view.

3.3.1. Observation

For the driver of a car (p, x, y, v_x, v_y, s), the set of visible cars in the j-th relative lane, with j ∈ {−1, 0, 1}, is the union of the set V_same,j of cars that are in the same segment and in the same or an adjacent lane, and the set V_next of cars that are in one of the next segments p′ ∈ n(p), in both cases with a distance shorter than d_view:
$$V_{\text{same},j} = \left\{ (p', x', y', v'_x, v'_y, s') : p' = p \,\wedge\, 0 < x' - x \le d_{\text{view}} \,\wedge\, y' = y + j \right\}$$
$$V_{\text{next}} = \left\{ (p', x', y', v'_x, v'_y, s') : p' \in n(p) \,\wedge\, x_{\text{stop}} + x' \le d_{\text{view}} \right\}$$
We remark that the set of cars V_j = V_same,j ∪ V_next also includes the cars in the next segments: the current car is hence able to perceive cars in an intersection, when in a segment, or in the connected sections, when in an intersection, provided that they are closer than d_view.
The driver's observation is based on the concept of the j-th lane closest car c_j^closest, built on the set V_j defined above. For each driver, c_j^closest is the closest car in V_j:
$$c_j^{\text{closest}} = \begin{cases} \arg\min_{(p', x', y', v'_x, v'_y, s') \in V_j} \left[ \mathbb{1}(p' = p)\,(x' - x) + \mathbb{1}(p' \ne p)\,(x_{\text{stop}} + x') \right] & \text{if } V_j \ne \emptyset \\ \emptyset & \text{otherwise} \end{cases}$$
where V_j = V_same,j ∪ V_next and 𝟙: {false, true} → {0, 1} is the indicator function. Figure 1 illustrates two different examples of the j-th lane closest car, with j = 0. We can see that c_j^closest might not exist for some j, either because there is no car closer than d_view or because there is no such j-th lane.
We define the closeness variables δ_x,j ∈ [0, d_view], with j ∈ {−1, 0, 1}, as the distances to the j-th lane closest cars c_j^closest, if any, or d_view otherwise. Similarly, we define the relative speed variables δ_v,j ∈ [−v_max, v_max], with j ∈ {−1, 0, 1}, as the speed differences of the current car with respect to the j-th lane closest cars c_j^closest, if any, or v_max otherwise.
At each time step of the simulation, each driver observes the distance x_stop from its car to the next road element, the current lane y, the current linear speed v_x, the status s of its vehicle, the road element type e = 𝟙(p ∈ S) its car is currently on, the closeness variables δ_x,j, and the relative speed variables δ_v,j. We define each driver observation as o = (x_stop, y, v_x, s, e, δ_x,−1, δ_x,0, δ_x,1, δ_v,−1, δ_v,0, δ_v,1); therefore, o ∈ O = [0, l_max] × {1, …, w_max} × [0, v_max] × {alive, dead} × {0, 1} × [0, d_view]³ × [−v_max, v_max]³.
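A possible implementation of the observation construction is sketched below, assuming a helper that already provides the j-th lane closest cars; the names, the numeric encoding of s, and the sign convention for the relative speed are assumptions of this sketch, not part of the original paper.

```python
def observe(car, closest, d_view, v_max):
    """Build the 11-dimensional observation o for one driver (illustrative sketch).

    `closest` maps a relative lane j in {-1, 0, 1} to the j-th lane closest car, or None."""
    x_stop = car["p"].length - car["x"]
    delta_x, delta_v = {}, {}
    for j in (-1, 0, 1):
        c = closest.get(j)
        if c is None:                          # no visible car in the j-th relative lane
            delta_x[j], delta_v[j] = d_view, v_max
        else:
            # distance and speed difference w.r.t. the j-th lane closest car
            if c["p"] is car["p"]:
                delta_x[j] = c["x"] - car["x"]
            else:
                delta_x[j] = x_stop + c["x"]
            delta_v[j] = car["v_x"] - c["v_x"]
    e = 1 if car["p"].is_section else 0        # road element type: 1 = section, 0 = intersection
    s = 1 if car["s"] == "alive" else 0        # numeric encoding of the status
    return (x_stop, car["y"], car["v_x"], s, e,
            delta_x[-1], delta_x[0], delta_x[1],
            delta_v[-1], delta_v[0], delta_v[1])
```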

3.3.2. Action

Each agent action is a = (a_x, a_y) ∈ A = {−1, 0, 1} × {−1, 0, 1}. Intuitively, a_x is responsible for updating the linear speed as follows: a_x = 1 corresponds to accelerating, a_x = −1 corresponds to braking, and a_x = 0 keeps the linear speed unchanged. On the other hand, a_y is responsible for updating the lane position as follows: a_y = 1 corresponds to moving to the left lane, a_y = −1 corresponds to moving to the right lane, and a_y = 0 keeps the lane position unchanged.

3.4. Rules

A traffic rule is a tuple (b, w), where b: O → {false, true} is the rule predicate, defined on the drivers' observation space O, and w ∈ ℝ is the rule weighting factor. The i-th driver breaks a rule at a given time step t if the predicate b that defines the rule evaluates to b(o_i(t)) = true. We define a set of three rules ((b_1, w_1), (b_2, w_2), (b_3, w_3)), described in the next sections, that we use to simulate real-world traffic rules for the drivers. All drivers are subjected to the rules.

3.4.1. Intersection Rule

In this road scenario, we do not enforce any junction access negotiation protocol, nor do we consider traffic lights; cars access intersections as in Figure 2. That is, there is no explicit reason for drivers to slow down when approaching a junction, other than the chance of collisions with other cars crossing the intersection at the same time. Motivated by this lack of safety at intersections, we define a traffic rule that punishes drivers approaching or crossing an intersection at high linear speed.
In particular, a driver whose car is in an intersection (p ∈ I), or in a section (p ∈ S) in the proximity of an intersection (x_stop < 2 l_car), breaks the intersection rule (b_1, w_1) if traveling at linear speed v_x > 10:
$$b_1(o) = \begin{cases} 1 & \text{if } \left(p \in I \,\vee\, x_{\text{stop}} < 2\,l_{\text{car}}\right) \wedge v_x > 10 \\ 0 & \text{otherwise} \end{cases}$$

3.4.2. Distance Rule

Collisions may occur when traveling with insufficient distance from the car ahead, since it is difficult to predict the leading car behavior in advance. For this reason, we introduce a rule that punishes drivers that travel too close to the car ahead.
In particular, a driver observing the closest car c_0^closest on the same lane breaks the distance rule (b_2, w_2) if traveling at a linear speed v_x such that the distance traveled before arresting the vehicle is greater than δ_x,0 − l_car, or, in other words:
$$b_2(o) = \begin{cases} 1 & \text{if } \delta_{x,0} - l_{\text{car}} < \frac{v_x^2}{2\,a_{\max}} \\ 0 & \text{otherwise} \end{cases}$$

3.4.3. Right Lane Rule

In this scenario, cars might occupy any lane of a road segment, without any specific constraint. This freedom might cause drivers to unpredictably change lanes while traveling, thus endangering other drivers, who might not have the chance to avoid the oncoming collision. Motivated by these potentially dangerous behaviors, we define a rule that allows drivers to overtake when close to the car ahead, but punishes those leaving the right-most free lane on a road section.
In particular, a driver occupying a road section p ∈ S, on a non-rightmost lane y > 1, breaks the right lane rule (b_3, w_3) if the closest car on the right lane c_−1^closest is at a distance δ_x,−1 = d_view, i.e., if the right lane is free within the distance of view:
$$b_3(o) = \begin{cases} 1 & \text{if } p \in S \,\wedge\, y > 1 \,\wedge\, \delta_{x,-1} = d_{\text{view}} \\ 0 & \text{otherwise} \end{cases}$$
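The three rule predicates can be sketched as plain functions of the observation defined in Section 3.3.1; the unpacking order follows the tuple o given above, and the identification of j = −1 with the right lane is our reading of the model, not code from the paper.

```python
def b1(obs, l_car):
    """Intersection rule: at, or close to, an intersection while the linear speed exceeds 10."""
    x_stop, y, v_x, s, e, dx_r, dx_0, dx_l, dv_r, dv_0, dv_l = obs
    at_or_near_intersection = (e == 0) or (x_stop < 2 * l_car)
    return at_or_near_intersection and v_x > 10

def b2(obs, l_car, a_max):
    """Distance rule: stopping distance exceeds the gap to the closest car ahead."""
    x_stop, y, v_x, s, e, dx_r, dx_0, dx_l, dv_r, dv_0, dv_l = obs
    return dx_0 - l_car < v_x ** 2 / (2 * a_max)

def b3(obs, d_view):
    """Right lane rule: on a section, not in the rightmost lane, while the right lane is free."""
    x_stop, y, v_x, s, e, dx_r, dx_0, dx_l, dv_r, dv_0, dv_l = obs
    return e == 1 and y > 1 and dx_r == d_view
```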

3.5. Reward

Drivers are rewarded according to their linear speed, thus promoting efficiency. All cars involved in a collision, denoted by the status s = dead, are arrested after the impact, thus collecting zero reward for the next t_dead − t_coll time steps, which implicitly promotes safety. Each driver's reward at time t is:
$$r(t) = \frac{v_x(t)}{v_{\max}} - \sum_{i=1}^{3} w_i\, b_i(o(t))$$
where the w_i are the weights of the rules.
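A minimal sketch of this per-step reward, assuming the rule violations have already been evaluated (e.g., with the predicates sketched in Section 3.4):

```python
def reward(v_x, v_max, weights, violations):
    """Per-step reward: normalized speed minus the weighted sum of rule violations (sketch)."""
    return v_x / v_max - sum(w * float(b) for w, b in zip(weights, violations))

# Example: driving at 25 with v_max = 50, breaking only the distance rule, all weights equal to 1.
r = reward(25.0, 50.0, weights=(1.0, 1.0, 1.0), violations=(False, True, False))  # -> -0.5
```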

3.6. Policy Learning

Each driver's goal is to maximize the return over a simulation, defined as ∑_{t=0}^{T} γ^t r(t+1), where γ ∈ [0, 1] is the discount factor and T > 0 is the number of time steps of the simulation. The driver policy is the function π_θ: O → A that maps observations to actions. We parameterize the drivers' policy in the form of a feed-forward neural network, where θ is the set of parameters of the network. Learning the optimal policy corresponds to finding the values of θ that maximize the return over an entire simulation. We perform policy learning by means of RL.
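For clarity, the discounted return of one episode can be computed as in the following sketch:

```python
def discounted_return(rewards, gamma=0.999):
    """Discounted return sum_t gamma^t * r(t) over the per-step rewards of one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```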

4. Experiments

Our goal was to experimentally assess the impact of the traffic rules on the optimized policies, in terms of overall efficiency and safety. To this aim, we defined three tuples: the reward tuple R, the efficiency tuple E, and the collision tuple C.
The reward tuple R ∈ ℝ^{n_car} is the tuple of individual rewards collected by the drivers during an episode, from t = 0 to t = T, and is defined as:
$$R = \left( \sum_{t=0}^{T} r_1(t),\ \dots,\ \sum_{t=0}^{T} r_{n_{\text{car}}}(t) \right)$$
The efficiency tuple E ∈ ℝ^{n_car} is the tuple of sums of the individual instantaneous linear speeds v_x of each driver during an episode, from t = 0 to t = T, and is defined as:
$$E = \left( \sum_{t=0}^{T} v_{x,1}(t),\ \dots,\ \sum_{t=0}^{T} v_{x,n_{\text{car}}}(t) \right)$$
The collision tuple C ∈ ℕ^{n_car} is the tuple of individual collision counts of each driver during an episode, from t = 0 to t = T, and is defined as:
$$C = \left( \sum_{t=0}^{T} \mathbb{1}\{s_1(t-1) = \text{alive} \wedge s_1(t) = \text{dead}\},\ \dots,\ \sum_{t=0}^{T} \mathbb{1}\{s_{n_{\text{car}}}(t-1) = \text{alive} \wedge s_{n_{\text{car}}}(t) = \text{dead}\} \right)$$
Each i-th element c_i of this tuple is the number of times the i-th driver changes its car status s_i from s_i = alive to s_i = dead between two consecutive time steps t − 1 and t.
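As an illustration, the three tuples can be computed from per-step logs of rewards, speeds, and statuses as in the following NumPy sketch (the array shapes and the string encoding of the status are assumptions):

```python
import numpy as np

def episode_tuples(rewards, speeds, statuses):
    """Compute the reward tuple R, efficiency tuple E, and collision tuple C for one episode.

    rewards, speeds: float arrays of shape (T+1, n_car); statuses: array of "alive"/"dead"
    strings of shape (T+1, n_car). Illustrative sketch, not the original code."""
    R = rewards.sum(axis=0)
    E = speeds.sum(axis=0)
    # a collision is counted when a car goes from alive at t-1 to dead at t
    became_dead = (statuses[:-1] == "alive") & (statuses[1:] == "dead")
    C = became_dead.sum(axis=0)
    return R, E, C
```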
We considered two different driving scenarios in which we aimed at finding the optimal policy parameters: "no-rules", in which the traffic rule weighting factors are w_1 = w_2 = w_3 = 0, such that drivers are not punished for breaking the rules, and "rules", in which the traffic rule weighting factors are w_1 = w_2 = w_3 = 1, such that drivers are punished for breaking the rules and all the rules have the same relevance.
Moreover, we considered 2 different collision scenarios:
(a)
cars are kept with status s = dead in the road graph for t dead time steps, and then are removed; and
(b)
cars are kept with status s = dead in the road graph for t dead time steps, and then their status is changed back into s = alive .
The rationale for considering the second option is that removing collided cars after t_dead time steps may not be adequate for finding the optimal policy: it could ease the driving task for the non-collided cars when the number of collided cars grows and, on the other hand, it might provide too few collisions to learn from.
We simulated n_car cars sharing the same driver policy parameters and moving on the simple road graph in Figure 3 for T time steps. This road graph has one main intersection at the center and four three-way intersections. All road segments p ∈ S have the same length l(p) and the same number of lanes w(p). We used the model parameters shown in Table 1 and performed the simulations using Flow [36], a microscopic discrete-time continuous-space road traffic simulator that allows implementing our scenarios.
We repeated n_trial experiments in which we performed n_train training iterations in order to optimize the initial random policy parameters θ_no-rules and θ_rules. We collected the values of R, E, and C during training, across the n_trial repetitions.
We employed Proximal Policy Optimization (PPO) [37] as the RL policy optimization algorithm: PPO is a state-of-the-art actor-critic algorithm that is highly effective while being almost parameter-free. We used the PPO default configuration (https://ray.readthedocs.io/en/latest/rllib-algorithms.html) with the parameters shown in Table 2. The drivers' policy is an actor-critic neural network model in which each of the two networks has two hidden layers of 256 neurons with the hyperbolic tangent as the activation function. The hidden layer parameters are shared between the actor and the critic networks: this is a common practice, introduced by Mnih et al. [38], that helps to improve the overall performance of the model. The parameters of both the actor and the critic networks are initially distributed according to the Xavier initializer [39].
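The described architecture corresponds, structurally, to the following PyTorch sketch; the experiments reportedly used RLlib's default PPO model, so this code only illustrates the layer sizes, activation, weight sharing, and initialization, and is not the authors' implementation.

```python
import torch.nn as nn

class ActorCritic(nn.Module):
    """Actor-critic model with hidden layers shared between policy and value heads (sketch)."""
    def __init__(self, obs_dim=11, n_actions=9, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor = nn.Linear(hidden, n_actions)    # logits over the 9 discrete (a_x, a_y) pairs
        self.critic = nn.Linear(hidden, 1)           # state-value estimate
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)    # Xavier initialization of the weights
                nn.init.zeros_(m.bias)

    def forward(self, obs):
        h = self.shared(obs)                         # shared hidden representation
        return self.actor(h), self.critic(h)
```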

5. Results

Figure 4 and Figure 5 show the training results in terms of the tuples R, E, and C for the two policies θ_no-rules and θ_rules in the two collision scenarios considered.
In all experimental scenarios, the policy learned with rules shows driving behaviors that are less efficient than those obtained without rules. On the other hand, the policy learned without rules is not even as efficient as it could theoretically be, due to the high number of collisions, which makes avoiding collided cars difficult. Moreover, the values of E for the drivers subjected to the rules are distributed closer to the mean efficiency value; we attribute this to the fact that the rules limit the space of possible behaviors to a smaller one with respect to the case without rules. In other words, rules seem to favor equity among drivers.
Conversely, the policy learned with rules shows driving behaviors that are safer than those obtained without rules. This may be because training every single driver to avoid collisions based only on the efficiency reward is a difficult learning task, and because agents cannot predict the other agents' trajectories. The simple traffic rules that we designed are instead effective at improving the overall safety.
In other words, these results show that, as expected, policies learned with rules are safer but less efficient than those learned without rules. Interestingly, the rules also act as a proxy for equality, as shown in Figure 4 and Figure 5, in particular for the efficiency values E, where the blue shaded area is much thinner than the red one, meaning that all the n_car vehicles have similar efficiency.

Robustness to Traffic Level

With the aim of investigating the impact of the traffic level on the behavior observed with the learned policies (in the second learning scenario), we performed several other simulations by varying the number of cars in the road graph. For each simulation, we measured the overall traveled distance ∑_{i=1}^{n_car} E_i Δt and the overall number of collisions ∑_{i=1}^{n_car} C_i. We considered the overall sums, instead of the averages, of these indexes in order to investigate the impact of the variable number of cars in the graph: in principle, the larger this number, the longer the overall distance that can potentially be traveled and, likely, the larger the number of collisions.
We show the results of this experiment in Figure 6, where each point corresponds to the indexes observed in a simulation with a given traffic level n_car: we considered values in {10, 20, …, 80}. We repeated the same procedure for the drivers trained with and without the rules, using the same road graph on which the drivers had been trained. For each level of injected traffic, we simulated T time steps and measured the overall distance and the overall number of collisions.
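The sweep over traffic levels can be sketched as follows, assuming a hypothetical run_episode function that simulates one episode with a given policy and returns the per-driver tuples E and C:

```python
def traffic_sweep(run_episode, policies, dt=0.2, T=500, levels=range(10, 90, 10)):
    """Sweep traffic levels for each trained policy (illustrative sketch).

    `policies` maps a name (e.g., "rules", "no-rules") to trained policy parameters;
    `run_episode(policy, n_car, T)` is assumed to return the tuples (E, C) for one episode."""
    results = {}
    for name, policy in policies.items():
        for n_car in levels:
            E, C = run_episode(policy, n_car=n_car, T=T)
            # overall traveled distance (sum_i E_i * dt) and overall number of collisions
            results[(name, n_car)] = (sum(E) * dt, sum(C))
    return results
```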
As shown in Figure 6, the two policies (corresponding to learning with and without rules) exhibit very different outcomes as the injected traffic increases. In particular, the policy optimized without rules results in an overall number of collisions that increases, apparently without any bound in these experiments, as the traffic level increases. Conversely, the policy learned with the rules keeps the overall number of collisions much lower also with heavy traffic. Interestingly, the limited increase in collisions is obtained by the policy with the rules at the expense of overall traveled distance, i.e., of traveling capacity of the traffic system.
From another point of view, Figure 6 shows that a traffic system where drivers learned to comply with the rules is subject to congestion: when the traffic level exceeds a given threshold, introducing more cars in the system does not result in a longer traveled distance. Congestion is instead not visible (at least not in the range of traffic levels that we experimented with) with policies learned without rules; the resulting system, however, is unsafe. Overall, congestion acts here as a mechanism, induced by the rules applied during learning, for improving the safety of the traffic system.

6. Conclusions

We investigated the impact of imposing traffic rules while learning the policy for AI-powered drivers in a simulated road traffic system. To this aim, we designed a road traffic model that allows analyzing system-wide properties, such as efficiency and safety, and, at the same time, permits learning using a state-of-the-art RL algorithm.
We considered a set of rules inspired by real traffic rules and performed the learning with a positive reward for traveled distance and a negative reward that punishes driving behaviors that are not compliant with the rules. We performed a number of experiments and compared them with the case where rule compliance does not affect the reward function.
The experimental results show that imposing the rules during learning results in learned policies that give safer traffic. The increase in safety is obtained at the expense of efficiency, i.e., drivers travel, on average, more slowly. Interestingly, safety is also improved after the learning, i.e., when no reward, either positive or negative, is given anymore, and despite the fact that, during training, rule violations are punished but never prevented. The flexible way in which rules are taken into account is relevant because it allows the drivers to learn whether to evade a certain rule or not, depending on the current situation, and no action is prohibited by design: rules hence stand as guidelines, rather than obligations, for the drivers. For instance, a driver might have to overtake another vehicle in a situation in which overtaking is punished by the rules, if this decision is the only one that allows avoiding a forthcoming collision.
Our work can be extended in many ways. One theme of investigation is the robustness of policies learned with rules in the presence of other drivers, either AI-driven or human, who are not subjected to rules or perform risky actions. It would be interesting to assess how the driving policies learned with the approach presented in this study operate in such situations.
From a broader point of view, our findings may be useful in situations where there is a trade-off between compliance with the rules and a greater good. With the ever-increasing pervasiveness of AI-driven automation in many domains (e.g., robotics and content generation), the relevance and number of these kinds of situations will increase.

Author Contributions

Conceptualization, E.M. and J.T.; Methodology, E.M.; software, J.T.; investigation, J.T.; writing-original draft preparation, J.T.; writing-review and editing, A.B. and A.D.L.; supervision, E.M.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Howard, D.; Dai, D. Public perceptions of self-driving cars: The case of Berkeley, California. In Proceedings of the Transportation Research Board 93rd Annual Meeting, Washington, DC, USA, 12–16 January 2014; Volume 14, pp. 1–16. [Google Scholar]
  2. Skrickij, V.; Sabanovic, E.; Zuraulis, V. Autonomous Road Vehicles: Recent Issues and Expectations. IET Intell. Transp. Syst. 2020. [Google Scholar] [CrossRef]
  3. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316. [Google Scholar]
  4. Maqueda, A.I.; Loquercio, A.; Gallego, G.; García, N.; Scaramuzza, D. Event-based vision meets deep learning on steering prediction for self-driving cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5419–5427. [Google Scholar]
  5. Sharifzadeh, S.; Chiotellis, I.; Triebel, R.; Cremers, D. Learning to drive using inverse reinforcement learning and deep q-networks. arXiv 2016, arXiv:1612.03653. [Google Scholar]
  6. Jaritz, M.; De Charette, R.; Toromanoff, M.; Perot, E.; Nashashibi, F. End-to-end race driving with deep reinforcement learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 2070–2075. [Google Scholar]
  7. Bouton, M.; Nakhaei, A.; Fujimura, K.; Kochenderfer, M.J. Safe reinforcement learning with scene decomposition for navigating complex urban environments. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; pp. 1469–1476. [Google Scholar]
  8. Wang, C.; Liu, L.; Xu, C. Developing a New Spatial Unit for Macroscopic Safety Evaluation Based on Traffic Density Homogeneity. J. Adv. Transp. 2020, 2020, 1718541. [Google Scholar] [CrossRef]
  9. Qiao, Z.; Muelling, K.; Dolan, J.; Palanisamy, P.; Mudalige, P. Pomdp and hierarchical options mdp with continuous actions for autonomous driving at intersections. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2377–2382. [Google Scholar]
  10. Tram, T.; Jansson, A.; Grönberg, R.; Ali, M.; Sjöberg, J. Learning negotiating behavior between cars in intersections using deep q-learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 3169–3174. [Google Scholar]
  11. Liebner, M.; Baumann, M.; Klanner, F.; Stiller, C. Driver intent inference at urban intersections using the intelligent driver model. In Proceedings of the 2012 IEEE Intelligent Vehicles Symposium, Alcala de Henares, Spain, 3–7 June 2012; pp. 1162–1167. [Google Scholar]
  12. Isele, D.; Cosgun, A.; Subramanian, K.; Fujimura, K. Navigating intersections with autonomous vehicles using deep reinforcement learning. arXiv 2017, arXiv:1705.01196. [Google Scholar]
  13. Capasso, A.P.; Bacchiani, G.; Molinari, D. Intelligent Roundabout Insertion using Deep Reinforcement Learning. arXiv 2020, arXiv:2001.00786. [Google Scholar]
  14. Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv 2016, arXiv:1610.03295. [Google Scholar]
  15. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: London, UK, 2018. [Google Scholar]
  16. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105. [Google Scholar]
  18. Sallab, A.E.; Abdou, M.; Perot, E.; Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electron. Imaging 2017, 2017, 70–76. [Google Scholar] [CrossRef] [Green Version]
  19. Loiacono, D.; Prete, A.; Lanzi, P.L.; Cardamone, L. Learning to overtake in TORCS using simple reinforcement learning. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar]
  20. Hoel, C.J.; Wolff, K.; Laine, L. Automated speed and lane change decision making using deep reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2148–2155. [Google Scholar]
  21. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2019. [Google Scholar] [CrossRef]
  22. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. arXiv 2020, arXiv:2002.00444. [Google Scholar]
  23. Brodsky, J.S. Autonomous vehicle regulation: How an uncertain legal landscape may hit the brakes on self-driving cars. Berkeley Technol. Law J. 2016, 31, 851–878. [Google Scholar]
  24. Holstein, T.; Dodig-Crnkovic, G.; Pelliccione, P. Ethical and social aspects of self-driving cars. arXiv 2018, arXiv:1802.04103. [Google Scholar]
  25. Nyholm, S.; Smids, J. Automated cars meet human drivers: Responsible human-robot coordination and the ethics of mixed traffic. In Ethics and Information Technology; Springer: Cham, Switzerland, 2018; pp. 1–10. [Google Scholar]
  26. Kirkpatrick, K. The Moral Challenges of Driverless Cars. Commun. ACM 2015, 58, 19–20. [Google Scholar] [CrossRef]
  27. Rizaldi, A.; Althoff, M. Formalising traffic rules for accountability of autonomous vehicles. In Proceedings of the 2015 IEEE 18th International Conference on Intelligent Transportation Systems, Las Palmas, Spain, 15–18 September 2015; pp. 1658–1665. [Google Scholar]
  28. Vanholme, B.; Gruyer, D.; Lusetti, B.; Glaser, S.; Mammar, S. Highly automated driving on highways based on legal safety. IEEE Trans. Intell. Transp. Syst. 2013, 14, 333–347. [Google Scholar] [CrossRef]
  29. Medvet, E.; Bartoli, A.; Talamini, J. Road traffic rules synthesis using grammatical evolution. In Proceedings of the European Conference on the Applications of Evolutionary Computation, Amsterdam, The Netherlands, 19–21 April 2017; pp. 173–188. [Google Scholar]
  30. O’Neill, M.; Ryan, C. Grammatical evolution. IEEE Trans. Evol. Comput. 2001, 5, 349–358. [Google Scholar] [CrossRef] [Green Version]
  31. Nenzi, L.; Bortolussi, L.; Ciancia, V.; Loreti, M.; Massink, M. Qualitative and quantitative monitoring of spatio-temporal properties. In Runtime Verification; Springer: Cham, Switzerland, 2015; pp. 21–37. [Google Scholar]
  32. Bartocci, E.; Bortolussi, L.; Loreti, M.; Nenzi, L. Monitoring mobile and spatially distributed cyber-physical systems. In Proceedings of the 15th ACM-IEEE International Conference on Formal Methods and Models for System Design, Vienna, Austria, 29 September 2017; pp. 146–155. [Google Scholar]
  33. Tumova, J.; Hall, G.C.; Karaman, S.; Frazzoli, E.; Rus, D. Least-violating control strategy synthesis with safety rules. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, Philadelphia, PA, USA, 8–11 April 2013; pp. 1–10. [Google Scholar]
  34. Saunders, W.; Sastry, G.; Stuhlmueller, A.; Evans, O. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, Stockholm, Sweden, 10–15 July 2018; pp. 2067–2069. [Google Scholar]
  35. Mirchevska, B.; Pek, C.; Werling, M.; Althoff, M.; Boedecker, J. High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2156–2162. [Google Scholar]
  36. Wu, C.; Kreidieh, A.; Parvate, K.; Vinitsky, E.; Bayen, A.M. Flow: A Modular Learning Framework for Autonomy in Traffic. arXiv 2017, arXiv:1710.05465. [Google Scholar]
  37. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  38. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  39. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
Figure 1. Distance between cars in different cases.
Figure 2. Cars approaching intersections.
Figure 3. The road graph used in the experiments.
Figure 4. Training results with cars removed after t_dead time steps. Here, we draw the training values of R, E, and C at a certain training episode, averaged over n_trial experiments. We indicate with solid lines the mean of R, E, and C among the n_car vehicles, and with shaded areas their standard deviation among the n_car vehicles.
Figure 5. Training results with cars restored after t_dead time steps. Here, we draw the training values of R, E, and C at a certain training episode, averaged over n_trial experiments. We indicate with solid lines the mean of R, E, and C among the n_car vehicles, and with shaded areas their standard deviation among the n_car vehicles.
Figure 6. Overall number of collisions in the simulation against the overall traveled distance in the simulation, averaged across simulations with the same n_car. Each dot is drawn from the sum of the values computed on the n_car vehicles.
Table 1. Model and simulation parameters.
Param        | Meaning                               | Value
l_car        | Car length                            | 7
t_coll       | Impact duration                       | 10
t_dead       | Collision duration                    | 20
d_view       | Driver's view distance                | 50
v_max        | Driver's maximum speed                | 50
a_max        | Driver's acceleration (deceleration)  | 2
Δt           | Time step duration                    | 0.2
|S|          | Number of road sections               | 12
|I|          | Number of road intersections          | 9
w(p), p ∈ G  | Number of lanes                       | {1, 2}
l(p), p ∈ S  | Section length                        | 100
n_car        | Cars in the simulation                | 40
T            | Simulation time steps                 | 500
Table 2. Policy learning algorithm parameters.
Param    | Meaning                 | Value
n_trial  | Number of trials        | 20
n_train  | Training iterations     | 500
n_car    | Cars in the simulation  | 40
γ        | Discount factor         | 0.999
