Article

Real-Time Online Goal Recognition in Continuous Domains via Deep Reinforcement Learning

College of Systems Engineering, National University of Defense Technology, Changsha 410000, China
* Author to whom correspondence should be addressed.
Entropy 2023, 25(10), 1415; https://doi.org/10.3390/e25101415
Submission received: 30 August 2023 / Revised: 23 September 2023 / Accepted: 3 October 2023 / Published: 4 October 2023
(This article belongs to the Section Multidisciplinary Applications)

Abstract

The problem of goal recognition involves inferring an agent's high-level task goal from observations of its behavior in an environment. Current methods rely on offline, comparison-based inference over observed behavior in discrete environments, which presents several challenges. First, accurately modeling the behavior of the observed agent requires significant computational resources. Second, existing methods cannot accurately recognize goals in continuous simulation environments. Finally, substantial real-time computing power is required to infer the likelihood of each potential goal. In this paper, we propose an advanced and efficient real-time online goal recognition algorithm based on deep reinforcement learning in continuous domains. By modeling the observed agent's behavior offline with deep reinforcement learning, our algorithm achieves real-time goal recognition. We evaluate the algorithm's online goal recognition accuracy and stability in continuous simulation environments under communication constraints.

1. Introduction

Goal recognition is a form of intent recognition that differs from plan recognition in that it focuses on determining the high-level objective pursued by an agent rather than its specific action plan. Its aim is to infer the top-level goal of an agent from an input observation sequence and to output an explanation of that sequence [1]. Goal recognition has been applied in various fields, including intelligent assistants [2,3], autonomous driving [4,5], robot navigation [6,7], military confrontation [8], and others [9]. For example, when goal recognition is applied in a smart home assistant [2,3], the input observation sequence consists of a series of actions performed by a person as observed by a camera; a goal recognition algorithm then allows the system to identify, for instance, that an elderly person's goal is to reach the kitchen. Goal recognition thus helps systems understand and predict the intentions and behavior of agents and assists humans in completing complex tasks. As such, it has become an increasingly important area of research with practical applications in a variety of domains.
The task of goal recognition can be addressed through plan-based and learning-based intent recognition methods. Plan-based methods decompose the goal recognition problem into planning and recognition components [10]. The recognition component selects the most probable goal explanation for each observed action by invoking the planner. However, this approach is limited by the need for a well-defined domain description, sensitivity to noisy observations, and high online computational complexity. In contrast, recent research has proposed a reinforcement learning-based framework for goal recognition [11] that adopts an evaluative approach. Unlike traditional plan-based methods, which compare a generated optimal plan with the observation sequence, the reinforcement learning-based approach learns a Q-table of action values for each state through Q-learning. This effectively addresses the drawbacks of plan-based recognition methods and outputs the most likely goal by evaluating the observation sequence. However, the method has only been tested in discrete Planning Domain Definition Language (PDDL) domains [11,12] and is not suitable for continuous environments such as robot navigation, where a Q-table cannot fully describe the infinite, continuous action space. Furthermore, the method has only been evaluated in offline recognition settings [13,14]; the challenges of online recognition with incremental observation inputs and communication constraints have not yet been addressed.
This paper introduces a novel framework called Goal Recognition as Deep Reinforcement Learning (GR_DRL), which employs deep reinforcement learning to perform online goal recognition for robot navigation in a continuous domain. The framework uses a deep neural network to estimate the value of actions in a continuous action space. Combined with the cumulative Q-value measure of the GR_RL framework [11], the proposed approach enables online goal recognition in a continuous domain.
This paper presents three significant contributions. First, we extend the reinforcement learning-based goal recognition framework GR_RL proposed in [11] to the continuous domain. Our framework models the observed agent's behavior with the TD3 deep reinforcement learning algorithm, learning its behavior policy in continuous environments, with navigation serving as the example task. During goal recognition, the deep neural network of the trained model evaluates continuous observation-action sequences and performs online inference to determine the most likely goal. Second, we address various issues encountered during online recognition in real robot navigation environments, such as sensor failures, asynchronous sampling rates, and communication interference (observation noise) [13]. Third, we evaluate the online goal recognition speed of the GR_DRL algorithm in continuous navigation environments. In summary, this paper proposes an online goal recognition algorithm based on deep reinforcement learning and validates it on robot navigation tasks in a continuous domain. Our experiments demonstrate that the GR_DRL algorithm achieves excellent recognition accuracy and robustness while maintaining fast online recognition speed.
This paper is structured as follows: We begin by introducing the relevant concepts and methods of goal recognition and formally defining the online goal recognition problem in a continuous domain. We then present the framework and model design for solving goal recognition problems based on deep reinforcement learning. Subsequently, we describe the experimental environment for online robot navigation tasks in the continuous domain, outline the experimental assumptions, and provide a qualitative and quantitative analysis of the experimental results. Finally, we analyze and discuss the advantages and limitations of the experimental results and the model and identify future research directions.

2. Background

2.1. Goal Recognition Problem

Goal recognition is a crucial aspect of intention recognition, which can be classified into action, plan, and goal recognition based on the recognized level of abstraction [15]. Action recognition operates at the lowest level, taking noisy sensor signals as input and identifying the lowest-level actions, such as turning left or going straight. Goal recognition, on the other hand, operates at the highest level, taking a sequence of discrete symbols as input and identifying the top-level goal that can explain the observed sequence, such as a specific destination in a navigation task. Plan recognition serves as a bridge between action and goal recognition, also taking a sequence of discrete symbols as input. It identifies both the top-level goal that explains the observed sequence and each action that leads to the completion of the top-level goal, such as a specific destination and the ordered sequence of agent movements in a navigation task [16]. Since the top-level goal in plan recognition is usually the final goal of the task, many methods in plan recognition are also applicable to goal recognition. Therefore, our paper will introduce them together.
We first define the GR problem following the common consensus in the intention recognition literature (Meneguzzi and Pereira 2021; Mirsky, Keren, and Geib 2021) [15,17]. As formally stated in Definition 1, the GR problem is a tuple in which D represents the domain theory, G the set of potential goals, and O a sequence of observations. The objective of the GR problem is to find the goal g that best explains the observation sequence.
Definition 1.
(Goal recognition problem). A goal recognition problem is a tuple $T = \langle D, G, O \rangle$ composed of a domain theory $D$, a set of potential goals $G$, and a sequence of observations $O$. The objective of the problem is to identify a goal $g \in G$ that provides an explanation for the given observation sequence $O$.
The current differences among approaches stem primarily from how the domain theory is formulated and how the observation sequence $O$ is interpreted. For example, planning-based methods use domain knowledge derived from planning to establish the domain and convert the recognition process into a planning procedure; the most probable goal explaining the observation sequence is deduced by comparing the observation sequence with the planned sequence [10,18]. Conversely, learning-based methods primarily use historical or interactive data to acquire knowledge about the observed agent's domain [19,20].

2.2. Goal Recognition as Planning

The concept of planning-based goal recognition was first presented by Ramírez and Geffner in 2009, through their work on Plan Recognition as Planning (PRAP) [10]. Subsequent advancements in this field have consistently adopted the same fundamental principle, which involves utilizing classical planning concepts to calculate a probability distribution over a set of potential plans or goals [6,10]. The recognition process is modeled as the reverse of the planning process, which involves decomposing the traditional recognition process into two distinct components: recognition and planning. The planner is responsible for generating feasible paths, while the recognizer processes the observation sequence, invokes the planner, and computes the probability distribution.
Definition 2.
(Goal Recognition as Planning). A goal recognition as planning problem is defined by a tuple $T = \langle D_p, G, O \rangle$. Here, the planning-based domain theory $D_p$ is a tuple $D_p = \langle F, s_0, A \rangle$, where $s_0 \subseteq F$ denotes the initial state and $A$ is a set of actions. Each action $a \in A$ has preconditions $Pre(a) \subseteq F$ and lists of fluents $Add(a) \subseteq F$ and $Del(a) \subseteq F$ that describe the effects of the action $a$ in terms of fluents added to and deleted from the current state. Additionally, actions have non-negative costs $c(a)$, and the cost of a plan is defined as $c(\pi) = \sum_i c(a_i)$. Meanwhile, $G$ denotes a set of possible goals, where each goal $g \subseteq F$, and $O$ represents a sequence of observations $O = \langle O_1, \ldots, O_m \rangle$, where each observation $O_i \in A$ is an observed action.
Currently, planning-based methods for goal recognition are based on the idea of using Bayesian inference to calculate the posterior probability of a goal [10,21], assuming that the prior probabilities of each possible goal in the goal set are given in the problem:
$P(G \mid O) = \alpha\, P(O \mid G)\, P(G)$   (1)
Therefore, the focus of the goal recognition problem is transformed into an estimation problem for probabilities, and planning-based methods utilize planning techniques to address this issue.
Ramírez and Geffner (2010) [10] postulate that agents act with complete rationality, following strictly optimal plans that minimize cost to achieve their objectives. Furthermore, they assume that the probability of a goal $g \in G$ being the actual objective of the agent can be estimated from the cost difference between the optimal plan for $g$ that includes the given observation sequence $O$ and the optimal plan for $g$ that does not include $O$. This is because the plan that is not required to include $O$ is, according to the planning domain, a fully rational plan from the given initial state to the given goal $g$; therefore, the higher the cost of the optimal plan that includes $O$ relative to it, the less likely $g$ is to be the agent's true objective. The authors compute $P(O \mid g)$ as follows:
$\Delta(g) = c(O, g) - c(\overline{O}, g)$   (2)
$P(O \mid g) = \alpha \dfrac{\exp\{-\beta \Delta(g)\}}{1 + \exp\{-\beta \Delta(g)\}}$   (3)
where $\alpha$ is the normalization factor and $\Delta(g) = c(O, g) - c(\overline{O}, g)$ represents the cost difference between the optimal plan that satisfies observation $O$ for goal $g$ and the optimal plan that does not satisfy $O$ for goal $g$. The costs $c(O, g)$ and $c(\overline{O}, g)$ can be calculated using classical planning systems.
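To make the computation concrete, the following minimal Python sketch (illustrative only) turns the plan costs $c(O, g)$ and $c(\overline{O}, g)$, which would come from a classical planner not shown here, into a posterior over goals via Equations (1)-(3); the uniform priors, the $\beta$ value, and the example costs are assumptions.

```python
import math

def goal_posterior(costs, beta=1.0):
    """Posterior P(g | O) from plan costs, following Ramirez and Geffner (2010).

    costs: dict mapping each goal g to a pair (c_with_O, c_without_O), i.e. the
           optimal plan cost complying with the observations and the optimal
           plan cost avoiding them (both obtained from an external planner).
    beta:  rationality coefficient (assumed value).
    """
    likelihoods = {}
    for g, (c_with, c_without) in costs.items():
        delta = c_with - c_without                                           # Equation (2)
        likelihoods[g] = math.exp(-beta * delta) / (1.0 + math.exp(-beta * delta))  # Equation (3)
    # Uniform priors P(g) are assumed, so alpha simply normalizes (Equation (1)).
    alpha = 1.0 / sum(likelihoods.values())
    return {g: alpha * l for g, l in likelihoods.items()}

# Hypothetical costs for three candidate goals.
print(goal_posterior({"g1": (10, 10), "g2": (14, 9), "g3": (12, 12)}))
```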
Many studies have built on the foundational computational principles outlined above, utilizing automated planning techniques for goal recognition [10,18,22]. Regarding the domain assumption, Ramírez and Geffner (2011) and Oh et al. (2011) [23] investigated stochastic variability in planning domains $D_p$, using Markov models to represent this variability. The relationship between the observer and the observed agent in goal recognition can be broadly categorized into three types, namely keyhole, intended, and adversarial recognition [24]. Intended recognition refers to settings where the observed agent is conscious of the recognition process and is often cooperative with it. Adversarial recognition, on the other hand, refers to settings where the observed agent is aware of the recognition process but chooses not to cooperate with it. Keyhole recognition describes settings where the observed agent is unaware of the recognition process, and where the recognizer faces the challenge of partially observable inputs. While most planning-based goal recognition methods [7,10,25] target discrete domains, Vered and Kaminka [13,14,26] provided the first formal definition of goal recognition in continuous domains, which has since been further explored in subsequent studies [8].
Definition 3.
(Continuous and Discrete Domains). In a continuous domain model $D$, the state space $S$ is a subset of an n-dimensional Euclidean space, $S \subseteq \mathbb{R}^n$ ($n \geq 2$). This type of domain model is generally used to represent environments with two or three dimensions. The action space $A$ of a continuous domain is a discrete set of actions, which can be infinite, and encodes a transition function between states. In contrast, both the states and actions in discrete domains are discrete and finite. For example, for the navigation environment illustrated in Figure 1, the discrete domain can be represented by an action space $A = \{\text{left}, \text{right}, \text{up}, \text{down}\}$, an initial state $s_0 = \{(\text{robot loc-0-0}), (\text{box loc-4-5}), \ldots\}$, and a sequence of observations $O$, represented by a yellow dashed arrow. In contrast, the continuous domain's actions can be represented by continuous values of x, y, z, and θ. For the same navigation environment, the initial state $s_0$ can be represented as $x(r) = 0$, $y(r) = 0.5$, $z(r) = 0.1$, $\theta(r) = 45°$.
The majority of planning-based approaches [7,10] use an offline input method, which results in low efficiency when dealing with online incremental observation sequence inputs, since it necessitates repeatedly invoking the offline goal recognition algorithm. Vered and Kaminka were the first to propose an efficient online goal recognition method [13,14]. The main difference between the online and offline settings lies in how the observation sequence $O$ is provided: the offline approach provides the entire observation sequence to the recognition algorithm before execution, whereas the online approach provides the observation sequence $O$ to the goal recognition algorithm incrementally, over multiple updates.

2.3. Goal Recognition as Learning

Artificial intelligence and machine learning have advanced rapidly, leading to the emergence of a new paradigm for goal recognition based on learning theory. This paradigm can be categorized into two broad groups: model-based and model-free approaches, based on the learning method employed [27,28].
Model-based goal recognition focuses on learning the action model and domain theory of the recognizer. Amir et al. [20,29,30,31] employed various learning methods to study behavior models, but have not established a link between these models and the recognizer’s strategy. Zeng et al. [32] used inverse reinforcement learning to learn the recognizer’s reward and implemented a Markov-based goal recognition algorithm. However, for goal recognition, it is not necessary to learn the reward for the transition between all actions. To extract useful information from the image-based domain and perform goal recognition, Amado et al. [20] used a pre-trained encoder and LSTM network to represent and analyze observed state sequences, rather than relying on actions. Additionally, by training an LSTM-based system to recognize missing observations about states, Amado et al. [19] achieved improved performance for the model-based goal recognition method based on learning.
In contrast, model-free goal recognition based on learning requires no model and uses only the observed action sequence and the initial state as input. Borrajo et al. [27,33] studied goal recognition using XGBoost and LSTM neural networks, which employ only observation sequences without any domain knowledge. However, they trained a specific machine learning model for each goal recognition instance and used instance-specific datasets for training and testing. Chiari et al. [34] proposed GRNet, a recognition network based on an improved RNN architecture that requires only observed action data as input; the network outputs the probability of each goal in the planning domain and has achieved good results on discrete PDDL planning examples.
Finally, Amado et al. [11] proposed modeling behavior for goal recognition with Q-learning, combining model-free reinforcement learning with recent goal recognition algorithms.
Definition 4.
(Goal Recognition as Reinforcement Learning). A goal recognition as reinforcement learning problem is defined by a tuple $T = \langle D_l, G, O \rangle$, where the domain theory $D_l$ is of one of two types: utility-based $T_Q(G)$ or policy-based $T_\pi(G)$. A utility-based domain theory $T_Q(G)$ is represented by a tuple $(S, A, Q)$, where $Q$ is a set of Q-functions $\{Q_g \mid g \in G\}$. A policy-based domain theory $T_\pi(G)$ is represented by a tuple $(S, A, \pi)$, where $\pi$ is a set of policies $\{\pi_g \mid g \in G\}$.
They converted planning tasks described in the traditional planning language PDDL into reinforcement learning environments, allowing the recognizer's utility function or policy to be learned directly. They also proposed the utility-based $T_Q(G)$ and policy-based $T_\pi(G)$ domain theories and used the accumulated Q-value as the metric for reasoning. In the context of goal recognition, a policy-based domain theory $T_\pi(G)$ can be replaced by a utility-based domain theory $T_Q(G)$, since a softmax policy $\pi_g$ can be generated from $Q_g$ for each goal $g$, as shown in Equation (4):
$\pi_g(a \mid s) = \dfrac{Q_g(s, a)}{\sum_{a' \in A} Q_g(s, a')}$   (4)
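As a toy illustration of Equation (4), the sketch below normalizes a vector of Q-values for one goal into a policy over actions; it assumes non-negative Q-values (a true softmax would exponentiate first), and the example values are made up.

```python
import numpy as np

def policy_from_q(q_values):
    """Derive pi_g(a | s) from Q_g(s, a) by normalizing over actions (Equation (4)).

    q_values: 1-D array of Q_g(s, a) for every action a available in state s.
    Assumes the Q-values are non-negative so the ratio forms a valid distribution.
    """
    q = np.asarray(q_values, dtype=float)
    return q / q.sum()

# Hypothetical Q-values for four discrete actions in some state s.
print(policy_from_q([2.0, 1.0, 0.5, 0.5]))   # -> [0.5, 0.25, 0.125, 0.125]
```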
This approach effectively leverages reinforcement learning techniques to acquire domain models and mitigate the impact of noise on goal recognition by employing an evaluative strategy, thereby enhancing the speed of goal identification. However, the existing framework is constrained to finite and discrete action spaces, rendering it inadequate for addressing the challenges posed by infinite state and action spaces in continuous domains. To overcome this limitation, this study introduces a novel goal recognition framework based on deep reinforcement learning tailored for online continuous environments. The proposed framework investigates the learning process in continuous domains characterized by infinite action and state spaces using deep reinforcement learning. Furthermore, it explores methods for providing the most credible goal explanations for observed sequences. The accompanying Figure 2 elucidates the fundamental disparities between continuous and discrete domains within the context of the reinforcement learning framework.

3. The Goal Recognition as Deep Reinforcement Learning Framework

Our framework comprises two integral modules. The first trains a policy evaluation network offline through interactive learning. The second incrementally incorporates observation sequences while performing online goal inference and recognition. The workflow of the framework is depicted in Figure 2.
Figure 3 illustrates the algorithmic flow framework for offline training and online inference proposed for GR as DRL.
The first module takes as inputs a training map, the continuous action space $A$, and the state space $S$ corresponding to the observed agent. It generates a domain-theoretic representation $T_\pi(G)$, which is refined by training the policy evaluation network.
The second module encompasses an online incremental goal recognition procedure that progresses with each discrete time step. It continuously receives the latest observation values, represented as $\langle s, a \rangle$, from the observation sequence. A potential goal set $G$ and the updated observation data, denoted as $O = \langle s_0, a_0, s_1, a_1, \ldots \rangle$, are then fed into the online goal recognition module, which determines the most probable goal $g^*$, i.e., the goal that provides the most coherent explanation for the observation sequence $O$.
In the online inference model, we employ a distance evaluation method, akin to planning-based approaches, as demonstrated in Equation (5).
$g^* = \arg\min_{g \in G} \text{Distance}(Q^{\pi_g}, O)$   (5)
The policy evaluation network $Q^{\pi_g}$, trained offline through reinforcement learning, generates Q-values for every possible goal $g$ in the goal set $G$ based on the observation sequence $O$. Under the assumption of a rational agent, the goal $g^*$ with the highest cumulative Q-value is considered the most likely.
Module One: Offline Training. We initiate the offline training phase by training the behavioral policy of the intelligent agent within a continuous domain. This involves planning navigation tasks for randomly selected goals $g$ within the training map, with the objective of developing a rational navigation agent proficient in continuous domains. As outlined in Algorithm 1, this procedure results in a domain-theoretic representation $T_\pi(G)$ based on the network architecture.
To commence training, various training parameters must be initialized, including the reward function $r(s, a)$ and the number of iterations $T$ (lines 1-3), as in other deep reinforcement learning algorithms. During each training iteration, we randomly generate goal points within the map and compute reward values for states $S$ and actions $A$ according to the defined reward function. We then update the policy network to generate the domain-theoretic representation $T_\pi(G)$.
Module Two: Online Inference. The policy evaluation network $Q^{\pi_g}$ acquired through offline training, together with the state space $S$ and action space $A$, forms the domain-theoretic representation $T_\pi(G)$. As described in Algorithm 2, during online inference, as the observation sequence $O$ is incrementally updated, we evaluate all goals in the potential goal set $G$ using the domain-theoretic representation. We then update the goal $g^*$ that best fits the observation sequence according to the distance measure in Equation (5).
This evaluation-based approach, aided by the policy evaluation network, effectively mitigates the impact of noise and missing data in the observation sequence compared with the generative methods employed by planning-based goal recognition approaches.
Algorithm 1 Offline Training for Domain Theory
Require: S, A: state and action spaces in the continuous domain
Require: map: a map used for navigation tasks in the continuous domain
 1: Initialize the DRL (deep reinforcement learning) parameters γ
 2: Initialize the reward function r(s, a)
 3: Initialize the number of training iterations T
 4: for t = 1 to T do
 5:     Randomly generate a goal g in the map
 6:     ∀a, ∀s ≠ g: r(s, a) ← 0
 7:     ∀a, s = g: r(g, a) ← C
 8:     Q^{π_g} ← DRL(S, A, γ)
 9: end for
10: return T_π(G)
Algorithm 2 Online Inference of the Most Likely Goal for the Observations
Require: T_π(G): state space S and action space A in the continuous domain, and policy evaluation networks Q^{π_g}
Require: G: a set of candidate goals
Require: O: an observation sequence O = ⟨s_0, a_0, s_1, a_1, …⟩
 1: Initialize the minimum distance δ*
 2: while the observation sequence O updates do
 3:     for g ∈ G do
 4:         δ ← Distance(Q^{π_g}, O)
 5:         if δ ≤ δ* then
 6:             g* ← g and δ* ← δ
 7:         end if
 8:     end for
 9: end while
10: return g*
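A minimal Python rendering of Algorithm 2 is sketched below. The `q_networks` mapping, the streaming interface, and the `distance` callable are placeholders for the trained policy evaluation networks $Q^{\pi_g}$, the incremental observation input, and the Distance measure of Equation (5); unlike the pseudocode, the sketch re-evaluates every candidate goal from scratch at each update, which is a design choice rather than the authors' exact procedure.

```python
def online_goal_inference(q_networks, observation_stream, distance):
    """Incrementally infer the most likely goal as observations arrive (Algorithm 2).

    q_networks:         dict {goal g: policy evaluation network Q^{pi_g}} from offline training.
    observation_stream: iterable yielding (s, a) pairs one at a time.
    distance:           callable (q_net, observations) -> score; smaller means a better fit.
    """
    observations = []
    for s, a in observation_stream:                # online, incremental input of O
        observations.append((s, a))
        best_goal, best_delta = None, float("inf")
        for g, q_net in q_networks.items():        # evaluate every candidate goal
            delta = distance(q_net, observations)
            if delta <= best_delta:
                best_goal, best_delta = g, delta
        yield best_goal                            # current best explanation g*
```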

4. Goal Recognition as TD3

This section provides a comprehensive illustration of an online goal recognition algorithm in a continuous domain built on the Twin Delayed Deep Deterministic Policy Gradient (TD3) deep reinforcement learning algorithm. We systematically introduce the key principles of TD3, the essential parameter configurations for training it, the rationale behind the reward function design, the convergence analysis of TD3 training, and the measure used to evaluate the policy evaluation network $Q^{\pi_g}$ against the observation sequence $O$.

4.1. Basic Principles of the TD3 Algorithm

The TD3 algorithm is an actor-critic reinforcement learning algorithm that augments deep Q-learning with a policy network that directly outputs continuous actions. It improves on the Deep Deterministic Policy Gradient (DDPG) algorithm and is specifically designed for environments with continuous action spaces.
Figure 4 illustrates the key characteristic of TD3, which lies in the use of twin critic networks. The algorithm simultaneously learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by minimizing the mean squared error. Both Q-functions regress to a single shared target, which is taken as the smaller of the two target Q-function outputs, as described below:
$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,\mathrm{targ}}}\!\left(s', a_{\mathrm{TD3}}(s')\right)$   (6)
Additionally, TD3 incorporates the concept of smoothing to enhance its performance. This is achieved by introducing noise to the target actions, which helps in smoothing the Q-values with respect to changes in actions. By doing so, TD3 makes it more difficult for the policy to exploit errors in the Q-function. The fundamental principle behind target policy smoothing can be summarized as follows:
$a_{\mathrm{TD3}}(s') = \mathrm{clip}\!\left(\mu_{\theta,\mathrm{targ}}(s') + \mathrm{clip}(\epsilon, -c, c),\ a_{\mathrm{low}},\ a_{\mathrm{high}}\right)$   (7)
where $\epsilon$ is noise sampled from a normal distribution, i.e., $\epsilon \sim \mathcal{N}(0, \sigma)$. Target policy smoothing acts as a regularization technique.
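The target computation of Equations (6) and (7) can be sketched in plain NumPy as follows; the target critics and target actor are stand-in callables, and the noise scale, clip bound, and action limits are assumed values rather than the ones used in this paper.

```python
import numpy as np

def td3_target(r, s_next, done, q1_targ, q2_targ, mu_targ,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0):
    """TD3 target y(r, s', d) with clipped double-Q and target policy smoothing.

    q1_targ, q2_targ: target critics, callables (s, a) -> Q-value.
    mu_targ:          target actor, callable s -> action.
    """
    mu = np.asarray(mu_targ(s_next), dtype=float)
    eps = np.clip(np.random.normal(0.0, sigma, size=mu.shape), -c, c)
    a_smooth = np.clip(mu + eps, a_low, a_high)                          # Equation (7)
    q_min = np.minimum(q1_targ(s_next, a_smooth), q2_targ(s_next, a_smooth))
    return r + gamma * (1.0 - done) * q_min                              # Equation (6)
```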

4.2. Parameter Design of the TD3 Algorithm’s Basic Structure

In the training process of the TD3 reinforcement learning algorithm within the simulation environment, Figure 4 shows the inputs used, namely the polar coordinates of the agent's goal, the angular velocity $\omega$, and the linear velocity $v$. These are combined to form the input state $s$ of the TD3 actor network. The actor network consists of two fully connected (FC) layers, each followed by Rectified Linear Unit (ReLU) activation. The final layer is connected to the output layer, which produces two action parameters: $a_1$, representing the linear velocity, and $a_2$, representing the angular velocity of the robot. The output layer employs a hyperbolic tangent ($\tanh$) activation function to constrain the values to the range $(-1, 1)$. Before the action is applied to the environment, it is scaled by the maximum linear velocity $v_{max}$ and the maximum angular velocity $\omega_{max}$ so that the agent moves in the forward direction:
$a = \left[ v_{max}\,\dfrac{a_1 + 1}{2},\ \omega_{max}\, a_2 \right]$   (8)
The TD3 algorithm evaluates the Q-value $Q(s, a)$ of a given state-action pair using two critic networks. These critic networks share the same structure, but their parameter updates are delayed, allowing their parameter values to diverge. For each critic network, the state-action pair $(s, a)$ is provided as input. The state $s$ is passed through a fully connected layer followed by ReLU activation, yielding the output $L_s$. The output $L_s$ and the action $a$ are each fed into a further fully connected transformation layer, denoted $\tau_1$ and $\tau_2$, respectively. The outputs of these transformation layers are then combined as follows:
$L_c = L_s W_{\tau_1} + a W_{\tau_2} + b_{\tau_2}$   (9)
Here, $L_c$ represents the combined fully connected (CFC) layer, $W_{\tau_1}$ and $W_{\tau_2}$ are the weights of the $\tau_1$ and $\tau_2$ transformation layers, respectively, and $b_{\tau_2}$ is the bias of the $\tau_2$ layer. After the transformation-layer outputs are combined, a Rectified Linear Unit (ReLU) activation is applied to the combined layer, which is then connected to the output layer producing the Q-value. To mitigate overestimation of state-action values, the minimum of the two critic networks' Q-values is taken as the final critic output. The complete network architecture, including the actor network and the twin critic networks, is shown in Figure 4.
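The following PyTorch sketch shows one possible reading of this architecture, consistent with Equations (8) and (9); the hidden-layer widths, $v_{max}$, and $\omega_{max}$ are assumptions, and the sketch is not the authors' implementation.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """State -> two FC+ReLU layers -> tanh-bounded (a1, a2); hidden width is an assumed value."""
    def __init__(self, state_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.Tanh(),   # a1 (linear vel.), a2 (angular vel.), both in (-1, 1)
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """One critic head combining a state branch and an action branch as in Equation (9)."""
    def __init__(self, state_dim, action_dim=2, hidden=512):
        super().__init__()
        self.state_fc = nn.Linear(state_dim, hidden)        # produces L_s (after ReLU)
        self.tau1 = nn.Linear(hidden, hidden, bias=False)   # W_tau1, applied to L_s
        self.tau2 = nn.Linear(action_dim, hidden)           # W_tau2 and b_tau2, applied to a
        self.out = nn.Linear(hidden, 1)                     # Q-value

    def forward(self, state, action):
        l_s = torch.relu(self.state_fc(state))
        l_c = torch.relu(self.tau1(l_s) + self.tau2(action))   # Equation (9), then ReLU
        return self.out(l_c)

def scale_action(a, v_max, w_max):
    """Map tanh outputs to robot commands, Equation (8): v = v_max*(a1+1)/2, w = w_max*a2."""
    return torch.stack([v_max * (a[..., 0] + 1.0) / 2.0, w_max * a[..., 1]], dim=-1)

# TD3 instantiates two such critics; the smaller of their outputs is used as the Q estimate.
```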
The design of the reward function covers three distinct scenarios: if the distance to the goal at the current time step $t$ falls below the threshold $\eta$, a positive goal reward $r_{goal}$ is given; if a collision is detected, a negative collision reward $r_{coll}$ is given; otherwise, an immediate reward is assigned based on the current linear velocity $v$ and angular velocity $\omega$:
$r(s_t, a_t) = \begin{cases} r_{goal} & \text{if } d_t < \eta \\ r_{coll} & \text{if collision} \\ v - |\omega| & \text{otherwise} \end{cases}$   (10)
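A minimal sketch of Equation (10); the threshold $\eta$ and the reward magnitudes are placeholder values, not the ones used in training.

```python
def reward(dist_to_goal, collided, v, omega,
           eta=0.3, r_goal=100.0, r_coll=-100.0):
    """Sparse goal/collision reward with a dense velocity shaping term (Equation (10))."""
    if dist_to_goal < eta:     # close enough to the goal
        return r_goal
    if collided:               # collision detected
        return r_coll
    return v - abs(omega)      # favour forward motion, penalize spinning
```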
The policy network trained by the TD3 algorithm serves as the policy evaluation network $Q^{\pi_g}$ required for the goal recognition process. We employ the measure proposed in [11] to assess the distance between $Q^{\pi_g}$ and an observation sequence $O$: $MaxUtil$ represents the cumulative utility gathered along the observed trajectory.
$MaxUtil(Q^{\pi_g}, O) = \sum_{i \in O} Q^{\pi_g}(s_i, a_i)$   (11)
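Equation (11) transcribes directly into code and, negated, can serve as the Distance measure in the Algorithm 2 sketch of Section 3 (so that the arg min of Equation (5) selects the highest-utility goal); `q_net` is a placeholder callable.

```python
def max_util(q_net, observations):
    """MaxUtil(Q^{pi_g}, O): cumulative Q-value over the observed (s, a) pairs, Equation (11)."""
    return sum(q_net(s, a) for s, a in observations)

def neg_max_util(q_net, observations):
    """Distance form for Equation (5): higher cumulative utility means smaller distance."""
    return -max_util(q_net, observations)
```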

5. Experiment Evaluation

5.1. Offline Training of the TD3 Algorithm in ROS

In order to implement the goal recognition algorithm GR_DRL in an online navigation environment in a continuous domain, we conducted experiments on the ROS-based Gazebo platform. Initially, as depicted in Figure 5, training was carried out following the network architecture introduced in Section 4. Based on the configured training parameters, we trained for a set number of epochs. As illustrated in Figure 6, we report the average and maximum Q-values as well as the reward values. It can be observed that training converged, and the policy network obtained from this training serves as the policy evaluation network $Q^{\pi_g}$ required for the goal recognition process.

5.2. Testing in a Continuous Domain

Based on the policy evaluation network $Q^{\pi_g}$ obtained through training, we acquired the domain-theoretic representation $T_\pi(G)$ for navigation tasks in a continuous domain. Subsequently, in the testing map shown in Figure 7a, we established a goal set $G$ containing five potential goals $(G_1, G_2, G_3, G_4, G_5)$. We conducted tests to assess the recognition accuracy and online recognition speed of the deep reinforcement learning-based goal recognition algorithm in an online continuous environment, considering environmental dynamics, partially observable observation sequences, and noise interference in the observation sequences.
To evaluate recognition accuracy, we used common machine learning performance metrics: accuracy, precision, recall, and F1 score. These metrics are computed from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) with respect to the actual class labels and the corresponding model predictions, as given by the formulas below.
Accuracy is defined as the ratio of correctly classified samples to the total number of samples:
$ACC = \dfrac{TP + TN}{TP + TN + FP + FN}$   (12)
Precision is defined as the ratio of true positives to the total number of predicted positive samples:
$Pre = \dfrac{TP}{TP + FP}$   (13)
Recall is defined as the ratio of true positives to the total number of actual positive samples:
$Rec = \dfrac{TP}{TP + FN}$   (14)
The F1 score is the harmonic mean of precision and recall:
$F1 = \dfrac{2 \times Pre \times Rec}{Pre + Rec}$   (15)
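For completeness, a small helper computing the four metrics of Equations (12)-(15) from raw confusion counts (the zero-division guards are a practical assumption, not part of the original formulas):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score from confusion-matrix counts, Equations (12)-(15)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    return acc, pre, rec, f1
```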

5.2.1. Testing under Partial Observability

This study examines the impact of communication constraints that cause missing observations during online goal recognition. Specifically, we investigate two distinct sources of partial observability: reduced sampling rates and sensor failures. The key difference is that a reduced sampling rate causes uniformly distributed missing observations, whereas a temporary sensor failure causes a continuous period of missing data. To evaluate this comprehensively, we conducted experiments at five levels of observability (5%, 10%, 30%, 50%, and full observability). Additionally, random obstacles were introduced to assess the algorithm's robustness in both dynamic and static environmental conditions. Each experiment was repeated 100 times, and the resulting data are presented in Table 1. These results provide insight into the algorithm's performance under communication constraints at varying levels of observability.
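The two kinds of missingness can be simulated as in the sketch below (our own illustration, not the experimental code): a reduced sampling rate drops observations uniformly at random, whereas a sensor failure removes one contiguous block.

```python
import random

def subsample(observations, keep_ratio):
    """Uniform missingness from a reduced sampling rate: keep roughly keep_ratio of O."""
    return [o for o in observations if random.random() < keep_ratio]

def sensor_failure(observations, keep_ratio):
    """Contiguous missingness from a temporary sensor failure: drop one block of O."""
    n = len(observations)
    gap = int(n * (1.0 - keep_ratio))
    start = random.randint(0, max(n - gap, 0))
    return observations[:start] + observations[start + gap:]
```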
According to the observations, the recognition algorithm demonstrates remarkable accuracy, surpassing 90%, when confronted with both types of observation missingness (varying sampling rates and sensor failures) in static environments. Even in dynamic environments, where some challenges are present, the algorithm exhibits a commendable recognition accuracy of over 80%, showcasing its robustness in coping with the complexities of changing environmental conditions.

5.2.2. Testing under Observation Sequence Noise

In online goal recognition tasks, communication frequently contends with varying degrees of noise interference. Conventional communication noise is typically categorized into three primary classes: Gaussian white noise, Laplace noise, and Poisson noise. In this study, recognition accuracy was assessed on observation sequences with different signal-to-noise ratios (SNRs), under two observability conditions (50% and full observability) and in both dynamic and static environments. The SNR is calculated as follows:
$db = 10 \log \dfrac{s}{n}$   (16)
In this context, s represents the signal, and n represents the noise. The signal-to-noise ratio (SNR) is measured in decibels (dB) and is computed as follows:
$db = 10 \log \dfrac{P_s}{P_n}$   (17)
where $P_s$ and $P_n$ denote the effective power of the signal and the noise, respectively. In general, a higher signal-to-noise ratio (SNR) corresponds to higher audio playback quality, indicating that less noise is mixed with the signal; conversely, a lower SNR results in lower audio quality. To ensure the quality of sound propagation, the SNR should typically not fall below 70 dB. However, to simulate strongly adversarial scenarios, the experiments in this section use the extreme cases of 80 dB, 50 dB, and 10 dB. The experimental results are presented in Table 2.
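Gaussian noise at a prescribed SNR can be injected as in the following sketch, which derives the noise power from the dB definition above; treating the observation signal as a plain NumPy array is an assumption about the data format.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db):
    """Corrupt a signal with white Gaussian noise so that 10*log10(Ps/Pn) equals snr_db."""
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)                    # effective signal power P_s
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))     # required noise power P_n
    noise = np.random.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise
```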
Across the tests involving the three types of noise, in both dynamic and static environmental conditions, the recognition accuracy consistently remains at or above 60%. Furthermore, under the various levels of SNR interference, the overall accuracy exhibits relatively minor fluctuations. Even in the extreme scenario with 50% observability and an SNR of 10 dB, recognition accuracy under Gaussian, Laplace, and Poisson noise reaches 88%, 78%, and 61% or higher, respectively.

5.2.3. Online Recognition Speed Testing

The recognition speed of online goal recognition, especially in the presence of communication interference, is an essential metric. We tested the online recognition speed under two environmental conditions (static and dynamic) for five levels of partial observability (5%, 10%, 30%, 50%, and full observability), where observability is reduced by varying the sampling rate. In this study, the successful recognition time was defined as the time taken until the true goal is correctly identified for three consecutive time steps. The experimental results are presented in Table 3.
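One plausible reading of this timing rule is sketched below; the prediction and timestamp sequences are placeholders.

```python
def recognition_time(predictions, times, true_goal, streak_needed=3):
    """Time at which the true goal has been predicted for `streak_needed` consecutive
    steps, or None if recognition never stabilizes.

    predictions: per-step predicted goals g*;  times: matching timestamps in seconds.
    """
    streak = 0
    for pred, t in zip(predictions, times):
        streak = streak + 1 if pred == true_goal else 0
        if streak >= streak_needed:
            return t
    return None
```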
Additionally, we evaluated the online recognition speed under the influence of Gaussian noise interference at three different signal-to-noise ratios (10 dB, 50 dB, 80 dB). These tests were conducted for two observability levels (50% and full observability). The experimental results are shown in Table 4.
According to Table 3, it is evident that the impact of five different levels of missing observation sequences on online recognition speed remains within one second. Furthermore, the influence of environmental conditions, both static and dynamic, on recognition speed is also quite limited.
Moreover, from Table 4, it can be observed that the impact of Gaussian noise under different observability levels (50% and full observability) on recognition speed can be considered negligible. Similarly, the influence of noise at different signal-to-noise ratios (10 dB, 50 dB, 80 dB) on recognition speed is kept within 1 second. Even in scenarios with partial observability and Gaussian noise interference, the recognition accuracy still maintains a level exceeding 90%.

6. Discussion and Conclusions

In this paper, we presented a continuous-domain online real-time goal recognition algorithm based on the TD3 deep reinforcement learning algorithm. The algorithm was validated through experiments conducted on the ROS-based Gazebo simulation platform. We examined the robustness of the goal recognition algorithm under various communication constraints, including partial observability and observation noise, and also evaluated the online recognition speed.
Our approach extends traditional reinforcement learning-based goal recognition algorithms to the continuous domain, addressing the challenge of representing all action-state pairs when the action and state spaces are infinite. By employing deep reinforcement learning, we learn the agent's policy evaluation network $Q^{\pi_g}$, which, together with the continuous action space $A$ and state space $S$, forms a neural-network-based domain-theoretic representation $T_\pi(G)$. This representation captures the Q-values of all action-state pairs in continuous spaces. We then employ the $MaxUtil$ measure proposed in Goal Recognition as Reinforcement Learning [11] to infer the most likely goal $g^*$ based on the maximum cumulative Q-value.
Differing from generative approaches proposed by planning-based goal recognition methods, our reinforcement learning-based algorithm adopts an evaluative approach, implicitly mitigating the impact of observation noise and sequence missingness on recognition accuracy. Our experiments also demonstrated the algorithm’s excellent resistance to disturbances and robustness in real-world, communication-constrained scenarios. Furthermore, our offline learning and online inference approach significantly enhances the speed of online recognition, compared to planning-based recognition methods that require re-invoking planners for every new observation sequence.
In summary, our work represents a significant advancement in the field of goal recognition. While this paper focuses on goal recognition in continuous navigation environments, the proposed approach can be applied to a wide range of tasks in continuous real-world domains, such as goal inference for robotic arm grasping or for fighter aircraft tactical maneuvers. As long as deep reinforcement learning is used to train a policy network model tailored to the respective task during offline learning, and the framework introduced in this paper is used to construct a neural-network-based domain-theoretic representation $T_\pi(G)$, various online goal recognition tasks can be addressed. The works in [4,5] demonstrate the broad prospects of goal recognition in the field of autonomous driving. The approach presented in this paper performs well on a continuous-domain platform based on Gazebo; future research could test and apply it on autonomous driving simulation platforms such as CARLA.

Author Contributions

Conceptualization, Z.F. and K.X.; methodology, Z.F. and Y.Z.; validation, Z.F. and D.C.; formal analysis, Z.F.; investigation, K.X.; data curation, Z.F.; writing—original draft preparation, Z.F.; writing—review and editing, Z.F., D.C. and T.W.; supervision, K.X. and T.W.; funding acquisition, K.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (grant number 62103420).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data supporting the reported results can be found at https://github.com/pucrs-automated-planning/graql_aaai; the deep reinforcement learning algorithms can be found at https://github.com/datawhalechina/easy-rl; and the ROS experimental environment can be found at https://www.theconstructsim.com/robotigniteacademy_learnros/ros-courses-library/using-openai-with-ros-online-course/ (all accessed on 29 August 2023).

Acknowledgments

We thank Felipe Meneguzzi of the University of Aberdeen for support with the open source code.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GR_DRL   Goal Recognition as Deep Reinforcement Learning
GR       Goal Recognition
PRAP     Plan Recognition as Planning
PDDL     Planning Domain Definition Language
XGBoost  eXtreme Gradient Boosting
LSTM     Long Short-Term Memory network
TD3      Twin Delayed Deep Deterministic Policy Gradient
DDPG     Deep Deterministic Policy Gradient
FC       Fully Connected
ReLU     Rectified Linear Unit
CFC      Combined Fully Connected layer
SNR      Signal-to-Noise Ratio

References

  1. Sukthankar, G.; Goldman, R.; Geib, C.; Pynadath, D.; Bui, H. Plan, Activity, and Intent Recognition: Theory and Practice; Elsevier: Amsterdam, The Netherlands, 2014; pp. 1–385. [Google Scholar]
  2. Geib, C.W. Problems with Intent Recognition for Elder Care. In Proceedings of the AAAI-02 Workshop “Automation as Caregiver”, Edmonton, AB, Canada, 29 July 2002; pp. 13–17. [Google Scholar]
  3. Granada, R.; Pereira, R.F.; Monteiro, J.; Barros, R.; Ruiz, D.; Meneguzzi, F. Hybrid Activity and Plan Recognition for Video Streams. In Proceedings of the 31st AAAI Conference: Plan, Activity and Intent Recognition Workshop, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  4. Brewitt, C.; Gyevnar, B.; Garcin, S.; Albrecht, S.V. GRIT: Fast, Interpretable, and Verifiable Goal Recognition with Learned Decision Trees for Autonomous Driving. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  5. Brewitt, C.; Tamborski, M.; Albrecht, S.V. Verifiable Goal Recognition for Autonomous Driving with Occlusions. arXiv 2022, arXiv:2206.14163. [Google Scholar]
  6. Xu, K.; Yin, Q. Goal Identification Control Using an Information Entropy-Based Goal Uncertainty Metric. Entropy 2019, 21, 299. [Google Scholar] [CrossRef] [PubMed]
  7. Sohrabi, S.; Riabov, A.V.; Udrea, O. Plan Recognition as Planning Revisited. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), New York, NY, USA, 9–15 July 2016; pp. 3258–3264. [Google Scholar]
  8. Fitzpatrick, G.; Lipovetzky, N.; Papasimeon, M.; Ramirez, M.; Vered, M. Behaviour Recognition with Kinodynamic Planning Over Continuous Domains. Front. Artif. Intell. 2021, 4, 717003. [Google Scholar] [CrossRef] [PubMed]
  9. Wayllace, C.; Ha, S.; Han, Y.; Hu, J.; Monadjemi, S.; Yeoh, W.; Ottley, A. DRAGON-V: Detection and Recognition of Airplane Goals with Navigational Visualization. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13642–13643. [Google Scholar] [CrossRef]
  10. Ramirez, M.; Geffner, H. Plan Recognition as Planning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009; p. 6. [Google Scholar]
  11. Amado, L.; Mirsky, R.; Meneguzzi, F. Goal Recognition as Reinforcement Learning. arXiv 2022, arXiv:2202.06356. [Google Scholar] [CrossRef]
  12. Silver, T.; Chitnis, R. PDDLGym: Gym Environments from PDDL Problems. arXiv 2020, arXiv:2002.06432. [Google Scholar]
  13. Vered, M.; Kaminka, G.A. Online Recognition of Navigation Goals Through Goal Mirroring (Extended Abstract). In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017; p. 3. [Google Scholar]
  14. Vered, M.; Kaminka, G.A. Heuristic Online Goal Recognition in Continuous Domains. arXiv 2017, arXiv:1709.09839. [Google Scholar]
  15. Meneguzzi, F.; Fraga Pereira, R. A Survey on Goal Recognition as Planning. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 19–26 August 2021; pp. 4524–4532. [Google Scholar] [CrossRef]
  16. Van-Horenbeke, F.A.; Peer, A. Activity, Plan, and Goal Recognition: A Review. Front. Robot. AI 2021, 8, 643010. [Google Scholar] [CrossRef] [PubMed]
  17. Mirsky, R.; Keren, S.; Geib, C. Introduction to Symbolic Plan and Goal Recognition; Morgan & Claypool Publishers: San Rafael, CA, USA, 2021; pp. 1–100. [Google Scholar]
  18. Pereira, R.F.; Oren, N.; Meneguzzi, F. Landmark-Based Approaches for Goal Recognition as Planning. Artif. Intell. 2020, 279, 103217. [Google Scholar] [CrossRef]
  19. Amado, L.; Paludo Licks, G.; Marcon, M.; Fraga Pereira, R.; Meneguzzi, F. Using Self-Attention LSTMs to Enhance Observations in Goal Recognition. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar] [CrossRef]
  20. Amado, L.; Pereira, R.F.; Aires, J.; Magnaguagno, M.; Granada, R.; Meneguzzi, F. Goal Recognition in Latent Space. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8. [Google Scholar] [CrossRef]
  21. Zhi-Xuan, T.; Mann, J.; Silver, T.; Tenenbaum, J.; Mansinghka, V. Online Bayesian Goal Inference for Boundedly Rational Planning Agents; Curran Associates, Inc.: Dutchess County, NY, USA, 2020; Volume 33, pp. 19238–19250. [Google Scholar]
  22. Masters, P.; Sardina, S. Cost-Based Goal Recognition for Path-Planning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, São Paulo, Brazil, 8–12 May 2017; p. 9. [Google Scholar]
  23. Oh, J.; Meneguzzi, F.; Sycara, K.; Norman, T.J. An Agent Architecture for Prognostic Normative Reasoning. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011. [Google Scholar]
  24. Avrahami-Zilberbrand, D.; Kaminka, G.A. Keyhole Adversarial Plan Recognition for Recognition of Suspicious and Anomalous Behavior; Elsevier: Amsterdam, The Netherlands, 2014; pp. 87–119. [Google Scholar] [CrossRef]
  25. Xu, K.; Zeng, Y.; Qin, L.; Yin, Q. Single Real Goal, Magnitude-Based Deceptive Path-Planning. Entropy 2020, 22, 88. [Google Scholar] [CrossRef] [PubMed]
  26. Kaminka, G.; Vered, M.; Agmon, N. Plan Recognition in Continuous Domains. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar] [CrossRef]
  27. Borrajo, D.; Gopalakrishnan, S.; Potluru, V.K. Goal Recognition via Model-Based and Model-Free Techniques. arXiv 2020, arXiv:2011.01832. [Google Scholar]
  28. Masters, P.; Vered, M. What’s the Context? Implicit and Explicit Assumptions in Model-Based Goal Recognition. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 4516–4523. [Google Scholar] [CrossRef]
  29. Amir, E.; Chang, A. Learning Partially Observable Deterministic Action Models. J. Artif. Intell. Res. 2008, 33, 349–402. [Google Scholar] [CrossRef]
  30. Asai, M.; Muise, C. Learning Neural-Symbolic Descriptive Planning Models via Cube-Space Priors: The Voyage Home (to STRIPS). arXiv 2020, arXiv:2004.12850. [Google Scholar]
  31. Juba, B.; Le, H.S.; Stern, R. Safe Learning of Lifted Action Models. arXiv 2021, arXiv:2107.04169. [Google Scholar]
  32. Zeng, Y.; Xu, K.; Yin, Q.; Qin, L.; Zha, Y.; Yeoh, W. Inverse Reinforcement Learning Based Human Behavior Modeling for Goal Recognition in Dynamic Local Network Interdiction. In Proceedings of the AAAI Workshops, New Orleans, LA, USA, 2–7 February 2018; p. 7. [Google Scholar]
  33. Durga, K.M.L.; Jyotsna, P.; Kumar, G.K. A Deep Learning based Human Activity Recognition Model using Long Short Term Memory Networks. In Proceedings of the 2022 International Conference on Sustainable Computing and Data Communication Systems (ICSCDS), Erode, India, 7–9 April 2022; pp. 1371–1376. [Google Scholar] [CrossRef]
  34. Chiari, M.; Gerevini, A.E.; Putelli, L.; Percassi, F.; Serina, I. Goal Recognition as a Deep Learning Task: The GRNet Approach. arXiv 2022, arXiv:2210.02377. [Google Scholar] [CrossRef]
Figure 1. Continuous and discrete domains for the navigation environment: (a) discrete domain; (b) continuous domain.
Figure 2. The online goal recognition framework based on reinforcement learning in continuous domains. The red text highlights the key characteristics of the goal recognition algorithm in online continuous domains, particularly the comparison between discrete and continuous domains regarding action space and evaluation methods.
Figure 3. The proposed framework for GR as DRL.
Figure 4. Network structure diagram of the TD3 algorithm.
Figure 5. Navigation environment in ROS.
Figure 6. Q-value convergence curve. The x-axis represents the number of training iterations (in rounds), while the y-axis represents the reward Q-value.
Figure 7. Testing in a continuous domain.
Table 1. Impact of partial observability (5%, 10%, 30%, 50%, and full observability). Values are given as Dynamic / Static.

Partial OBS | Type | acc | prec | rec | F1
5% | sampling rate | 0.90 / 0.94 | 0.74 / 0.86 | 0.74 / 0.86 | 0.74 / 0.86
10% | sampling rate | 0.92 / 0.96 | 0.79 / 0.89 | 0.79 / 0.89 | 0.79 / 0.89
30% | sampling rate | 0.92 / 0.96 | 0.79 / 0.89 | 0.79 / 0.89 | 0.79 / 0.89
50% | sampling rate | 0.92 / 0.96 | 0.79 / 0.89 | 0.79 / 0.89 | 0.79 / 0.89
100% | sampling rate | 0.94 / 0.96 | 0.84 / 0.89 | 0.84 / 0.89 | 0.84 / 0.89
avg | sampling rate | 0.92 / 0.95 | 0.79 / 0.88 | 0.79 / 0.88 | 0.79 / 0.88
5% | sensor failure | 0.82 / 0.83 | 0.56 / 0.58 | 0.58 / 0.60 | 0.57 / 0.59
10% | sensor failure | 0.86 / 0.89 | 0.66 / 0.72 | 0.66 / 0.72 | 0.66 / 0.72
30% | sensor failure | 0.87 / 0.90 | 0.68 / 0.76 | 0.68 / 0.76 | 0.68 / 0.76
50% | sensor failure | 0.87 / 0.90 | 0.68 / 0.76 | 0.68 / 0.76 | 0.68 / 0.76
100% | sensor failure | 0.92 / 0.96 | 0.79 / 0.89 | 0.79 / 0.89 | 0.79 / 0.89
avg | sensor failure | 0.87 / 0.90 | 0.67 / 0.74 | 0.68 / 0.75 | 0.68 / 0.74
Table 2. Impact of observation noise (Gaussian noise, Laplace noise, Poisson noise). Values are given as Dynamic / Static.

Noise | OBS | SNR | acc | prec | rec | F1
Gaussian noise | full obs | 10 dB | 0.92 / 0.96 | 0.81 / 0.90 | 0.81 / 0.90 | 0.81 / 0.90
Gaussian noise | full obs | 50 dB | 0.93 / 0.97 | 0.82 / 0.93 | 0.82 / 0.93 | 0.82 / 0.93
Gaussian noise | full obs | 80 dB | 0.93 / 0.97 | 0.82 / 0.93 | 0.82 / 0.93 | 0.82 / 0.93
Gaussian noise | 50% obs | 10 dB | 0.88 / 0.92 | 0.69 / 0.80 | 0.69 / 0.80 | 0.69 / 0.80
Gaussian noise | 50% obs | 50 dB | 0.92 / 0.97 | 0.81 / 0.92 | 0.81 / 0.92 | 0.81 / 0.92
Gaussian noise | 50% obs | 80 dB | 0.92 / 0.97 | 0.81 / 0.93 | 0.81 / 0.93 | 0.81 / 0.93
Laplace noise | full obs | 10 dB | 0.83 / 0.92 | 0.57 / 0.80 | 0.57 / 0.80 | 0.57 / 0.80
Laplace noise | full obs | 50 dB | 0.84 / 0.94 | 0.60 / 0.84 | 0.60 / 0.84 | 0.60 / 0.84
Laplace noise | full obs | 80 dB | 0.86 / 0.96 | 0.64 / 0.90 | 0.64 / 0.90 | 0.64 / 0.90
Laplace noise | 50% obs | 10 dB | 0.78 / 0.88 | 0.46 / 0.71 | 0.46 / 0.71 | 0.46 / 0.71
Laplace noise | 50% obs | 50 dB | 0.82 / 0.91 | 0.54 / 0.77 | 0.54 / 0.77 | 0.54 / 0.77
Laplace noise | 50% obs | 80 dB | 0.84 / 0.94 | 0.59 / 0.84 | 0.59 / 0.84 | 0.59 / 0.84
Poisson noise | full obs | 10 dB | 0.86 / 0.95 | 0.64 / 0.87 | 0.64 / 0.87 | 0.64 / 0.87
Poisson noise | full obs | 50 dB | 0.93 / 0.96 | 0.82 / 0.90 | 0.82 / 0.90 | 0.82 / 0.90
Poisson noise | full obs | 80 dB | 0.94 / 0.97 | 0.84 / 0.93 | 0.84 / 0.93 | 0.84 / 0.93
Poisson noise | 50% obs | 10 dB | 0.61 / 0.60 | 0.02 / 0.00 | 0.02 / 0.00 | 0.02 / 0.00
Poisson noise | 50% obs | 50 dB | 0.92 / 0.60 | 0.81 / 0.00 | 0.81 / 0.00 | 0.81 / 0.00
Poisson noise | 50% obs | 80 dB | 0.93 / 0.95 | 0.82 / 0.87 | 0.82 / 0.87 | 0.82 / 0.87
Table 3. Online goal recognition speed under partial observability. Values are given as Dynamic / Static.

Partial OBS | acc | Time (s)
5% | 0.90 / 0.94 | 10.68 / 10.17
10% | 0.92 / 0.96 | 10.55 / 9.90
30% | 0.92 / 0.96 | 10.54 / 9.73
50% | 0.92 / 0.96 | 10.53 / 9.67
100% | 0.94 / 0.96 | 10.51 / 9.66
avg | 0.92 / 0.95 | 10.56 / 9.83
Table 4. Online goal recognition speed under Gaussian noise. Values are given as Dynamic / Static.

SNR | Partial OBS | acc | Time (s)
10 dB | full obs | 0.92 / 0.96 | 11.00 / 10.30
50 dB | full obs | 0.93 / 0.97 | 10.86 / 10.12
80 dB | full obs | 0.93 / 0.97 | 10.63 / 10.00
avg | full obs | 0.93 / 0.97 | 10.83 / 10.14
10 dB | 50% obs | 0.92 / 0.96 | 11.18 / 10.83
50 dB | 50% obs | 0.93 / 0.97 | 11.17 / 10.72
80 dB | 50% obs | 0.93 / 0.97 | 11.07 / 10.72
avg | 50% obs | 0.93 / 0.97 | 11.14 / 10.76
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
