1. Introduction
One of the most important sub-tasks of autonomous navigation, especially navigation bound to an unstructured environment, is an intelligent agent’s ability to perform collision-free path planning. It is paramount that the robot’s surroundings are understood to its best ability, such that it can safely navigate in difficult environments. Traditional approaches rely on heuristic rules of handcrafted features for solving the path planning task. However, these methods may not extend well enough in the context of dynamic environments.
With the accelerated development of computing hardware, the Artificial Intelligence domain has also seen an enormous rise in popularity. As a consequence, an alternative approach to self-driving tasks is based on AI algorithms, commonly implemented as deep neural networks. These algorithms, which are usually trained using sensory data acquired from the autonomous agent, show potential in generalizing to new and/or evolving environments.
Some AI-based approaches to the autonomous driving challenge divide the navigation problem into smaller sub-problems and approach them individually. Others model the entire navigation task as a black-box, expecting sensor data inputs and directly outputting vehicle commands. Some of these algorithms use RGB image data in order to detect obstacles in the vicinity of the autonomous agent and plan collision-free trajectories based on the detected obstacles.
In this paper, we propose a different approach for tackling the autonomous navigation problem, by using a recurrent neural network to model the dynamics of the environment in which the robot navigates. The environment dynamics, in the context of this work, refer to dynamic obstacles present in the scene. Recurrent neural networks are a type of model that can capture temporal dependencies and process sequence data, like objects moving within the scene. By incorporating historic data in the form of previous sensor observations from the mobile robot, a Recurrent Neural Network (RNN) can improve the accuracy and robustness of a collision prediction algorithm, especially in complex, unstructured environments.
We have chosen a recurrent neural network architecture due to its ability to capture temporal dependencies within the input data. This enables the encoding of obstacle dynamics within its layers and allows for using this additional information in the path planning component in order to generate a safer trajectory for the mobile robot.
We conducted our experiments with data collected while the mobile robot shown in
Figure 1a was navigating on a forest road, with dynamic obstacles present in its vicinity. Dynamic obstacles were represented by human subjects approaching the mobile robot while it navigated in the forest road environment. Our experiments indicate a better performance than the baseline.
The main contributions of this paper are as follows:
The deep neural network architecture used to encode the dynamics of the robot’s environment and predict future sensor observations based on previously gathered information.
The self-supervised training of our encoder network with processed sensor information from an RGB-D camera.
The processing of the acquired RGB-D data in the scope of reducing the search space of the planning algorithm.
The usage of a custom asymmetrical loss function for training the neural network.
The path planning component, which incorporates the predicted observations when generating safe trajectories for the mobile robot.
The rest of the work is organized as follows. The related work is covered in
Section 2. Our proposed vision dynamics algorithm is presented in
Section 3. The experimental setup, a detailed description, and comparisons with other algorithms are presented in
Section 4. The conclusions are presented in the final section.
2. Related Work
In recent years, the development of autonomous navigation solutions has seen a rise in popularity in the scientific community, mainly due to the advances of artificial intelligence. Several papers which explore various aspects of the autonomous navigation task, like perception and control, have been published.
Some research treats the entire autonomous navigation problem as a black-box. The proposed algorithms expect sensory information as input to a deep learning model and directly output control signals. For example, one such approach can be found in reference [
1], where the authors train a convolutional neural network to directly map pixel information from a monocular camera to vehicle steering commands. This approach tries to simplify the autonomous navigation problem by eliminating the necessity of individual sub-systems like lane marking detection or path planning. Another approach uses the Deep Reinforcement Learning (DRL) technique, where the agent is controlled via action–reward systems and is stimulated to take actions to advance toward its goal, as in [
2].
Both End-to-End and DRL approaches show promising results in a controlled simulated environment; however, the deployment of such algorithms in real-world scenarios is difficult, due to the high unpredictability of the environment dynamics.
One example of Deep Reinforcement Learning being used in an unstructured navigation task is the work presented in [
3]. The authors used a simulated environment to perform the training of their model, enabling the system to learn navigation policies and then perform transfer learning as
sim-to-real.
Another example of Reinforcement Learning employed to solve the unstructured navigation problem is presented in [
4]. Their proposal contains a framework based on an uncertainty-aware learning based model and a self-supervised traversability estimation model. The traversability model is then integrated with a model predictive controller, which optimizes the vehicle’s trajectory while simultaneously avoiding obstacles present in the environment.
Other work relies on object detection algorithms in order to identify obstacles from the scene and to generate collision-free trajectories for the autonomous agent. Given the large number of publicly available driving databases, object detection based on image data is a widely researched area as well. In [
5], the authors propose training a state-of-the-art object detector algorithm in the hope of improving performance for self-driving scenarios. Wang et al. proposed a collision detection mechanism based on a convolutional neural network in [
6] within the context of autonomous driving. Pramanik et al. proposed an object detection and tracking system which handles video sequences in [
7].
More recently, in [
8], an obstacle avoidance method was proposed based on object detection, with certain improvements in their model, like in the variable learning rate.
Object detection algorithms have high computational requirements and necessitate various optimizations and improvements before being able to run in real-time. They also face issues such as robustness to occlusions, illumination changes, and object deformation.
Another approach to the autonomous driving challenge, particularly to the collision avoidance strategy, is based on occupancy grids, where the environment is represented as a grid of cells, each cell having an assigned probability that indicates whether or not the respective cell is occupied. Shepei et al. proposed an occupancy grid generation algorithm based on stereo image input in [
9]. Their architecture is based on ResNet, trained to highlight static and dynamic obstacles within the scene.
Another type of occupancy grid generation algorithm makes use of LiDAR data instead of stereo cameras, such as the one proposed in [
10], where the authors integrate this algorithm into a quadrotor system and perform benchmark flights. Usually, algorithms that handle LiDAR data suffer from high computation costs and limited scalability, as they require the processing of large amounts of data.
Recurrent neural networks have also been in the spotlight for autonomous navigation solutions. For example, the authors of [
11] present a multi-step prediction of drivable space in the form of occupancy grid sequences, using an RNN architecture that was trained on the KITTI dataset, in the scope of learning motion-related features from the input sequences.
There are also approaches to the unstructured navigation problem which do not rely on artificial intelligence. One example of such a work is [
12], where the authors use
A* for calculating the global trajectory for their robotic platform, combined with a local planner and a reactive planner to ensure that the robot will keep track of the computed global trajectory.
A similar approach for a legged robot without using artificial intelligence is presented in [
13]. In this study, the authors propose an end-to-end path planning and navigation system which is used by the robot to navigate in its’ environment. The legged robot’s trajectory is generated with the use of a nonlinear optimization method.
The method proposed in [
14] uses LiDAR data, coupled with an autonomous mapless navigation method based on [
15] in order to perform legged robot navigation in an unstructured, urban environment. This approach combines ranging sensing with Deep Reinforment Learning and classical path planning in order for the robot to successfully avoid the dynamic obstacles present in an urban environment.
These works demonstrate the potential of using object detection and deep-learning-based approaches for obstacle avoidance in dynamic environments. However, they still face challenges such as robustness to occlusions, illumination changes, and object deformation. The training process for these object detectors is label-intensive as well. Additionally, some of the methods use expensive sensors (e.g., LiDAR) to map the mobile robot’s environment and avoid collisions with obstacles.
Basing our work on [
16,
17,
18], we propose a vision dynamics approach for solving the unstructured navigation problem for a legged mobile robot, based on processing information coming from an RGB-D sensor through the use of a recurrent neural network which predicts future sensory observations and feeds these observations into a modified DWA algorithm in order to produce an obstacle-free trajectory for the mobile robot. We demonstrate the effectiveness of the approach in a real-world scenario during a navigation task on a forest road.
3. Methodology
The following notation is used throughout this paper. Superscript <t> is used to denote a single discrete time step t. For example, a depth sensor measurement at time step t will be written as ${\theta}^{<t>}$. Predicted values are defined with the hat notation. A predicted observation at time step $t+n$ is defined as ${\widehat{\theta}}^{<t+n>}$. Vectors are represented by bold symbols. A vector of sensor observations is marked as ${\Theta}^{<t-\tau ,t-1>}$, which represents a series of sensor observations from time step $t-\tau $ to time step $t-1$.
3.1. Problem Definition
Given a set of historical depth sensing observations ${\Theta}^{<t-\tau ,t-1>}$, the current sensor observation ${\theta}^{<t>}$, the mobile robot’s current state ${s}^{<t>}=[{x}^{<t>},{y}^{<t>},{v}^{<t>},{\varphi}^{<t>}]$, and a global reference trajectory ${\mathbf{Z}}_{ref}^{<t-\infty ,t+\infty >}$, the objective is to construct a recurrent deep neural network model which can encode within its layers the dynamics of the environment and, based on historical sensor observations, can predict future observations for a finite time horizon $[t+1,...,t+n]$, where each time step $t+1,t+2,...$ represents one future sampling time, bound by a time constraint of 100 ms. The agent’s belief, modelled with the neural network, is passed to the path planner algorithm based on the Dynamic Window Approach and a collision-free local trajectory is generated for the mobile robot: ${\mathbf{Z}}_{L}^{<t+1,t+n>}$.
Both local and global trajectories are represented as sequences of vehicle states: ${\mathbf{Z}}_{ref}=\left\{[{x}_{i},{y}_{i},{v}_{i},{\varphi}_{i}]\right|\forall i\in [1,...,N]\}$ and, respectively, ${\mathbf{Z}}_{L}=\left\{[{x}_{i},{y}_{i},{v}_{i},{\varphi}_{i}]\right|\forall i\in [1,n]\}$, where ${x}_{i}$ and ${y}_{i}$ are the Cartesian coordinates of the vehicle, ${v}_{i}$ the velocity, and ${\varphi}_{i}$ is the orientation angle of the vehicle with respect to its initial orientation. n is the length of the generated local trajectory, ${\mathbf{Z}}_{ref}$ is the global reference trajectory, and ${\mathbf{Z}}_{L}$ is the generated local trajectory.
3.2. Scene Dynamics Encoder Network
The architecture of the encoder model is presented in
Figure 2. The observations coming from the RGB-D sensor are not directly processed by the network, but undergo a series of preprocessing operations beforehand, as described in Equations (
2)–(
4). The input sequence (historic observations) is accumulating in the
Encoder memory unit, formatted appropriately by the
Sequence formatter block, and then passed through to the recurrent neural network. The
Clustering unit clusters the predictions coming from the neural network and eliminates outliers. Encoding scene dynamics involves processing data coming from the RGB-D sensor in order to create a representation of the current scene, as well as predicting the evolution of the scene over a fixed time period. LSTM networks are well suited for encoding scene dynamics due to their ability to learn the temporal dependencies of the input data and use these dependencies for predicting future states.
The encoder model is based on an LSTM architecture which contains multiple cells for processing time-series data in the form of historical depth sensor observations of length $\tau $. The output of the model is a sequence of future observations of length n.
Depth information coming from the RGB-D sensor is not fed directly into the network, but it is processed beforehand.
First, using depth measurements from the sensor and, respectively, the depth camera intrinsic matrix, the three-dimensional environment is recreated:
where
${X}_{i}$,
${Y}_{i}$, and
${Z}_{i}$ are the three-dimensional coordinates of the points
P;
${D}_{img}$ is the depth image. For every point in the depth image domain, corresponding
$3D$ points are calculated:
where
j is the column coordinate of the current pixel from the depth image,
i is the row coordinate of the current pixel from the depth image,
${c}_{x}$ and
${c}_{y}$ are the optical center coordinates,
${f}_{x}$ and
${f}_{y}$ are the focal distance coordinates, and
z is the depth value at coordinates
$(i,j)$.
Second, only candidate points that are closer than a threshold value are kept:
where
p is the filtered point,
${p}_{c}$ is the candidate point,
o is the vehicle coordinate system origin,
${T}_{min}$ is the minimum threshold value,
${T}_{max}$ is the maximum threshold value, and
d is the L2 norm, calculated as
$\left|\right|o-{p}_{c}{\left|\right|}^{2}$.
Third, points that are below or above a certain height are eliminated:
where
p is the filtered point,
${p}_{c}$ is the candidate point,
${T}_{min}$ is the minimum threshold value,
${T}_{max}$ is the maximum threshold value, and
${Y}_{{p}_{c}}$ is the height of the candidate point
${p}_{c}$. Eliminating points below the minimum threshold will also eliminate points that belong to the ground, therefore easing further computation steps.
The resulting point cloud is then converted into a voxel representation in order to reduce the input data dimension:
where
$v(i,j,k)$ is a 3D space within the voxel grid.
Furthermore, the voxel representation is then projected onto a top-view occupancy grid:
These two-dimensional grids accumulate into a historical sequence that is then fed as input to the sequence formatter module from our algorithm.
A visual representation of the processing chain applied to the input depth images can be seen in
Figure 3. The history sequences
${\mathbf{V}}_{2D}^{<t-\tau ,t-1>}$ are, again, processed by the sequence formatter module, which reshapes them to a suitable format for the LSTM module of our algorithm.
At the heart of the scene dynamics encoder model there is a stacked LSTM module with n layers. The output of each layer is then fed through a series of fully connected layers in order to produce the predictions at time steps $[t+1,...,t+n]$.
The resulting predictions
${\widehat{\mathbf{V}}}_{\mathbf{2}\mathbf{D}}^{<t+1,t+n>}$ coming from the neural network, in the form of predicted voxel grids for future time steps
$[t+1,...,t+n]$, are passed to the
clustering unit for further processing. The clustering unit takes each predicted observation
${V}_{2D}^{<t+i>}$ and performs DBSCAN, according to [
19]. The output of this layer is, for each time step
$t+i$, a list of centroids plus the object size corresponding to each obstacle present in the current time step. Obstacle points are sampled based on the centroids and sizes provided by the clustering unit for each predicted time step
$t+i$ and are accumulated in order to be fed to the DWA algorithm.
The clustering unit also acts as a filter mechanism for the output data, using the Log Odds method.
The data processing main workflow is presented in Algorithm 1.
Algorithm 1 Processing loop | |
- Require:
${\Theta}^{<t-\tau ,t-1>}$ - Require:
${\theta}^{<t>}$ - Require:
$obstacles=\left\{\right\}$ - 1:
for
$i\in [t-\tau ,t]$
do - 2:
${P}_{3D}^{<i>}\leftarrow $ to_pcl(${\theta}^{<i>}$) - 3:
${P}_{3D}^{<i>}\leftarrow $ reject_outliers(${P}_{3D}^{<i>}$) - 4:
${V}_{3D}^{<i>}\leftarrow $ to_3d_voxel(${P}_{3D}^{<i>}$) - 5:
${V}_{2D}^{<i>}\leftarrow $ project_2d(${V}_{3D}^{<i>}$) - 6:
end for - 7:
$m\leftarrow $$\left\{{V}_{2D}^{<i>}\right|\forall i\in [t-\tau ,t-1]\}$ - 8:
${\widehat{\mathbf{V}}}_{2D}^{<t+1,t+n>}\leftarrow $$m\left({\mathbf{V}}_{2D}^{<t-\tau ,t-1>}\right)$ - 9:
${\mathbf{c}}_{j}\leftarrow $ cluster_outputs(${\widehat{\overrightarrow{v}}}_{2D}^{<t+1,t+n>}$, ${V}_{2D}^{<t>}$) - 10:
for
$j\in [0,{c}_{len}]$
do - 11:
$obstacles\leftarrow {c}_{j}$ - 12:
end for
| |
The RNN has been trained in a self-supervised way, over a number of 1000 epochs, using an Adam optimizer with a learning rate of 0.001. The history size used for training was $\tau =50$ and the prediction horizon size was $n=30$. This is equivalent to 5 s for the history sequence and 3 s for prediction, respectively. For training the network, we used a total number of 30 K pairs of input historical depth observations and target predictions.
A custom asymmetrical loss function was used, as shown in Equation (
8), because of the sparse nature of the output sequence data.
where
$\alpha $ is a scale factor that increases the loss value for predictions that diverge from target values and
${1}_{+}$ is a function defined as follows:
After experimenting with the architecture and the loss function, the scale factor $\alpha $ was set to $0.1$.
A visual representation of the asymmetrical loss function is presented in
Figure 4. It can be observed that the function penalizes the model if the predicted voxels are marked as occupied, when, instead, they should be marked as free-space.
3.3. Path Planning Algorithm
For controlling the mobile robot, we used the Zero Moment Point (ZMP) algorithm [
20], which is described briefly in the following:
The Zero Moment Point can be described as the point
$\overrightarrow{z}$ on the ground where the net moment along the vertical axis passing through
$\overrightarrow{z}$ is zero:
where
${\overrightarrow{r}}_{i}$ is the position of the
i-th foot contact point relative to
$\overrightarrow{z}$ and
${\overrightarrow{f}}_{i}$ is the ground reaction force at the
i-th foot contact point
Assuming the center of mass does not move, the ZMP can be approximated as follows:
where
m is the mass of the robot,
g is the gravitational acceleration, and
$\overrightarrow{p}$ is the position of the robot’s center of mass.
The desired target ZMP
${\overrightarrow{z}}_{d}$ is calculated based on the desired motion (e.g., walking forward):
where
f is a function that maps the desired motion to the target ZMP. The robot motion is then adjusted to bring the actual ZMP close to the target ZMP. The desired ZMP
${\overrightarrow{z}}_{d}$ was chosen based on the output of the Dynamic Window Approach algorithm, by selecting the destination point from the optimal generated trajectory. Our DWA implementation is based on a simplified non-holonomic robot model commonly encoutered in wheeled locomotion:
where
x and
y represent the robot’s position in the 2D birds-eye view moving space,
$\Theta $ is the heading, and
$\delta $ the desired direction.
DWA first calculates a dynamic window w for the commands of the robot, based on physical constraints regarding velocity and acceleration. Each command ${v}_{t+1},{\delta}_{t+1}$ from the generated dynamic window is used to produce a candidate trajectory. Afterwards, predictions coming from the scene dynamics encoder network are used, together with current sensor observations to reject candidate trajectories that intersect any obstacle (predicted or observed). Each candidate trajectory that was not rejected in the previous step is then analyzed with respect to several performance indicators, and the following penalties are computed (see Algorithm 2):
Goal cost (${\u03f5}_{G}$), which represents the distance between the Cartesian coordinates of the final state on the candidate trajectory and the coordinates of the goal point located on the global reference trajectory.
Orientation cost (${\u03f5}_{\varphi}$), which represents the difference between the orientation angle from the final state of the candidate trajectory and the orientation angle from the reference orientation angle.
Velocity cost (${\u03f5}_{v}$), which represents the difference between the mobile robot’s velocity and the reference velocity.
Smoothness cost (${\u03f5}_{s}$), which represents the difference between the previous command and the current command (if there is a significant difference between successive commands, then the motion of the mobile robot will become course).
Trajectory cost (${\u03f5}_{t}$), which represents the sum of distances between Cartesian coordinates of the candidate trajectory and the reference trajectory.
Algorithm 2 DWA control loop | |
- Require:
${\mathbf{Z}}_{ref}^{<t-\infty ,t+\infty >}$ - Require:
${\widehat{\mathbf{V}}}_{2D}^{<t+1,t+n>}$ - Require:
${V}_{2D}^{<t>}$ - Require:
${s}^{<t>}$ - Require:
$ZMP\leftarrow \left[\right]$ - 1:
$w\leftarrow $ dwa_window() - 2:
$mi{n}_{C}=\infty $ - 3:
for
$v,\delta \in w$
do - 4:
candidate ← dwa_loop($v,\delta $) - 5:
candidate ← reject_collision(candidate, ${s}^{<t>},{\widehat{\mathbf{V}}}_{2D}$) - 6:
if candidate $\ne \varnothing $ then - 7:
${\u03f5}_{G}\leftarrow $ goal_cost($v,\delta ,{\mathbf{Z}}_{ref}$) - 8:
${\u03f5}_{\varphi}\leftarrow $ orientation_cost($v,\delta ,{\mathbf{Z}}_{ref}$) - 9:
${\u03f5}_{v}\leftarrow $ velocity_cost($v,\delta ,{\mathbf{Z}}_{ref}$) - 10:
${\u03f5}_{s}\leftarrow $ smoothness_cost($v,\delta ,{v}^{<t-1>},$ - 11:
${\delta}^{<t-1>}$) - 12:
${\u03f5}_{t}\leftarrow $ trajectory_cost($v,\delta ,{\mathbf{Z}}_{ref}$) - 13:
$C={K}_{G}{\u03f5}_{G}+{K}_{\varphi}{\u03f5}_{\varphi}+{K}_{v}{\u03f5}_{v}+{K}_{s}{\u03f5}_{s}+{K}_{t}{\u03f5}_{t}$ - 14:
if $C<mi{n}_{C}$ then - 15:
$mi{n}_{C}\leftarrow C$ - 16:
$ZMP\leftarrow candidate\left[n\right]$ - 17:
end if - 18:
end if - 19:
end for
| |
The collision rejection method parses all the centroids returned by the scene dynamics encoder module and rejects all candidate trajectories which intersect obstacles, either seen in the current timestamp observations or predicted by the neural network. After evaluating the total cost C for each of the candidate trajectories, the one with the minimum cost is selected and the ZMP gets updated.
4. Experiments
For evaluating our algorithm, we used sequences of data recorded during the navigation process of our mobile robot in the unstructured environment, more specifically, a forest road navigation task. The mobile robot shown in
Figure 1a navigated within the environment, while potential collision situations were generated. In the meantime, depth sensor data were captured and analyzed during navigation. A global reference trajectory
${\mathbf{Z}}_{ref}^{<t-\infty ,t+\infty >}$ was recorded for the mobile robot.
Samples of the data processed by the algorithm can be viewed in
Figure 5. Depth images—
Figure 5b—are used coupled with the camera parameters in order to obtain a 3D representation of the scene—
Figure 5c. Then, 3D points are processed according to
Section 3.2 and the three-dimensional voxel grid is computed—
Figure 5d. This voxel grid is afterwards projected in a top-down view—
Figure 5e. RGB and depth images were collected while the legged robot navigated on forest roads and interacted with humans, who acted as obstacles for our algorithm. In
Figure 5a, a multiple-obstacle scenario is presented; the two persons are approaching the legged robot while it records RGB and depth images. The thresholding performed on the reconstructed three-dimensional points can be observed in
Figure 5c. Ground points, as well as points that are too high or too far away from the robot, are eliminated from the resulting point cloud before the 3D voxel grid is computed.
The experiments were structured as follows:
Training and testing data were acquired from the navigation scenario.
The scene dynamics encoder network was trained using the preprocessed training data.
The path planning algorithm was evaluated on the preprocessed test data.
We defined a safe distance $s=0.5$ m for our mobile robot. Each violation of this distance, which meant that the mobile robot approached any obstacle within a radius less than s, automatically triggered a collision event. Collision events were counted, producing the first quality measure: ${n}_{col}$.
The second evaluation metric was the cross-track error, defined as the difference between the reference trajectory
${Z}_{ref}$ and the generated local trajectory:
where
$f\left(x\right)$ is the polynomial approximation of the generated local trajectory evaluated in
x and
${y}_{i}$ is the measured point
y coordinate.
The third evaluation metric was the orientation error:
where
${\varphi}^{<i>}$ is the robot’s orientation at time step
i and
${\varphi}_{d}^{<i>}$ is the desired orientation from the global reference trajectory.
In order to reproduce potentially hazardous situations for the mobile robot, we experimented with people interacting with it during navigation. Interactions consisted of people approaching the robot from different angles and with varying velocity, in order to simulate dynamic obstacles.
We simulated four scenarios: front collision, side left, side right, and multiple obstacles. For the first scenario, a single person approached the mobile robot head on, while for the side collision left and side collision right scenarios, the human would approach the robot from the left and from the right sides, respectively. For the multiple obstacles setup, we considered two persons for our experiments.
We compared our approach to the classic Dynamic Window Approach path planner, configured with the same dynamic window generation parameters, as described in the
Section 3.
Additionally, we strictly compared our scene dynamics encoder network with a state-of-the-art object detection algorithm, YoloV7 trained on the MS COCO dataset, as in the original implementation from [
21], in terms of accuracy, false positive rate—
$F{P}_{R}$ and time to collision—
$ttc$ on our test data. The MS COCO dataset contains annotations for object detection (bounding boxes), semantic and instance segmentation, etc. [
22] The time to collision is the estimated time elapsed between the first available detection of an obstacle until the respective obstacle crosses the safe distance threshold.
The results of the comparison between path planning algorithms are summarized in
Table 1, while the results of the comparison between neural networks are presented in
Table 2.
It can be observed that our proposal surpasses the baseline in all four test scenarios, both in the first experiment, as a path planning component, and in the second one, as a means of detecting dynamic obstacles present in the scene.
The number of collisions ${n}_{c}$ was significantly lower in all four testing scenarios in our approach than in the case of classical DWA. This suggests that including predicted observations in the trajectory generation step improves the obstacle avoidance capability of the system. The cross-track error ${\overline{e}}_{ct}$ and orientation error ${\overline{e}}_{\varphi}$, respectively, were predominantly lower using our proposed algorithm than in the case of classical DWA. This is caused by the experimental tuning of the parameters included in our version of DWA (trajectory cost ${\u03f5}_{t}$ and orientation cost ${\u03f5}_{\varphi}$).
The time to collision ($ttc$), false positive rate ($F{P}_{R}$), and accuracy (A) values were better in our approach than the YoloV7 variant because, in our proposal, the algorithm had more data to decide upon, due to the inclusion of predictions for future voxel grids. Additionally, using depth information instead of image data benefits the algorithm due to the fact that there is no need to perform error-prone object recognition.
Ablation Study
For the purpose of assessing the performance of the components from our algorithm, we performed three ablation studies.
In the first one, we compared three loss functions for our neural network: Mean Squared Error, Mean Absolute Error, and Asymmetrical Loss, respectively. It can be observed that the Asymmetrical Loss function converges the fastest out of the three functions, while the loss value is the lowest. We compared the three functions over 500 epochs for data in the validation set. The results can be viewed in
Figure 6.
The second study concerns the ablation of the prediction horizon size versus the time to collision estimate (
ttc) and false positive rate
$F{P}_{R}$. It is obvious that, by increasing the prediction horizon size, the
$ttc$ values improve slightly; however, this also produces a higher
$F{P}_{R}$. Our experiments suggest that the increase of the prediction horizon size generates more noise in the predicted observations, which implicitly causes the algorithm to signal more collisions. The results of the prediction horizon size ablation are shown in
Figure 7.
For the third ablation study, we compared the performance of the algorithm for different values of the history size parameter. For the comparison, we used the following values (in seconds):
$[1,3,5,7,10]$. It can be observed that increasing the history size to a value larger than 5 s does not bring significant improvements to our algorithm. There is a slight improvement in the test result; however, the disadvantages outweigh the benefits. The results of the history size ablation are presented in
Figure 8.
5. Conclusions
In this paper, we have introduced a vision dynamics learning approach for mobile robot navigation in unstructured environments. Our model is based on a recurrent neural network that processes historical data coming from a depth sensor and produces predictions for future time steps over a fixed prediction horizon. We also described the process of converting the depth image into a three-dimensional point cloud, and then voxelizing the point cloud to a bidimensional grid of voxels that can serve as input for our deep learning model.
We have evaluated our model on real-world data, which was gathered during the mobile robot’s navigation on a forest road. The ability to directly encode scene dynamics using a deep learning model has significant implications for robotics, as it can enable a more accurate understanding of the surroundings and more precise interactions between the robot and its environment, thus contributing to better decision making in tasks such as autonomous navigation.
Future work can explore using more complex neural network architectures, as well as integrating additional sensor inputs to the deep learning model, such as images or LiDAR data, in order to further improve the accuracy and robustness of the model.