Article
Peer-Review Record

An Incremental Inverse Reinforcement Learning Approach for Motion Planning with Separated Path and Velocity Preferences

by Armin Avaei 1,†, Linda van der Spaa 1,2,*,†, Luka Peternel 1 and Jens Kober 1
Submission received: 7 March 2023 / Revised: 1 April 2023 / Accepted: 14 April 2023 / Published: 20 April 2023
(This article belongs to the Topic Intelligent Systems and Robotics)

Round 1

Reviewer 1 Report

This paper proposes an inverse reinforcement learning (IRL) approach for motion planning. The novelty is that path and velocity preferences are separated: the method learns and optimizes them separately. The technique is reasonable, and it seems natural to me that it works. However, I have concerns about its significance and how beneficial it is in practice. I elaborate in my comments below.

Major Comments:

1) Separating path and velocity preferences should be useful in terms of data efficiency and computational efficiency, because it may be easier to learn a pair of d-dimensional linear functions than to learn one 2d-dimensional linear function. Similarly, it might be more efficient to optimize these two functions separately (and sequentially). However, it seems to me that this approach makes it impossible to capture or learn preferences that require interplay between positions and velocities, e.g., position-dependent velocity preferences (see the first sketch after these major comments). The paper should discuss whether this is true. If it is, it should be mentioned as a limitation.

2) According to Equation (4), the updates due to human feedback are applied to point estimates, similar to gradient descent algorithms. However, the initial learning from demonstrations is done in a Bayesian way (Equation (3)). I wonder why the corrections are not also incorporated in a Bayesian way (see the second sketch after these major comments). The paper cites Bajcsy et al. [3] as a reference for this approach, but more recent techniques allow corrections to be incorporated in a Bayesian manner; see Jeon et al.'s "Reward-rational (implicit) choice: A unifying formalism for reward learning".

3) Relatedly, I disagree with the claim in Section 5 that Bajcsy et al.'s work [3] is incomparable. The paper argues that this is the case because that work does not consider the same or a similar set of features. However, the method proposed in that work is feature-agnostic; it is really a learning algorithm, and modifying the set of features it uses would not make it a substantially different method. Similarly, many works in this field propose feature-agnostic methods, e.g., [13] and [17], which the paper cites.

4) In fact, it is surprising to me that the authors chose to present the paper as if the features were fixed and to claim that these features are good enough for most manipulation tasks. Again, I disagree with this claim. Most manipulation tasks require object-specific features. The current set of features might be adequate for reaching tasks, but fine manipulation requires more. I recommend changing the presentation so that it conveys that the novelty of this work is the separation of path and velocity preferences and that the features used are application-dependent.

5) The results of the experiment in Section 4.1 are difficult to comment on because there is no baseline. We do not really know how users would rate trajectories generated by other methods.

6) For the experiment in Section 4.2, a list of bullet points is given for the path preferences. These simply state whether the corresponding weights are positive or negative. But what about the trade-offs between them? Do they come from the user? This should be clarified.
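
Illustrative sketch for comment 1. This is a minimal sketch, assuming only the linear reward structure discussed above; the feature maps, their dimensions, and the coupled feature at the end are hypothetical, chosen solely to illustrate why a reward that is linear in separate path and velocity features cannot express position-dependent velocity preferences.

```python
import numpy as np

# Hypothetical feature maps over a trajectory given as positions (T, 3) and velocities (T, 3).
def path_features(positions):
    # d = 2 path features, functions of positions only
    return np.array([
        np.mean(positions[:, 2]),                   # e.g., average height
        np.min(np.linalg.norm(positions, axis=1)),  # e.g., closest approach to the origin
    ])

def velocity_features(velocities):
    # d = 2 velocity features, functions of velocities only
    speed = np.linalg.norm(velocities, axis=1)
    return np.array([np.mean(speed), np.max(speed)])  # average and peak speed

# Separated model: two d-dimensional linear rewards, learned and optimized independently.
def separated_reward(w_path, w_vel, positions, velocities):
    return w_path @ path_features(positions) + w_vel @ velocity_features(velocities)

# Joint model: a single linear reward over features of the full trajectory. Because its
# feature set can include terms that couple positions and velocities, it can express
# position-dependent velocity preferences; the separated parameterization above cannot,
# regardless of the learned weights.
def joint_reward(w, positions, velocities):
    proximity = 1.0 / (1e-3 + np.linalg.norm(positions, axis=1))
    speed = np.linalg.norm(velocities, axis=1)
    coupled = np.mean(proximity * speed)            # e.g., "slow down near the origin"
    phi = np.concatenate([path_features(positions),
                          velocity_features(velocities),
                          [coupled]])
    return w @ phi
```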
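
Illustrative sketch for comment 2. This is a minimal sketch of the contrast between the two update styles, not of the paper's actual Equations (3) and (4); the feature map, step size, and rationality coefficient are hypothetical. The point-estimate update shifts a single weight vector toward the features of the human-corrected trajectory, whereas a Bayesian treatment in the reward-rational-choice sense reweights a posterior over weight vectors and so retains uncertainty after each correction.

```python
import numpy as np

def features(traj):
    """Hypothetical trajectory feature map phi(xi); traj has shape (T, d)."""
    return np.asarray(traj).mean(axis=0)

# Point-estimate update (coactive-learning style): move the current weight estimate
# toward the features of the corrected trajectory and away from the planned one.
def point_estimate_update(w, corrected_traj, planned_traj, alpha=0.1):
    return w + alpha * (features(corrected_traj) - features(planned_traj))

# Bayesian update: treat the correction as a noisily rational choice of the corrected
# trajectory over the planned one and reweight a sample-based posterior over w.
def bayesian_update(w_samples, posterior_weights, corrected_traj, planned_traj, beta=5.0):
    advantage = w_samples @ (features(corrected_traj) - features(planned_traj))
    likelihood = 1.0 / (1.0 + np.exp(-beta * advantage))  # Boltzmann/Luce choice model
    posterior_weights = posterior_weights * likelihood
    return posterior_weights / posterior_weights.sum()
```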

Minor Comments:

7) Line 174 uses the term "coactive learning" without explaining what it is. Is it simply learning from physical corrections? An explanation should be added.

8) Missing citation in Line 253.

9) Margin error in Line 277.

10) What are the numbers in Figure 5? Are they the numbers of corrections, demonstrated trajectories, etc.? This should be explained in the caption.

11) A relevant work that the authors may find interesting is Katz et al.'s "Preference-based learning of reward function features", where the goal is to learn a linear reward function from preferences; however, a neural network is used to learn additional nonlinear functions. This is relevant to the discussion the authors have in Section 6.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

1- Please ask a native English speaker to revise the writing of your paper.

2- The paper requires more references, especially in the introduction section.

3- The introduction requires a more extended discussion; the paper arrives at the authors' own work sooner than the reader expects.

4- I don't think that "demo" is a good choice for the text in Figure 2.

5- You can use a better colour bar for Fig. 9.

6- Please check the paper template to see whether it is possible to use 1a and 1b as equation numbers.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The scientific paper is excellent.

I don't have any comments!

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I thank the authors for their detailed response. I am glad to see my comments have been addressed in the updated manuscript.

 
