Article
Peer-Review Record

METRIC—Multi-Eye to Robot Indoor Calibration Dataset

Information 2023, 14(6), 314; https://doi.org/10.3390/info14060314
by Davide Allegro *, Matteo Terreran and Stefano Ghidoni
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 14 April 2023 / Revised: 18 May 2023 / Accepted: 24 May 2023 / Published: 29 May 2023
(This article belongs to the Special Issue Computer Vision, Pattern Recognition and Machine Learning in Italy)

Round 1

Reviewer 1 Report

The paper presents a dataset for robot calibration.

The dataset is made of real data and synthetic data.

The paper is a description of a particular setup in the context of the development of a robot.

The paper has the following drawbacks:

- it is very specific to a use case

- there is no scientific methodology to evaluate the results

* There are no theoretical limits or best expected precision given

* The comparison with other datasets is incomplete: how should different data be compared?

- The synthetic data are not realistic: it can be seen that lighting, shadows, non-uniformity of surfaces, textures, etc. are missing. There is no proof of whether these missing details alter the results or not.

But, more embarrassingly, there is no clear takeaway message: what can I learn from this paper if I am doing my own robot calibration, for my own setup? What methodology should I apply?

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper "METRIC - Multi-Eye To Robot Indoor Calibration Dataset" addresses a scenario that is central to many industrial and research processes, when a robotic manipulator is operating in a workcell and is monitored by a network of statically mounted optical sensors that surround the cell. In order to assess the pose of the manipulator and control the geometry of its movement, the sensor network must be calibrated. In particular, it is necessary to determine the relative sensor positions and orientations to each other, as well as the position of the robot with respect to the network. The present manuscript describes a sample calibration dataset that can be used to evaluate various calibration tools and workflows. The dataset includes images obtained by different types of sensors in a controlled lab environment, and synthetic images, augmented by the ground truth data.

The text is well-written and has a straightforward structure, and the figures and tables are legible and have sufficiently detailed captions.

There are only a few datasets such as the one presented in this paper in the public domain. Therefore, the current work will definitely be valuable to the community. Nevertheless, we believe that the text requires multiple modifications in order to be suitable for publication in the Information journal.

In particular, the manuscript does not specify any data formats, file types and sizes, metadata types, scene characteristics, etc. that will eventually be provided in the dataset. (As of writing, the text contains no actual references to the uploaded data.)

Another glaring problem is the lack of information about the specific data acquisition conditions and processing toolchains applied to each type of data. What kind of illumination was present in the lab during the data collection? What were the parameters of optics and light sensors? Was any automatic adjustment or internal data processing involved? Which optical properties and effects have been accounted for in the simulation toolchain?

In their discussion of data quality with respect to several existing calibration methods, the authors only very superficially discuss the common data processing steps and the quality of intermediate outcomes. In particular, the intrinsic calibration of all the sensors is supposed to have been provided by a separate process, but neither the details of that procedure nor any quality characteristics of the intrinsics are reported.

Other major processing steps, such as corner detection or AprilTag identification, may also appear and perform quite differently for the LIDAR, ToF, and stereoscopic data. It would be very instructive for the prospective users to see the actual workflows and parameter sets. Given such secrecy, it is very difficult to interpret Tabs. 4-19 and derive any conclusions, as the main constituents of the error budget remain unknown.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Dear authors,

In your paper you describe a dataset for testing camera2camera and camera2robot extrinsic calibration.

Generally speaking, in my opinion such a dataset for comparing different algorithms has the following drawback: it only says something about how the algorithms perform on your specific choice of calibration setup (which calibration object(s) to use, where to put it/them). For different camera setups (number of cameras, distance between cameras, overlap between camera images, opening angles, ...) and different calibration setups, the results are likely to differ, so only a very limited general conclusion can be drawn from testing on your dataset when I want an answer to the question: which algorithm performs best on my setup?

Besides that argument, I have the following major concern, which in my opinion must be addressed: with your calibration setup it is impossible to estimate a suitable calibration, especially the camera2camera calibration. I would regard as "suitable" a camera2camera calibration whose rotation error is on the order of the angle subtended by one pixel, so that we can reliably estimate a 3D position from >= 2 corresponding image points (or, in other words, the epipolar line roughly touches the correct pixel). E.g. for the D455 used, this would be about 65°/800 = 0.08° in the vertical direction, which is some orders of magnitude away from the comparison against GT that you provide.
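
As a rough back-of-the-envelope check of this threshold, here is a minimal sketch; the 65° vertical field of view and 800 pixel rows are the approximate values used in the comment above, not numbers taken from the paper:

# Angle subtended by a single pixel, used above as the target order of magnitude
# for a "suitable" camera2camera rotation error.
fov_vertical_deg = 65.0   # approximate vertical field of view (assumed)
pixels_vertical = 800     # approximate vertical resolution (assumed)

angle_per_pixel_deg = fov_vertical_deg / pixels_vertical
print(f"angle per pixel: {angle_per_pixel_deg:.3f} deg")   # roughly 0.08 deg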

The reason for this bad result is your calibration setup: the calibration object is much, much too small in the image. You always argue that if the calibration object is further away, the pose becomes more inaccurate because the corner detection accuracy decreases. But the main reason for the less accurate pose is a geometric one: even if the accuracy of your image coordinates stayed the same, the pose accuracy would decrease significantly because you will have a glancing intersection of the rays at the projection center: the angles between the observation rays are simply so small that no reliable intersection point can be found. (This is similar to the depth accuracy of a stereo pair: the standard deviation of the depth grows with the square of the distance because the intersection angle becomes too small.)

So, with the small area the calibration object occupies in the image, you always get a glancing intersection. The dataset therefore only shows what happens when an (sorry for my strong word) improper calibration setup is used.
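
To illustrate the stereo analogy mentioned above, here is a minimal sketch of the usual first-order relation, sigma_Z approximately Z^2 * sigma_d / (f * B); the focal length, baseline and disparity noise below are made-up example values, not parameters of the sensors in the paper:

# First-order depth uncertainty of a stereo pair with constant disparity noise:
# doubling the distance roughly quadruples the depth uncertainty.
f_px = 600.0        # focal length in pixels (assumed)
baseline_m = 0.095  # stereo baseline in metres (assumed)
sigma_d_px = 0.25   # constant disparity noise in pixels (assumed)

for z_m in (1.0, 2.0, 4.0, 8.0):
    sigma_z_m = z_m**2 * sigma_d_px / (f_px * baseline_m)
    print(f"Z = {z_m:4.1f} m  ->  sigma_Z = {sigma_z_m:.3f} m")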

Thus I'd like to see an (e.g. additional) setup where, for instance, the checkerboard is so big that it fills a reasonable part (e.g. at least 25%) of the images. (If this is not practical, more than one calibration object, e.g. coded targets, should be used.)


=============================================================

One small minor comment for the whole paper:

hand-eye calibration: differentiate more clearly between
- camera mounted on the manipulator
- camera fixed relative to the robot, thus observing the manipulator.

In other words, which reference frame plays which role; please make clear which case you are aiming for.


=============================================================
Text-specific minor comments:

p.2 l70-73: Note that this can be done differently:
- measure some points in the room / on the robot with known coordinates in the robot's base system
- use e.g. circles or spheres as targets ==> no corner detection needed
So be less strict in your formulation.

Chapter Related works: all procedures I am aware of do some minimization of the reprojection error, regardless of whether they explicitly mention this or not. They may differ in the cost or weighting used (e.g. L1 instead of L2 minimization), but it is always the reprojection error. E.g. the sentence on p.5, l.183-186 sounds as if no reprojection error were minimized in the PnP problem. That is not really true: here rays are used instead of image coordinates, and minimizing a "distance" between rays has the same effect as minimizing "distances" between the image coordinates.
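
A minimal sketch of this equivalence, with made-up intrinsics: for small errors, the angle between the observed ray and the reprojected ray is approximately the pixel error divided by the focal length, so the two cost functions penalize essentially the same quantity:

import numpy as np

f, cx, cy = 600.0, 320.0, 240.0                               # example intrinsics (assumed)
K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def ray(u, v):
    # Unit direction of the viewing ray through pixel (u, v).
    d = K_inv @ np.array([u, v, 1.0])
    return d / np.linalg.norm(d)

u, v = 400.0, 300.0            # reprojected point
u_obs, v_obs = 402.0, 300.0    # observed point, 2 px away

pixel_error = np.hypot(u_obs - u, v_obs - v)
angle_error = np.arccos(np.clip(ray(u, v) @ ray(u_obs, v_obs), -1.0, 1.0))
print(pixel_error, np.degrees(angle_error), np.degrees(pixel_error / f))
# ~2 px and ~0.19 deg: the ray angle and the pixel error are proportional for small errors.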

Figure 1: It seems that the poses of the calibration pattern do not form a circle, so cameras 2 and 3 see less of the pattern. This does not seem optimal for obtaining a result with homogeneous accuracy. Why is this the case?

p.9 l 316: I would argue the other way around: a sensor with a larger FoV allows for more measurements with larger angles between the rays, and thus also better accuracy of the extrinsic parameters (if a suitably sized calibration object is used). Think of a setup where you have put coded targets all around the room. Then a camera with a larger FoV would see more of them, with larger angles between the rays, and thus give better accuracy. And the overlapping area between cameras is bigger. The main point is: you have to USE the larger FoV, i.e. adapt the calibration object(s). That must be made clear to the reader, to avoid the incorrect impression that in general you can reach better accuracy with a smaller FoV.

p.9 l. 327: Again the same issue: the pure distance to the AprilTag says nothing about "high accuracy": if the AprilTag does not have a suitable size (a large part of the image should be occupied by the tag), the accuracy is bad. In this case, either use more than one tag (with known relative poses) or use a large tag. As there is no information about the tag size in the image, it is impossible to judge whether the relative pose from camera to AprilTag is accurate enough to play the role of a ground truth.

p9, formula (2): What happens if, by chance, two transformations rotate by the same angle but around different axes? Then your metric would be zero, but we would still have different rotations. (Alternative: the (mean of the) 3 rotation angles around the X, Y, Z axes of R_1^T * R_2.)
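
A minimal sketch of such an alternative measure, using OpenCV's Rodrigues conversion (the two 10° example rotations are invented): the angle of the relative rotation R_1^T * R_2 is zero only when the two rotations are actually identical:

import numpy as np
import cv2

def rotation_difference_deg(R1, R2):
    # Angle of the relative rotation R1^T * R2 (geodesic distance on SO(3)).
    rvec, _ = cv2.Rodrigues(R1.T @ R2)
    return np.degrees(np.linalg.norm(rvec))

# Two rotations by the same angle (10 deg) but around different axes:
Rx, _ = cv2.Rodrigues(np.array([[np.radians(10.0)], [0.0], [0.0]]))
Ry, _ = cv2.Rodrigues(np.array([[0.0], [np.radians(10.0)], [0.0]]))
print(rotation_difference_deg(Rx, Ry))   # about 14 deg, clearly not 0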

p10, formulas (3),(4):
An extrinsic calibration of a camera network of N cameras delivers N-1 relative poses, e.g. the poses of the second to fourth cameras in the camera coordinate system of the first camera. So I do not understand at all why you are
- counting every relative pose twice (e.g. both T_12 and T_21 are in the sum)
- using an overdetermined setting (given T_12, T_13, T_14, e.g. T_23 can be computed as T_13 * inv(T_12))
These sums over N*(N-1) terms only make sense if you do pairwise alignment, i.e. some stereo relative-pose algorithm (and that twice per pair), which was never mentioned in the text. Usually the results come from a common (bundle) adjustment where there are (N-1)*6 unknowns for N poses. If some algorithm delivers overdetermined results (so more than N-1 poses), I would do a pose-graph optimization to get a unique result (which should have better accuracy). For me, the sums should be over N-1 elements.
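
A minimal sketch of the redundancy argument above (all numbers are invented; T_ij denotes the transform from camera i's frame to camera j's frame, matching the composition written above): given T_12 and T_13, T_23 is already determined and is not an independent measurement:

import numpy as np
import cv2

def make_T(rvec_deg, t):
    # Build a 4x4 homogeneous transform from an axis-angle rotation (in degrees) and a translation.
    R, _ = cv2.Rodrigues(np.radians(np.asarray(rvec_deg, float)).reshape(3, 1))
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

T_12 = make_T([0.0, 20.0, 0.0], [1.0, 0.0, 0.5])     # frame 1 -> frame 2 (example values)
T_13 = make_T([0.0, -15.0, 5.0], [-0.8, 0.1, 0.6])   # frame 1 -> frame 3 (example values)

T_23 = T_13 @ np.linalg.inv(T_12)   # derived from the other two, not an extra measurement
print(np.round(T_23, 3))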

formula (4) and formula (6):
For the deviations, why do you depart from the standard setting and do NOT report standard deviations, i.e. using the L2 norm sqrt( 1/(n-1) * sum( (e - mu)^2 ) )? The L1 norm used here usually only makes sense if there are outliers and you want them to have a small influence. So, for me, the L1 norm makes no sense at this point; at the very least it must be discussed why you depart from the standard used throughout statistics (variance / standard deviation).
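
For clarity, a small sketch of the two dispersion measures discussed here, on invented error values: the sample standard deviation (the standard choice) versus the mean absolute deviation (the L1-style measure, which mainly pays off when outliers are present):

import numpy as np

e = np.array([0.12, 0.09, 0.15, 0.11, 0.40])   # example errors, one mild outlier
mu = e.mean()

std_l2 = np.sqrt(np.sum((e - mu) ** 2) / (len(e) - 1))   # sample standard deviation
mad_l1 = np.mean(np.abs(e - mu))                         # mean absolute deviation

print(f"standard deviation (L2): {std_l2:.3f}   mean absolute deviation (L1): {mad_l1:.3f}")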

- Table 4, Figure 4: Please mention the dataset (the synthetic one) to avoid
confusion

- p. 11, l369: lens distortion does not cause blurred images; instead, lens distortion (if not properly corrected by the intrinsic calibration) can be the reason for a deformation of the 3D reconstruction and of the trajectory/pose estimation.

- Figure 6: I do not see any light reflection that could be a problem for the detection of the checkerboard. Please explain better what you mean / show it with an arrow / use a zoomed-in view.

- p11, l400: "disparity" is the difference in the x-coordinates of corresponding points in a stereo pair. Do you mean just "difference"?

- Figure 7: the names of the approaches are not consistent with Table 7




Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

The manuscript presents the METRIC dataset, which is important in applications such as object detection and tracking with possible occlusions. The manuscript looks good, but some comments are listed below:

1) Per line 115-118:

The authors have mentioned that the real dataset aims to evaluate the robustness of the cameras in real-world images. The reviewer wants to see some experimental results of the outlined benchmarks that focus only on studying the robustness of the cameras on such a dataset. Did the benchmarks fail in certain scenarios? A failure analysis is expected for such a discussion.

2) Please be careful about the indentation when starting a new paragraph; see lines 59, 74, 86, 119. Only a part of such typos has been listed; please check the manuscript throughout.

 

Language is generally fine, but minor editing is required. 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Dear authors,

Thanks for your detailed answers to my comments and the changes you made to the paper. I think the paper has improved.

Still, I'd like to draw your attention back to an argument in my major concern that I did not find in your revised text. I wrote:

"Even if the accuracy of your image coordinates would stay the same,
the pose accuracy would decrease significantly because you will
have a glancing intersection of the rays at the projection center".

I think this is a very important statement. To convince you, I wrote a small program in which you can see the effect of the glancing intersection in a plot (distance from camera to control points vs. error). You will find it at the end of this text. Again: even if you assume CONSTANT accuracy of the image coordinates, the accuracy of the resulting pose (especially the Z == distance to the control points) drops drastically, more than linearly! Thus the estimation process becomes more difficult BECAUSE OF THE GLANCING INTERSECTION, independent of how well you can measure your points in the image.

So please mention this in your paper, because it is a very important message for the reader.

Another thing that I would really appreciate seeing in the final version: you mention several times that for a large work cell it is difficult to get a good calibration. The reader may get the impression that it is not possible in general to get a good calibration in this setup. That is not really true: if you placed control points (e.g. AprilTags) in the room so that they appear in all the corners of the images, and measured them, e.g. manually, so that you have them in a common coordinate system, then you would get a good calibration. I know that this is not fancy, but it is a way to get a good calibration in this kind of setup. If you mentioned this possibility as an alternative for people (like me) who would like better accuracy than the values you reported, that would be very nice.

--------------------------------------------------------------------
Short minor comment on Figure 6:

Thanks for the zoom; now I see that the actual problem is not image blur (in the usual sense that the image is not sharp) but over-exposure: here the classical problem appears where over-exposed pixels "bleed" into their neighbouring pixels. This is a problem of the automatic exposure setting, which tries to optimize for the whole image (and not for the calibration pattern). If you like, you can mention over-exposure as an additional challenge.


Comment on the size of the AprilTag:
Please consider making the AprilTag larger: if I am right, it is currently smaller than the checkerboard. For an AprilTag you only get 4 corners, while for the checkerboard you get 6x4 = 24 corners. I would assume that the per-point accuracy in the image of the AprilTag corners does not differ too much from that of a (good) checkerboard detector in an unproblematic view. Thus the camera pose accuracy using the smaller AprilTag may be even worse than for the checkerboard in a good view ... which is not good for a "ground truth".


 


--------------------------------------------------------------------
Short thank you:
Regarding your response 4:
Thanks for pointing out the terms “robot-world hand-eye calibration” versus
“hand-eye calibration”; I was not aware of the difference.


-------------------------------------------------------------------
Here is my small demo program (I hope not destroyed by the text formatter):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import cv2
import numpy as np
import matplotlib.pyplot as plt
# Define Groundtruth:

# The 3D coordinates: a grid in XY-plane
X = np.zeros((5*5,3))
for x in range(5):
  for y in range(5):
    X[x*5+y,0] = -0.10 + 0.05*x
    X[x*5+y,1] = -0.10 + 0.05*y
    
# Camera intrinsics
K = np.asarray([[300, 0, 200],
              [0, 300, 320],
              [0, 0, 1]], dtype = "double" )

 
collected_errors=[]
distances =  np.arange(-0.1,-20,-1)
for distance in distances:
  trial_errors=[]
  for trial in range(1000):
    # Ground-Truth camera extrinsics
    X_center=np.array([[0.2,0.3,distance]]).T     
    R = np.eye(3)
    P = K @ R @ np.concatenate((np.eye(3), -X_center), axis=1)
    
    # Ground-Truth image observations
    x = P @ np.concatenate((X.T, np.ones((1,5*5))), axis=0)
    for i in range(x.shape[1]):
      x[0,i] /=x[2,i]
      x[1,i] /=x[2,i]          
      
    x = x[0:2,:].T
                               
    # CONSTANT Noise on image obs, INDEPENDENT OF DISTANCE
    x = x + 0.5 * np.random.randn(x.shape[0], x.shape[1])
    
    # cv2.solvePnP expects image points as (x, y) pixel coordinates, so no row/column swap is needed.
    
    dist_coeffs = np.zeros((4,1)) # Assuming no lens distortion
    (success, rotation_vector, translation_vector) = cv2.solvePnP(X, x, K, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
     
    print("Rotation Vector:\n {0}".format(rotation_vector))
    print("Translation Vector:\n {0}".format(translation_vector))
    
    # Camera centre from the PnP solution: C = -R^T * t (R is close to identity here).
    R_computed, _ = cv2.Rodrigues(rotation_vector)
    X_center_computed = -R_computed.T @ translation_vector
          
    trial_errors.append(np.abs(X_center_computed[2] - distance ))    
    
  collected_errors.append(np.mean(trial_errors))
 
 
# Plot error versus distance: with constant image noise, the Z error grows faster than linearly.
plt.figure(1)
plt.clf()
plt.plot(distances, collected_errors, 'r-')
plt.xlabel('camera distance to control points [m]')
plt.ylabel('mean absolute Z error [m]')
plt.show()
 
