Article
Peer-Review Record

Robust Image Matching Based on Image Feature and Depth Information Fusion

Machines 2022, 10(6), 456; https://doi.org/10.3390/machines10060456
by Zhiqiang Yan, Hongyuan Wang *, Qianhao Ning and Yinxi Lu
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 5 May 2022 / Revised: 27 May 2022 / Accepted: 6 June 2022 / Published: 8 June 2022
(This article belongs to the Topic Intelligent Systems and Robotics)

Round 1

Reviewer 1 Report

Dear authors,

I would like to thank you for your efforts in producing this interesting paper.

The paper addresses a relevant topic of image matching based on image features and depth information fusion. The proposed method is tested on several datasets. The results of the experiments clearly show the ability and usability of the proposed approach for RGBD image matching. I found the paper generally well written, and I would suggest just a few minor improvements as listed below.

 

  • The introduction part summarizes the methods for image and point cloud feature extraction and their fusion. For this part, I have only one comment:

Although most of the methods listed here are well known, please add the definitions for the abbreviations, like SIFT, FAST, SURF, BRISK, ORB, etc.

 

  • Figure 10. – I have only a small suggestion for this figure – the circles illustrating the points should be above the other drawing (lines and other circles), as in the case of pk12.

 

  • line 220 – the definition for the YCB is missing.

 

  • line 243 – please add the definition for the RANSAC abbreviation.

Author Response

Dear Reviewers:

Thank you for your letter and comments concerning our manuscript entitled "Robust image matching based on image feature and depth information fusion" (ID: machines-1735972). These comments are all valuable and very helpful for revising and improving our paper, and they provide essential guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. Revised portions are marked up using the “Track Changes” function. The paper's primary corrections and our responses to the reviewer's comments are as follows.

Comment 1. The introduction part summarizes the methods for image and point cloud feature extraction and their fusion. For this part, I have only one comment: Although most of the methods listed here are well known, please add the definitions for the abbreviations, like SIFT, FAST, SURF, BRISK, ORB, etc.

Response: Thank you for your reminder. We have added definitions before these abbreviations in the introduction part, such as SIFT, FAST, SURF, BRISK, ORB, FREAK, etc.

 

Comment 2. Figure 10. – I have only a small suggestion for this figure – the circles illustrating the points should be above the other drawing (lines and other circles), as in the case of pk12.

Response: Thank you for your correction. We have placed the circles illustrating the points above the other elements (lines and other circles) in Figure 10 (now Figure 6), and Figure 8 (now Figure 5) has been modified in the same way.

 

Comment 3. Line 220 – the definition for the YCB is missing.

Response: Thank you for your reminder. YCB and KITTI are the names of RGBD datasets. We have added their definitions in lines 226 and 259, respectively.

 

Comment 4. line 243 – please add the definition for the RANSAC abbreviation.

Response: Thank you for your reminder. We have added the definition of RANSAC in line 255.

Author Response File: Author Response.pdf

Reviewer 2 Report

The idea of fusing image features with depth is definitely a good one. There is at least one critical flaw in the method that should be fixed and the experiments re-run before publication. I would expect this to improve the results. It makes sense to use the Hamming distance for ORB features, since each bit is a Boolean comparison between pairs of points. The order of points within the feature vector matters for matching, but the first bit is no more important than the last. Floating-point PFH or FPFH data is converted to binary and then appended to the ORB descriptor. The authors then still use the Hamming distance on this combined descriptor, even though bit position in the PFH or FPFH portion is significant. Using this method, points with PFH differences only in their most significant bit would be considered very similar, despite vastly different PFH fingerprint values. I think the best solution would be to treat them as a tuple: a PFH fingerprint to be compared with a floating-point distance and an ORB fingerprint to be compared using the Hamming distance.
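To make this concrete, here is a minimal sketch with hypothetical 8-bit values (not taken from the manuscript) showing how the Hamming distance misorders binarized magnitudes:

    # Hypothetical 8-bit fixed-point values, not from the manuscript.
    def hamming(a: int, b: int) -> int:
        # Number of differing bits between two 8-bit integers.
        return bin(a ^ b).count("1")

    # 127 and 128 are numerically adjacent but differ in every bit: distance 8.
    print(hamming(127, 128))  # 8
    # 0 and 128 are numerically far apart but differ only in the top bit: distance 1.
    print(hamming(0, 128))    # 1

So a nearest-neighbour search on the binarized PFH portion can prefer a numerically distant value over an adjacent one.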

Some details in the exposition also need to be clarified.

  • How, exactly, are the PFH and FPFH converted to binary? It mentions that they are normalized, so they could be converted to fixed-point integers. Alternately, they could take the floating point representation as a binary value. Of course, if you didn't convert to binary at all, this point would not matter.
  • There is an error in the Euclidean distance formula in equation 3. The per-component difference should be squared. I'm assuming this is just an error in the equation in the paper, and not actually a mistake in the original work.

Larger text edits:

  • My preference is for the introduction to explain why the problem is important and get to the point of "the main contributions of this paper are..." sooner, with the detailed coverage of related work after that. That's not a necessary change, but it'd sell the paper better if the reader is given more of a glimpse into what it offers sooner.
  • The coverage of the feature extraction methods in Section 2 is way too detailed and verbose for a paper that is not a survey or thesis. We can look up the details of each method if we don't already know them from the original reference. Tell us the most important points (scale invariant? rotation invariant? size of the feature vector and whether it is binary or floating point).

Smaller text issues:

  • Citations in the text should be last name only (Lowe et al., not David C. Lowe et al.). Change throughout
  • Line 50: 512-bit not 512Bit
  • Line 54: need a space between BRIEF and citation.
  • Line 57: FAST should be capitalized
  • Line 123: The words "point cloud" are used repetitively twice in the same sentence. 
  • Line 162: "according to a particular method" — what method?
  • Line 162: "N is generally", not "N generally takes"
  • Line 220 among other places through the text: "floating point" not "floating". You need the "point". "float" is probably OK, since it is the common name for this type in all C-like languages.
  • Line 238: The ORB descriptors are collections of Boolean values. The ORBPFH and ORBFPFH can at best be described as binary values. Unlike ORB, they are no longer an un-ordered collection of truth values (this is also why the Hamming distance is no longer valid)
  • Line 347 and also line 292 (in the references): "In Proceedings of the Proceedings".

Author Response

Dear Reviewers:

Thank you for your letter and comments concerning our manuscript entitled "Robust image matching based on image feature and depth information fusion" (ID: machines-1735972). These comments are all valuable and very helpful for revising and improving our paper, and they provide essential guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. Revised portions are marked up using the “Track Changes” function. The paper's primary corrections and our responses to the reviewer's comments are as follows.

Comment 1. The idea of fusing image features with depth is definitely a good one. There is at least one critical flaw in the method that should be fixed and the experiments re-run before publication. I would expect this to improve the results. It makes sense to use the Hamming distance for ORB features, since each bit is a boolean comparison between pairs of points. The order of points within the feature vector matters for matching, but the first bit is no more important than the last. Floating point PFH or FPFH data is converted to binary then appended to the ORB descriptor. They then still use the Hamming distance on this combined descriptor, except the bit position in the PFH or FPFH portion is significant. Using this method, points with PFH differences only in their most significant bit would be considered very similar, despite vastly different PFH fingerprint values. I think the best solution would be to treat them as a tuple of PFH fingerprint to be compared with floating point distance and ORB fingerprint to be compared using the Hamming distance.

Response: Thank you sincerely for your suggestion, which is of great help in improving our paper. Following your suggestion, we have carried out the experiment again. The updated experimental results are shown in Table 3 and Table 4. (During the experiment, we also found that the camera's intrinsic parameters had been set incorrectly, and we corrected them in this revision; this is why the results of SIFTPFH, SIFTFPFH, SURFPFH, and SURFFPFH in Table 3 and Table 4 also change.) It can be seen from Table 3 and Table 4 that the experimental results have been greatly improved. Although the matching strategy of ORBFPFH and ORBPFH has been changed, the feature matching time is essentially unchanged, so Table 1 and Table 2 do not change.

Finally, other relevant contents are modified as follows.

In the section of "Feature Fusion", the description of "(2) The image feature descriptor of ORB is a binary string. In order to maintain the advantages of small memory and fast speed of the binary descriptor when fused with the point cloud feature descriptors of PFH and FPFH, this paper directly converts the normalized point cloud feature descriptors into binary descriptors and then splices them with ORB to form the fusion feature descriptors of ORBPFH and ORBFPFH." is changed to the description of "(2) The image feature descriptor of ORB is a binary string, and the point cloud feature descriptors of PFH and FPFH are floating-point. In order to maintain the respective feature description ability of binary descriptor and floating-point descriptor, the data types of the two descriptors are kept unchanged and combined into a tuple, thereby obtaining fusion feature descriptors of ORBPFH and ORBFPFH."

In the section of "Feature Matching", the description of "The data types of the ORBPFH and ORBFPFH feature descriptors are Boolean values, so the Hamming distance is used as the similarity evaluation index of the feature point. The calculation process of the Hamming distance is: to compare whether each bit of the binary feature descriptor is the same. If not, add 1 to the Hamming distance." is changed to the description of " As mentioned earlier, the ORBPFH or ORBFPFH feature descriptor is a tuple in which the Hamming distance of the ORB descriptor is calculated, the Euclidean distance of the PFH or FPFH descriptor is calculated, and the two distances are added to obtain the final feature distance. The calculation process of the Hamming distance is: to compare whether each bit of the binary feature descriptor is the same. If not, add 1 to the Hamming distance." In addition, we added the definition of feature distance for ORBPFH or ORBFPFH descriptor matching, as shown in equations 4 and 5.
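For illustration, a minimal sketch of this revised feature distance (assuming a 256-bit ORB descriptor packed into 32 bytes and a 33-bin FPFH histogram; the variable names and sizes are illustrative, not taken from the manuscript):

    import numpy as np

    def fused_distance(orb_a, orb_b, pfh_a, pfh_b):
        # Hamming distance over the packed ORB bytes (binary part of the tuple).
        hamming = int(np.unpackbits(np.bitwise_xor(orb_a, orb_b)).sum())
        # Euclidean distance over the PFH/FPFH histogram (floating-point part).
        euclidean = float(np.linalg.norm(pfh_a - pfh_b))
        # Final feature distance: sum of the two distances.
        return hamming + euclidean

    # Example with random data: 32-byte ORB descriptors, 33-bin FPFH histograms.
    rng = np.random.default_rng(0)
    orb_a = rng.integers(0, 256, 32, dtype=np.uint8)
    orb_b = rng.integers(0, 256, 32, dtype=np.uint8)
    pfh_a = rng.random(33).astype(np.float32)
    pfh_b = rng.random(33).astype(np.float32)
    print(fused_distance(orb_a, orb_b, pfh_a, pfh_b))

The relative weighting of the two distance terms is discussed in the response to Comment 2 below.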

The updated experimental results are shown in the section of "Experiment and results", "Conclusions" and "Abstract".

The new experimental results have also been updated on the web page provided in the Supplementary Materials.

 

Comment 2. How, exactly, are the PFH and FPFH converted to binary? It mentions that they are normalized, so they could be converted to fixed-point integers. Alternately, they could take the floating point representation as a binary value. Of course, if you didn't convert to binary at all, this point would not matter.

Response: We originally converted the floating-point numbers in the PFH or FPFH descriptor to the nearest integer and then represented that integer as a binary number to form a binary descriptor. Following your suggestion, we now do not convert to binary at all. However, we find that because the norm of PFH or FPFH is small, the point cloud feature has little impact on the matching accuracy. Therefore, to increase the weight of the point cloud feature, we multiply it by a coefficient so that the norm of PFH or FPFH becomes close to the length of the ORB feature. The specific description is given in the "Feature Fusion" section and quoted below. We have also updated Figure 10 accordingly.

"Because the norm of PFH or FPFH is minor, to increase the weight of point cloud features, we usually multiply a coefficient to make the norm of PFH or FPFH after multiplication close to the length of ORB features."

 

Comment 3. There is an error in the Euclidean distance formula in equation 3. The per-component difference should be squared. I'm assuming this is just an error in the equation in the paper, and not actually a mistake in the original work.

Response: Sorry, we omitted the square in formula 3 when writing the paper. Thank you very much for your correction. We have corrected formula 3 by squaring the per-component difference.
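For reference, the corrected Euclidean distance (equation 3) in generic notation, with the per-component difference squared:

    d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}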

 

Comment 4. My preference is for the introduction to explain why the problem is important and get to the point of "the main contributions of this paper are..." sooner. With the detailed coverage of related work after that. That's not a necessary change, but it'd sell the paper better if the reader is given more of a glimpse into what it offers sooner.

Response: Thank you very much for your suggestion. We have modified the introduction as you suggested: it first explains why the problem is important and then gets to the point of "the main contributions of this paper are...", and we have added a "Related Work" section that covers the related work in detail. The specific modifications are shown in the "Introduction" and "Related Work" sections.

 

Comment 5. The coverage of the feature extraction methods in section 2 is way too detailed and verbose for a paper that is not a survey or thesis. We can look up the details of each method if we don't already know them from the original reference. Tell us the most important points (scale invariant? rotation invariant? Size of feature vector and whether it is binary or floating point).

Response: Thank you very much for your suggestion. This suggestion is excellent, and we tried keeping only the most critical points (such as scale invariance, rotation invariance, feature vector size, and whether the descriptor is binary or floating point). However, doing so made the manuscript's content look too thin, so we have kept the coverage of the feature extraction methods in Section 2. We have nevertheless made some adjustments to the content of Section 2 to make it more compact and less redundant, and we hope this meets with your approval. The specific modifications are as follows and are shown in the "Feature Extraction and Matching" section.

(1) Figure 2 and Figure 3 are combined into one figure;

(2) Figure 4 and Figure 5 are combined into one figure;

(3) Figure 6 and Figure 7 are combined into one figure.

(4) Figure 8 and Figure 9 are combined into one figure.

 

Comment 6. Citations in the text should be last name only (Lowe et al., not David C. Lowe et al.). Change throughout.

Response: Thank you for your correction, and we have changed all the citations in the text to last names.

 

Comment 7. Line 50: 512-bit not 512Bit.

Response: Thank you for your correction, and we have revised "512Bit" to "512-bit" in line 50 (now line 69).

 

Comment 8. Line 54: need a space between BRIEF and citation.

Response: Thank you for your reminder, and we have added a space between BRIEF and the citation in line 54 (now line 73).

 

Comment 9. Line 57: FAST should be capitalized.

Response: Thank you for your reminder. We have capitalized "fast", "brief", and "orb" in line 57 (now line 76), and corrected the sentence from "The contribution of ORB is that it adds fast and accurate direction components and efficient calculation for brief features to fast so that that orb can realize real-time calculation" to "The contribution of ORB is that it adds fast and accurate direction components to the FAST and efficient calculation for the BRIEF features so that ORB can realize real-time calculation".

 

Comment 10. Line 123: The words "point cloud" are used repetitively twice in the same sentence.

Response: Thank you for your correction, and we have deleted the first "point cloud" in line 123 (now line 128).

 

Comment 11. Line 162: "according to a particular method" — what method?

Response: Sorry, we did not make it clear. We have replaced the description "and then selects N pairs of points in this block according to a particular method" with "and then randomly selects N pairs of points in this block".

In fact, ORB generates a rotated BRIEF descriptor by randomly selecting N pairs of points. As shown in the figure below, five random selection methods were tested in the original ORB paper. The original authors conclude that the second random method gives better results, so this paper also adopts the second method.

Fig. 1. Different approaches to choosing the test locations, as tested in the original ORB paper.
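For illustration, a minimal sketch of that second sampling strategy (assuming a 31×31 patch and both points of each pair drawn from an isotropic Gaussian centred on the patch; the parameter values are illustrative, not taken from the manuscript):

    import numpy as np

    def sample_test_pairs(n_pairs=256, patch_size=31, seed=0):
        # Draw N point pairs inside the patch; both points of each pair are
        # sampled i.i.d. from a Gaussian with sigma = patch_size / 5 and
        # clipped to the patch (illustrative parameters).
        rng = np.random.default_rng(seed)
        half = patch_size // 2
        pts = rng.normal(0.0, patch_size / 5.0, size=(n_pairs, 2, 2))
        return np.clip(np.round(pts), -half, half).astype(int)  # (N, 2, 2) offsets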

 

Comment 12. Line 162: "N is generally", not "N generally takes".

Response: Thank you for your correction, and we have corrected "N generally takes" to "N is generally" in line 162 (now line 168).

 

Comment 13. Line 220 among other places through the text: "floating point" not "floating". You need the "point". "float" is probably OK, since it is the common name for this type in all C-like languages.

Response: Thank you for your reminder, and we have replaced "floating" with "floating point" throughout.

 

Comment 14. Line 238: The ORB descriptors are collections of Boolean values. The ORBPFH and ORBFPFH can at best be described as binary values. Unlike ORB, they are no longer an un-ordered collection of truth values (this is also why the Hamming distance is no longer valid).

Response: Thank you sincerely for your correction and suggestion. According to your comment 1 and this comment, we have also made corresponding corrections here, as shown below.

In the section of "Feature Matching", the description of "The data types of the ORBPFH and ORBFPFH feature descriptors are Boolean values, so the Hamming distance is used as the similarity evaluation index of the feature point. The calculation process of the Hamming distance is: to compare whether each bit of the binary feature descriptor is the same. If not, add 1 to the Hamming distance." is changed to the description of " As mentioned earlier, the ORBPFH or ORBFPFH feature descriptor is a tuple in which the Hamming distance of the ORB descriptor is calculated, the Euclidean distance of the PFH or FPFH descriptor is calculated, and the two distances are added to obtain the final feature distance. The calculation process of the Hamming distance is: to compare whether each bit of the binary feature descriptor is the same. If not, add 1 to the Hamming distance."

 

Comment 15. Line 347 and also line 392 (in the references): "In Proceedings of the Proceedings".

Response: Thank you for your reminder, and we have replaced "In Proceedings of the Proceedings" with "In Proceedings of" in line 347 (now line 363) and line 392 (now line 408), and checked the other references. Since we used the EndNote software to manage the references, these changes may not appear as tracked modifications.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

The combination of RGB and depth images has shown advantages in different tasks. One of the challenges is how to fuse the features. In this article, the authors explore fusing traditional features extracted from RGB-D images. The target is interesting, but the paper is not well written. For instance, the abbreviation should be consistent: RGB-D or RGBD? In the introduction, “[…] various new RGBD depth cameras […]” is incorrect; the “D” in “RGB-D” means depth. In the “Feature Extraction and Matching” section, the authors state “[…] according to the pixel correspondence between RGB and depth images […]”. Their sizes are usually different, so how is the pixel correspondence established? The techniques introduced in the article look like a simple integration of existing traditional methods. The literature review is insufficient: it can be noted that the references are old, and many popular papers about RGB-D image feature fusion are not mentioned.

Author Response

Dear Reviewers:

Thank you for your letter and comments concerning our manuscript entitled "Robust image matching based on image feature and depth information fusion" (ID: machines-1735972). These comments are all valuable and very helpful for revising and improving our paper, and they provide essential guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. Revised portions are marked up using the “Track Changes” function. The paper's primary corrections and our responses to the reviewer's comments are as follows.

Comment 1. The abbreviation should be consistent. RGB-D or RGBD?

Response: Thank you for your reminder, and we have replaced "RGBD" with "RGB-D" throughout the manuscript.

 

Comment 2. In the introduction “[…] various new RGBD depth cameras […]” is incorrect. The “D” in “RGB-D” means depth.

Response: Thank you for your correction, and we have replaced "RGB-D depth cameras" with "RGB-D cameras" throughout the manuscript.

 

Comment 3. In the “Feature Extraction and Matching” section, the authors stated “[…] according to the pixel correspondence between RGB and depth images […]”. The sizes of them are usually different, so how to establish the pixel correspondence?

Response: Thank you for your suggestion. In fact, as you point out, depth images are generally obtained by a LiDAR or ToF camera, and RGB images are generally captured by a visible-light camera. Due to the limitations of LiDAR and ToF camera hardware, the size of the depth image is often smaller than that of the RGB image. However, we can obtain the transformation matrix between the visible-light camera and the LiDAR or ToF camera through camera calibration. The depth image is then projected into the coordinate system of the RGB image, and its size is made consistent through resampling. Since the existing public RGB-D datasets (such as YCB or KITTI) have already done this for us, and the RGB and depth images in these datasets have the same size, we apologize for not stating this in the previous manuscript. We have added some descriptions (from line 133 to line 142) in the "Feature Extraction and Matching" section, as follows.

“It is worth mentioning that the depth image is generally obtained by a LiDAR or ToF camera, and the RGB image is generally taken by a visible-light camera. Due to differences in camera hardware, the sizes of the RGB image and the depth image are often different. However, the transformation matrix between the visible-light camera and the LiDAR or ToF camera can be obtained through camera calibration. The depth image is then projected into the coordinate system of the RGB image, and its size is made consistent through resampling. The existing public RGB-D datasets usually have already done this, and the RGB and depth images in these datasets have the same size. Therefore, these transformations are not described in detail in this paper.”
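For illustration, a minimal sketch of this projection (K_depth, K_rgb, and T_depth_to_rgb are hypothetical calibration inputs, not values from the paper; nearest-neighbour resampling, no occlusion handling):

    import numpy as np

    def register_depth_to_rgb(depth, K_depth, K_rgb, T_depth_to_rgb, rgb_shape):
        # Back-project valid depth pixels to 3D points in the depth camera frame,
        # transform them into the RGB camera frame, and resample onto the RGB grid.
        h, w = depth.shape
        us, vs = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.ravel().astype(np.float64)
        keep = z > 0
        pix = np.stack([us.ravel()[keep], vs.ravel()[keep], np.ones(keep.sum())])
        pts = np.linalg.inv(K_depth) @ pix * z[keep]          # 3D points, depth frame
        pts_h = np.vstack([pts, np.ones(pts.shape[1])])       # homogeneous coordinates
        proj = K_rgb @ (T_depth_to_rgb @ pts_h)[:3]           # project with RGB intrinsics
        u = np.round(proj[0] / proj[2]).astype(int)
        v = np.round(proj[1] / proj[2]).astype(int)
        out = np.zeros(rgb_shape, dtype=np.float64)
        ok = (proj[2] > 0) & (u >= 0) & (u < rgb_shape[1]) & (v >= 0) & (v < rgb_shape[0])
        out[v[ok], u[ok]] = z[keep][ok]
        return out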

 

Comment 4. The techniques introduced in the article look like the simple integration of existing traditional methods.

Response: Thank you for your comments. In fact, our aim is to propose an RGB-D feature fusion method, and this feature fusion idea may be our main innovation. The proposed method may be compatible with almost all feature point-based RGB image descriptors and feature point-based point cloud descriptors. At the same time, in order to suit practical engineering applications, we chose existing classical and state-of-the-art feature description methods for RGB images and point clouds in our experiments. The experimental results show that the proposed method can indeed improve the registration accuracy of RGB-D images.

 

Comment 5. The literature review is insufficient. It can be noted the references are old, and many popular papers about RGB-D image feature fusion are not mentioned.

Response: Thank you for your suggestion. We have added some new reference papers about RGB-D image feature fusion in the section of "Related Work", as shown below.

"Especially with the development of artificial intelligence technology, many feature extraction and fusion technologies based on deep learning technology have emerged, such as reference [31-34]. These methods always need a large amount of data to train network models. However, obtaining these extensive training sample data may be difficult under some application conditions. Therefore, this paper mainly discusses the traditional feature extraction and fusion methods."

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The authors have adequately addressed all of my concerns from the previous draft. I would be quite happy to see this published in its current version.

I would suggest trying to avoid page breaks in the middle of tables. None of these tables are long enough to need to span more than one page.

Author Response

Dear Editors and Reviewers:

Thank you for your letter and comments concerning our manuscript entitled "Robust image matching based on image feature and depth information fusion" (ID: machines-1735972). These comments are all valuable and very helpful for revising and improving our paper, and they provide essential guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. Revised portions are marked up using the “Track Changes” function. The paper's primary corrections and our responses to the reviewer's comments are as follows.

 

Responses to Reviewers

To Reviewer 2:

Comment 1. I would suggest trying to avoid page breaks in the middle of tables. None of these tables are long enough to need to span more than one page.

Response: Thank you for your suggestion; all page breaks in the middle of tables have been removed.

Reviewer 3 Report

I thank the authors for revising the manuscript. However, I have to say the authors did not address my concerns well. It is not acceptable to hear “[…] The existing public RGB-D datasets usually have done this for us […]”. The proposed method is expected to process images in the wild, so who did this for you? I appreciate that the authors have added more figures to “enhance” their algorithm. Unfortunately, these are only additional introductions to the existing methods. The proposed method is a simple integration of several existing methods. I, thus, have to suggest rejecting this submission.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

The authors have addressed my main concerns. I agree to accept it in its current version.
