In the first thread, we first employ Mask R-CNN to detect dynamic objects in each input frame, performing pixel-wise segmentation of dynamic objects with prior information (e.g., humans). However, the semantic segmentation model exposes some issues in dynamic detection: movable objects such as chairs cannot be semantically segmented due to the lack of prior information in the neural network model, and the legs or hands of humans may not be segmented due to model accuracy limitations. These unfiltered dynamic feature points may introduce additional noise and instability, affecting the robustness and accuracy of navigation. Therefore, based on the segmentation results from Mask R-CNN, we propose a geometric dynamic feature filtering algorithm to further filter out objects without prior information and unsegmented human limbs. In the second thread, ORB feature points are extracted from the current image frame, and feature point matching is accelerated by using the Bag-of-Words model [16] on the BRIEF descriptors of the candidate point pairs. First, we establish a bidirectional scoring strategy for filtering out highly dynamic feature points. This strategy exploits the geometric discrepancy between corresponding edges in adjacent frames, assigning abnormal scores to the two feature points on each such edge. Subsequently, we estimate a transformation matrix from the filtered feature points to remove additional outliers; during this process, a function for evaluating the similarity between key points in two images is optimized. This module effectively segments both dynamic objects without prior information and potential dynamic objects identified by the CNN. Ultimately, the filtered static feature points are used for pose estimation.
2.1. Bidirectional Scoring Strategy for the Filtering of Dynamic Feature Points
Figure 2 shows a conceptual diagram of the bidirectional scoring strategy for the filtering of dynamic feature points. The part inside the blue rectangular box is the accuracy constraint for matching points. In the image feature point matching stage, a matching point pair was classified as a reliable match only when the smallest Hamming distance was significantly smaller than the second-smallest Hamming distance, i.e., a ratio test on the two best candidates. The portion inside the green rectangular box represents our bidirectional scoring strategy. After passing through the accuracy constraint for image feature points, the feature point pairs had a relatively accurate matching relationship, but dynamic feature points were also retained. To further filter out the dynamic feature points in a scene, this study introduced a geometric constraint model of the image feature points, as shown in
Figure 3. The query image $I_q$ and the target image $I_t$ are two adjacent frames, and the sampling time interval between them is very short; therefore, the projection distortion caused by the camera's pose transformation is very small. The triangle vertices in the figure represent the extracted ORB feature points, with three matched pairs forming the two triangles $\triangle ABC$ and $\triangle A'B'C'$ on $I_q$ and $I_t$; the vertices $A$, $B$, and $C$ are the three feature points on $I_q$, and they match the three feature points $A'$, $B'$, and $C'$ on $I_t$, respectively. Furthermore, each side, $d$, of a triangle is the Euclidean distance [17] between its two feature points.
In the context of dynamic SLAM, the decision to employ an exponential function for the geometric constraint scoring function stems from its distinctive properties. The exponential function maps the disparities in geometric constraints onto a non-negative range. Through exponentiation, this design ensures that scores corresponding to minor differences in geometric constraints converge towards 1, while scores associated with more significant differences grow rapidly. This mapping accentuates subtle differences in the geometric constraints, enhancing sensitivity and facilitating the differentiation between static and dynamic feature points.
Apart from the exponential function, other functions can also be used to characterize geometric constraint scoring functions, but they all have certain limitations. For instance, linear functions may lack sensitivity in representing minor geometric constraint differences due to their linear variations. Logarithmic functions may prove to be overly responsive, particularly in the presence of minor geometric differences. Sigmoid functions may exhibit saturation under certain circumstances, resulting in inadequate sensitivity in extreme cases.
The selection of the exponential function is guided by its unique ability to distinctly express score differences, providing a nuanced representation of changes in geometric relationships. In the dynamic SLAM domain, where sensitivity to subtle geometric differences is pivotal, the exponential function emerges as a commonly employed and effective choice.
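To make the comparison above concrete, the small script below maps a small and a large normalized geometric difference through each candidate scoring function. The specific forms of the linear, logarithmic, and sigmoid alternatives are our illustrative choices (each scaled so that a zero difference scores 1), not functions taken from the paper:

```python
import math

# Candidate scoring maps applied to a normalized geometric difference x >= 0.
# Each is scaled so that f(0) = 1, matching the exponential baseline.
score_maps = {
    "exponential": lambda x: math.exp(x),                 # -> 1 as x -> 0, grows fast
    "linear":      lambda x: 1.0 + x,                     # uniform sensitivity
    "logarithmic": lambda x: 1.0 + math.log1p(x),         # compresses large x
    "sigmoid":     lambda x: 2.0 / (1.0 + math.exp(-x)),  # saturates at 2
}

for name, f in score_maps.items():
    small, large = f(0.05), f(3.0)
    print(f"{name:12s} small-diff score={small:.3f}  large-diff score={large:.3f}")
```

Running the script shows the behavior claimed above: for a large difference the exponential score dominates the linear and logarithmic ones, while the sigmoid never exceeds its saturation value of 2.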
When there is no dynamic target in the image frame, the difference between the corresponding edges of the two triangles should lie within a small interval; to better describe this relationship, a geometric constraint score function $s(m, n)$ is defined as follows:
$$ s(m, n) = \exp\!\left( \frac{\left| d(m, n) - d(m', n') \right|}{\bar{d}(m, n)} \right), $$
where $d(m, n)$ represents the Euclidean distance between the feature points $m$ and $n$, and $\bar{d}(m, n)$ represents the average length of the two corresponding edges, which can be expressed as follows:
$$ \bar{d}(m, n) = \frac{d(m, n) + d(m', n')}{2}. $$
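As a concrete illustration, the edge-length comparison described above can be sketched in a few lines of Python; the function names are our own, and the exponential form follows the score definition in the text:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2-D feature points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def edge_score(m, n, m_, n_):
    """Geometric constraint score for the edge (m, n) and its matched
    edge (m_, n_): exp of the edge-length difference normalized by the
    mean edge length, so identical edge lengths score exactly 1."""
    d = euclidean(m, n)
    d_prime = euclidean(m_, n_)
    d_bar = (d + d_prime) / 2.0
    return math.exp(abs(d - d_prime) / d_bar)

# Static edge: the matched edge has the same length -> score is 1.
print(edge_score((0, 0), (3, 4), (1, 1), (4, 5)))  # → 1.0
# Dynamic endpoint: the matched edge is much longer -> score grows rapidly.
print(edge_score((0, 0), (3, 4), (1, 1), (9, 7)))
```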
If a dynamic target appears in the scene, assuming that the point $C$ is on the dynamic target, then $C'$ on the target image moves to the new position $C''$, thus constituting the new triangle $\triangle A'B'C''$, and the Euclidean distances between the vertices of the triangle are $d(A', B')$, $d(B', C'')$, and $d(A', C'')$. Because the position of the point $C$ changes, the geometric constraint scores calculated with the score function $s$ for the edges containing $C$ will be abnormally large, so the dynamic feature points need to be eliminated. However, each geometric constraint score involves two pairs of feature points, so the real dynamic feature point is usually difficult to determine from a single edge. In light of this, this study proposes a bidirectional feature point scoring strategy for identifying the real dynamic feature points in a scene. The main idea is to define an abnormal score for each feature point: when an abnormality appears on an edge, one point is added to the abnormal scores of both feature points on that edge. In this way, a significant gap arises between the abnormal scores of the real dynamic feature points and those of the static feature points. Moreover, the abnormal score of a feature point indicates how many edges have judged that point to be an abnormal dynamic point. The abnormal score of a feature point is expressed as
$$ S_i = \sum_{j=1,\, j \neq i}^{n} \Delta S_{ij}, $$
where $S_i$ is the abnormal score of the $i$th feature point, and $\Delta S_{ij}$ represents the increment in the abnormal score, as follows:
$$ \Delta S_{ij} = \begin{cases} 1, & s_{ij} - \bar{s} > \beta \bar{s}, \\ 0, & \text{otherwise}, \end{cases} $$
where $\beta$ is the mean score scale factor of the geometric constraint, which controls its strictness. The range of the mean score scale factor $\beta$ is typically set between 0.1 and 0.5, striking a balance between sensitivity to dynamic changes and stability in static scenes. During the tuning process, the final value of $\beta$ was determined to be 0.2. This choice ensured that the geometric constraint scoring responded robustly to dynamic targets in dynamic scenes while remaining relatively stable in static scenes.
Here, $\bar{s}$ denotes the mean geometric score between the pairs of points in an image:
$$ \bar{s} = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \omega_{ij}\, s_{ij}, $$
where $n$ is the number of matching image feature points, $s_{ij}$ denotes the geometric score of the two matching feature points $i$ and $j$, and $\omega_{ij}$ is the geometric error weight factor between the matching feature points, which reduces the influence of a large geometric constraint score on the calculation of the mean score.
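Putting the edge score, the mean score, and the increment rule together, the bidirectional accumulation can be sketched as below. For simplicity, the sketch assumes uniform geometric error weights and an abnormality condition of the edge score exceeding (1 + beta) times the mean score; these simplifications, and all function names, are our own:

```python
import math
from itertools import combinations

def edge_score(p_i, p_j, q_i, q_j):
    """exp of the edge-length difference over the mean edge length."""
    d1 = math.dist(p_i, p_j)
    d2 = math.dist(q_i, q_j)
    return math.exp(abs(d1 - d2) / ((d1 + d2) / 2.0))

def abnormal_scores(pts_query, pts_target, beta=0.2):
    """Bidirectional scoring: every edge whose geometric score exceeds
    (1 + beta) times the mean score adds one point to the abnormal
    score of BOTH of its endpoints."""
    n = len(pts_query)
    scores = {(i, j): edge_score(pts_query[i], pts_query[j],
                                 pts_target[i], pts_target[j])
              for i, j in combinations(range(n), 2)}
    mean_score = sum(scores.values()) / len(scores)
    abnormal = [0] * n
    for (i, j), s in scores.items():
        if s > (1 + beta) * mean_score:
            abnormal[i] += 1
            abnormal[j] += 1
    return abnormal

# Query frame: a static square; target frame: the square translated by
# (1, 1), except the last point, which has moved independently.
query = [(0, 0), (10, 0), (0, 10), (10, 10)]
target = [(1, 1), (11, 1), (1, 11), (30, 30)]
print(abnormal_scores(query, target))  # → [1, 1, 1, 3]
```

In the example, every edge incident to the moving point is flagged, so the dynamic point accumulates an abnormal score of 3 while each static point is flagged only once, illustrating the gap the strategy relies on.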
As shown in Figure 4a, we extracted 500 ORB feature points from a dynamic image. The distribution of their abnormal scores is shown in Figure 4b, where the x-axis represents the feature point index, the y-axis describes the abnormal score value of a specific feature point, and the red line represents the segmentation threshold: when the abnormal score of a feature point is greater than the threshold, the point is judged to be a dynamic feature point. We set the adaptive dynamic segmentation threshold [18] to $\alpha M$, where $M$ is the total number of extracted feature points, and $\alpha$ is set to 80% in Figure 4.
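With the adaptive threshold defined this way, the final filtering step reduces to a single comparison per point; a minimal sketch (the function name and list representation are our own):

```python
def filter_static(abnormal_scores, alpha=0.8):
    """Keep the indices of feature points whose abnormal score does not
    exceed the adaptive threshold alpha * M, where M is the total
    number of extracted feature points."""
    M = len(abnormal_scores)
    threshold = alpha * M
    return [i for i, s in enumerate(abnormal_scores) if s <= threshold]

# Example: 5 feature points, one of which was flagged by all the others.
print(filter_static([1, 0, 2, 1, 5], alpha=0.8))  # → [0, 1, 2, 3]
```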
2.2. Transformation Matrix Estimation
The pose transformation relationship between image frames can be represented by a fundamental matrix. In this study, we propose an algorithm that incorporates intrinsic constraints between samples to guide selective sampling, with the aim of obtaining an improved fundamental matrix. First, the dynamic feature points in the scene are coarsely filtered out. Then, the function for evaluating the similarity of key points in two images is improved and optimized to achieve accurate matching of the key point pairs.
Figure 5 illustrates a conceptual diagram of the transformation matrix estimation.
If there exists a correct pair of matching points $p_i$ and $q_i$, then $d_{ij}$ is the distance from $p_i$ to another point $p_j$ in the first image, $d'_{ij}$ is the distance from $q_i$ to the corresponding point $q_j$ in the second image, and the two distances are similar. We found the relationship between $p_i$ and all of its associated points $p_j$ in the first image and the relationship between $q_i$ and all of its associated points $q_j$ in the second image, then used their similarity to evaluate the correspondence of the two points; thus, the following evaluation function is proposed:
$$ w_i = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} e_{ij}, $$
where $w_i$ is the average distance difference between $p_i$ and $q_i$ over all pairs of associated points, and
$$ e_{ij} = \left| d_{ij} - d'_{ij} \right| $$
is the difference in similarity between $p_i$ and $q_i$ for the pair of associated points $(p_j, q_j)$.
The following is the procedure for estimating the transformation matrix:
1. All values of $w_i$ are calculated.
2. The mean $\bar{w}$ of all $w_i$ is found.
3. Each $w_i$ is judged: if $w_i < \bar{w}$, then $p_i$ and $q_i$ are a correctly similar pair and are retained; otherwise, they are deleted.
4. The filtered point pairs with correct similarity are taken as the initial iterative feature point pairs for the RANSAC algorithm.
5. The point pairs with correct similarity are used as the candidate matching feature set. Four groups are randomly selected to establish equations and solve for the unknowns in the transformation matrix $M$, yielding an estimate of the transformation matrix.
6. The distances between the other feature points and the candidate matching points are calculated by using the transformation model, and a threshold $r$ is set. When the distance is less than this threshold, the feature point is determined to be an inlier; otherwise, it is an outlier.
7. The inliers are used to re-estimate the transformation matrix over $N$ iterations.
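Steps 1–4 above can be sketched as follows; the helper name and point representation are our own, and the subsequent RANSAC estimation stage (steps 5–7) is omitted:

```python
import math

def prefilter_matches(pts_a, pts_b):
    """Steps 1-4: score every match i by the average difference between
    its intra-image distances in the two images (w_i), then keep only
    the matches whose score is below the mean score."""
    n = len(pts_a)
    w = []
    for i in range(n):
        diffs = [abs(math.dist(pts_a[i], pts_a[j]) - math.dist(pts_b[i], pts_b[j]))
                 for j in range(n) if j != i]
        w.append(sum(diffs) / (n - 1))
    w_mean = sum(w) / n
    keep = [i for i in range(n) if w[i] < w_mean]
    return keep, w

# Example: four consistently translated matches plus one outlier (index 4).
pts_a = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
pts_b = [(2, 3), (12, 3), (2, 13), (12, 13), (40, 40)]
keep, w = prefilter_matches(pts_a, pts_b)
print(keep)  # → [0, 1, 2, 3]
```

The surviving pairs would then seed the RANSAC iterations, which is the point of the prefilter: starting RANSAC from a mostly inlier set sharply reduces the number of iterations needed to find a consensus transformation.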