Article

A New Method for Classifying Scenes for Simultaneous Localization and Mapping Using the Boundary Object Function Descriptor on RGB-D Points

by Victor Lomas-Barrie 1,*, Mario Suarez-Espinoza 2, Gerardo Hernandez-Chavez 3 and Antonio Neme 1

1 Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
2 Facultad de Ingeniería, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
3 Facultad de Ciencias, Universidad Nacional Autonoma de Mexico, Mexico City 04510, Mexico
* Author to whom correspondence should be addressed.
Sensors 2023, 23(21), 8836; https://doi.org/10.3390/s23218836
Submission received: 3 October 2023 / Revised: 19 October 2023 / Accepted: 23 October 2023 / Published: 30 October 2023
(This article belongs to the Section Sensing and Imaging)

Abstract

Scene classification in autonomous navigation is a highly complex task due to variations, such as light conditions and dynamic objects, in the inspected scenes; it is also a challenge for small-form-factor computers to run modern and highly demanding algorithms. In this contribution, we introduce a novel method for classifying scenes in simultaneous localization and mapping (SLAM) using the boundary object function (BOF) descriptor on RGB-D points. Our method aims to reduce complexity with almost no performance cost. All the BOF-based descriptors from each object in a scene are combined to define the scene class. Instead of traditional image classification methods such as ORB or SIFT, we use the BOF descriptor to classify scenes. Through an RGB-D camera, we capture points and adjust them onto layers that are perpendicular to the camera plane. From each layer, we extract the boundaries of objects such as furniture, ceilings, walls, or doors. The extracted features compose a bag of visual words classified by a support vector machine. The proposed method achieves almost the same accuracy in scene classification as a SIFT-based algorithm and is 2.38× faster. The experimental results demonstrate the effectiveness of the proposed method in terms of accuracy and robustness for the 7-Scenes and SUN RGB-D datasets.


1. Introduction

Simultaneous localization and mapping (SLAM) is a critical problem in robotics and computer vision, which involves building a map of an unknown environment while simultaneously estimating the robot’s location within the map [1,2,3]. In recent years, RGB-D cameras have emerged as a popular sensing modality for SLAM systems, as they provide both color and depth information of the environment (Figure 1).
Scene classification in SLAM models that rely on the use of RGB-D cameras is a challenging task due to a number of factors [4,5]. Conventional image classification techniques like oriented FAST and rotated BRIEF (ORB) [6] and the scale-invariant feature transform (SIFT) [7] have been employed for scene classification within the SLAM context, utilizing only the RGB channels. Yet, they do not consider depth. To address this problem, we propose a new method for scene classification in SLAM using the boundary object function (BOF) descriptor [8] on RGB-D points.
The BOF descriptor is a powerful technique for feature extraction and classification in computer vision. For each object found in a scene, it encodes the distances from the centroid to the points on the border of the object. The obtained distances are then used as the basis to classify the scene.
From an RGB-D camera, we extract points and fit them into layers that are orthogonal to the camera plane. From each layer, we extract the boundaries of the detected objects, such as furniture, ceilings, walls, doors, etc. The extracted features are then classified using a machine learning method.
In this paper, we propose a new method for scene classification using the BOF descriptor on RGB-D points. Our method takes advantage of the RGB-D information provided by the camera and provides more robust and discriminative features for 3D scenes. We also use the concept of bag of visual words classified by an SVM, which allows us to handle complex scenes with high accuracy. Our experimental results demonstrate the effectiveness of the proposed method in terms of accuracy and robustness in different indoor scenes.
The rest of the paper is organized as follows: Section 2 presents an overview of existing studies and contrasts them with the unique contributions of our research. Section 3 presents the proposed method in detail. Section 4 describes the experimental setup and presents the results. Finally, in Section 5, we present some conclusions and provide some directions for future work.

2. Related Work

Scene classification using RGB-D cameras is an active area of research in robotics and computer vision. In this section, we provide an overview of the related work in this field as well as a panoramic view of the state of the art.
Traditional image classification methods such as ORB or SIFT have been used for scene classification in SLAM systems. These methods rely on 2D image features and may not be sufficient for classifying 3D scenes accurately. In recent years, several methods have been proposed to address this problem [9].
A study of an RGB-D SLAM system for indoor dynamic environments used adaptive semantic segmentation tracking to improve localization accuracy and real-time performance, achieving a 90.57% accuracy increase over ORB-SLAM2 and creating a 3D semantic map for enhanced robot navigation [10].
Also, there is a pressing need to run scene or object detection algorithms on mobile platforms such as robots and autonomous cars, where it is necessary to have lightweight algorithms that consume few computational resources (memory, processing time, and power). This is why algorithms based on that precept recover simple feature extractors, as in [11]; the authors presented the modified R-ratio with the Viola–Jones classification method (MRVJCM) for efficient video retrieval, achieving 98% accuracy by automating image query recognition and optimizing system memory usage.
The BOF descriptor has been widely applied in several contexts. It was introduced in [8], where the descriptors allowed an accurate recognition of assembly pieces, including several shapes such as squares and circles; at the time, the orientation was determined by the shadow that the pieces projected. The images from which the BOF descriptors were obtained were taken from a camera located at the top of an assembly facility, which facilitated the detection of objects. A neural network, fuzzy ARTMAP, conducted the classification stage of the pieces, and the results were highly precise for all combinations. In a more recent work [12], the descriptor was applied in a technique to identify objects from several viewing perspectives. A condensed convolutional neural network model, inspired by LeNet-5, was employed for the classification phase. This approach was implemented on an FPGA.
The BOF consists of a numeric vector used to describe the shape of an object. It differs from local feature extraction descriptors like SIFT, SURF, and ORB in that it describes the shape of an object but not the neighborhood of a feature point.
The steps to obtain a BOF descriptor are as follows:
  • Apply an object segmentation procedure.
  • Detect the contour and centroid of the object.
  • Quantize the contour into n points, where n is the size of the descriptor. With n = 180, tests have shown a good balance between accuracy and computational performance [13].
  • Obtain the distances from the quantized contour to the centroid.
  • Concatenate the distances in counterclockwise order of appearance.
  • Normalize the vector (each component is divided by the maximum component).
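These steps can be condensed into a few lines of code. The sketch below is a minimal illustration using OpenCV and NumPy; the function name, the white-object-on-black-background convention, and the uniform contour sampling are our assumptions, not the reference implementation of [8,13].

```python
import cv2
import numpy as np

def bof_descriptor(binary_image, n=180):
    """Compute a BOF vector for the largest object in a binary image:
    contour -> centroid -> n sampled boundary points -> normalized
    centroid-to-boundary distances."""
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    cnt = max(contours, key=cv2.contourArea)

    # Centroid from the contour moments.
    m = cv2.moments(cnt)
    if m["m00"] == 0:
        return None
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]

    # Quantize the contour into n points, in the order returned by OpenCV.
    pts = cnt.squeeze(1).astype(float)
    idx = np.linspace(0, len(pts) - 1, num=n).astype(int)
    sampled = pts[idx]

    # Distances from the sampled boundary points to the centroid,
    # normalized by the maximum component.
    dists = np.hypot(sampled[:, 0] - cx, sampled[:, 1] - cy)
    return dists / dists.max()
```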
In recent years, the application of neural networks, in particular those with a deep learning architecture, in the field of scene classification has witnessed a significant increase. Heikel and Espinosa-Leal [14] implemented a YOLO-based object detector that produces a descriptor of each image; this descriptor was then put into a Tf-idf representation, and finally the information was classified using random forest. The pipeline is similar to ours, with the difference being that we use a support vector machine for classification and BOF as the descriptor. Another deep learning approach is an autonomous trajectory planning method for robots to clean surfaces using RGB-D semantic segmentation, particularly employing the double attention fusion net (DAFNet), presented in [15]. This technique enhances indoor object segmentation and, through various processes, generates a smooth and continuous trajectory for the robotic arm, proving effective in surface cleaning tasks.
In Ref. [16], the authors combined deep learning and RGB-D sequences to take advantage of all the RGB-D information provided by Kinect. Their efforts included fusing the color and depth information with three techniques, namely, early, mid, and late fusion. A ConvNet-based method was used to extract descriptors due to the capacity of generalization that this type of structure allows. The results were significantly better in indoor scenarios than those obtained by the bag of visual words (BoVW) approach. The main drawback of this ConvNet-based system is linked to the difficulty of its implementation in real-time situations due to its high demand for computing power.
Semantic information is an important feature in interactive robot assistants. In Yuan et al. [17], the authors took advantage of the semantic segmentation provided by panoptic feature pyramid networks. This incorporation allows the system to create a semantic codebook, which divides the words into dynamic and static tokens. The rationale behind this approach is that the static words are more meaningful, whereas the dynamic ones have less value. For example, the word person has a value of zero because people cannot describe a place. Their descriptor is built upon a semantic graph, which also serves to define a similarity function.
Finally, a model that uses residual neural networks to optimize traffic sensor placement and subsequently predict network-wide origin-to-destination flows is presented in [18]. The proposed deep learning model offers high prediction accuracy while relying on fewer sensors, as demonstrated on the Sioux Falls network.

3. Materials and Methods

In this section, we describe the materials and methods used in our proposed method for scene classification in SLAM using the BOF descriptor on RGB-D points.

3.1. Dataset and Platform

We based our experiments on three datasets (Table 1) for the training and testing stages: the Microsoft 7-Scenes [19], SUN RGB-D, and OfficeBot TourPath (OBTP) datasets, adhering to the train–test split as prescribed in the original publication [20]. The three datasets furnish color and depth information about the environment, a crucial requirement for our proposed method.
The results were procured using a Jetson Nano single-board computer (NVIDIA Corporation, Santa Clara, CA, USA) running Ubuntu 18.04.6 LTS. The system specifications include a CPU clocked at 1.479 GHz and 4 GB of RAM.

3.2. BOF Feature Extraction from RGB-D Images

In this method, we use only depth images to extract BOF features by following these steps:
  • The depth image is transformed into a point cloud, which is a set of 3D points representing the position of the objects in space captured by the image.
  • The point cloud is divided into layers. The number of layers is a hyperparameter L that is set before extracting the BOF features. We select an axis determined by a unitary vector v and project the points onto v:
    \mathrm{proj}_v(p) = p \cdot v
    After that, we obtain the minimum min_v and the maximum max_v of these projections and divide the interval [min_v, max_v] into L subintervals, with the scale factor l = (L − 1)/(max_v − min_v). Finally, using the function ⌊x⌉, which rounds a float to the nearest integer, an index I(p) is assigned to each point p by the following equation:
    I(p) = \lfloor\, l\,(\mathrm{proj}_v(p) - \min_v)\,\rceil
    All points contained within a layer are projected onto a plane perpendicular to the roll axis of the camera. In this manner, points are represented in the form (x, y) for further analysis.
  • For each layer obtained in the previous step, a binary image of resolution W × H is generated, consisting of ones in the cells containing at least one point in space and zeros where there is no point. To determine whether the pixel of the new binary image with index (i, j) is 0 or 1, we use an index function I(x, y) that assigns a two-dimensional integer index to each projected point (x, y) in a layer, obtained with the rounding function ⌊x⌉ according to:
    I(x, y) = \big(\lfloor l_x (x - \min_x) \rceil,\; \lfloor l_y (y - \min_y) \rceil\big)
    where min_x and min_y are the minima of the projections onto the canonical axes x and y, l_x = (W − 1)/(max_x − min_x), and l_y = (H − 1)/(max_y − min_y). Once the index I(x, y) is determined, the binary image is constructed following this rule: given a pixel (i, j) of the binary image, if there exists (x_0, y_0) such that I(x_0, y_0) = (i, j), the value of pixel (i, j) is set to 1; otherwise, it is set to 0.
  • The binary image is smoothed to eliminate the gaps caused by the low resolution of the point cloud. Smoothing is achieved using a closure morphological operation.
  • For each binary image, closed contours are found.
  • For each contour, the BOF descriptor is extracted following the steps discussed in Section 2.
  • All extracted BOF descriptors are stacked and associated to the frame.
Figure 2 illustrates the aforementioned process. It is important to note that only the depth image is taken into account, and the RGB image is kept aside. In Figure 2c, the multiple layers display objects highlighted with 1’s. A filter smooths the binary images to minimize noise. In Figure 2d, the boundary object function is extracted solely from objects whose contour comprises at least 1% of the total area.
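The first four steps listed above can be summarized in a short sketch. It assumes a pinhole depth camera with known intrinsics (fx, fy, cx, cy), takes the optical axis as the layering axis v, and uses a 5 × 5 closing kernel; these choices, together with all names, are illustrative assumptions rather than the exact implementation.

```python
import cv2
import numpy as np

def depth_to_layer_images(depth, fx, fy, cx, cy, L=20, W=300, H=300):
    """Back-project a depth image, slice the resulting cloud into L layers
    along the optical axis, and rasterize each layer into a W x H binary
    image, followed by a morphological closing."""
    v, u = np.nonzero(depth > 0)
    z = depth[v, u].astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Layer index I(p) = round(l * (proj_v(p) - min_v)), projecting onto z.
    l = (L - 1) / (z.max() - z.min())
    layer = np.rint(l * (z - z.min())).astype(int)

    # Pixel index I(x, y) inside the W x H binary image of each layer.
    lx = (W - 1) / (x.max() - x.min())
    ly = (H - 1) / (y.max() - y.min())
    i = np.rint(lx * (x - x.min())).astype(int)
    j = np.rint(ly * (y - y.min())).astype(int)

    images = np.zeros((L, H, W), dtype=np.uint8)
    images[layer, j, i] = 255

    # Closing operation to fill gaps caused by the sparse projection.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return [cv2.morphologyEx(img, cv2.MORPH_CLOSE, kernel) for img in images]
```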

3.3. Scene Classification

As a complement to autonomous navigation, scene recognition [22] endows an intelligent system with the ability to localize itself and understand the context of its surroundings. By recognizing the place where it is located, the intelligent system can adapt its actions to achieve its goals, e.g., for the case of a mobile robot, to move from one point to another or to plan based on location-derived information.
For this purpose, a scene recognition system based on traditional methodologies is proposed. This scheme is presented in Figure 3.
For the feature extraction stage, the traditional methodologies include algorithms such as SIFT, SURF, and ORB. In the feature transformation stage, BoVW approaches are commonly applied. For the classification stage, models such as the support vector machine (SVM), random forest, naïve Bayes, or k-nearest neighbors (kNN) are commonly applied.
The contribution of this work involves following the BOF perspective as a feature extraction method. The reason for this is the relatively low computational demand required to obtain this descriptor compared with that of other commonly used local feature extraction schemes, such as the aforementioned SIFT, SURF, and ORB methodologies.
SLAM algorithms need a loop closure mechanism to ensure the correct generation of the map, detecting revisited places in order to add consistency and robustness. When the main sensor of the robot is a camera, this is referred to as appearance-based loop closure detection. According to [23], these mechanisms belong to two categories, namely, offline and online. The former, to which our BoW approach belongs, needs a dictionary or database with previously trained information. Bag of binary words [24] is one of the most important exponents of the offline type. It was used, for example, in ORB-SLAM [25] and has been tested more recently in [26].
Given a training set of BOF descriptors, a codebook needs to be created. The codebook is an array of centroids c_i. To represent a BOF descriptor (B_1, ..., B_n) in terms of words, we calculate the distance of each component B_j to each centroid c_i and select the closest one, forming the vector (c_{i_1}, ..., c_{i_n}). Finally, the number f_i counts the times that the centroid c_i appears in (c_{i_1}, ..., c_{i_n}); the result is a vector (f_1, ..., f_k) of length k, which represents the frequency of each word c_i in the BOF descriptor. The whole process is summarized in the map:
(B_1, \ldots, B_n) \mapsto (f_1, \ldots, f_k)
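One possible reading of this mapping is sketched below: (B_1, ..., B_n) is taken as the stack of 180-dimensional BOF descriptors extracted from one frame, and each of them is assigned to its nearest codebook centroid. The use of scikit-learn's KMeans and the function names are our assumptions; the flat (non-hierarchical) k-means clustering, however, is what Section 3.4 states.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(training_bofs, k=1024):
    # Flat k-means over all training BOF descriptors (shape: (num_bofs, 180)).
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(training_bofs)

def bovw_histogram(frame_bofs, codebook):
    """Map (B_1, ..., B_n) -> (f_1, ..., f_k): assign each descriptor B_j to
    its closest centroid c_i and count how often each centroid appears."""
    words = codebook.predict(np.asarray(frame_bofs))
    return np.bincount(words, minlength=codebook.n_clusters).astype(float)
```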

3.4. Loop-Closing Detection

We followed the method described in [27] to perform loop closing, under two constraints: first, we assume that the point clouds of visited frames are already stored; second, we use a simple bag of words dictionary without a tree structure. In other words, we apply k-means and not hierarchical k-means for its creation in order to keep computational complexity as low as possible.
The BoW descriptor obtained with Equation (4) needs to be expressed in its Tf-idf representation with the following map:
(f_1, \ldots, f_k) \mapsto \frac{1}{\sum_{i=1}^{k} f_i}\,(f_1 w_1, \ldots, f_k w_k)
The vector of weights (w_1, ..., w_k) is obtained in the training phase by:
w_i = \log \frac{|X_{\mathrm{train}}|}{\nu_i + 1}
where |X_train| is the number of BOF descriptors in the training set, and ν_i counts those that contain the word c_i.
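A minimal sketch of this weighting, assuming that ν_i is obtained by counting the training BoW histograms with a nonzero entry for word c_i; the function names are ours.

```python
import numpy as np

def tfidf_weights(train_histograms):
    """w_i = log(|X_train| / (nu_i + 1)), with nu_i the number of training
    representations that contain word c_i."""
    X = np.asarray(train_histograms, dtype=float)   # shape: (|X_train|, k)
    nu = np.count_nonzero(X > 0, axis=0)
    return np.log(len(X) / (nu + 1.0))

def tfidf_vector(f, w):
    """Normalize the word counts by their sum and weight them element-wise."""
    f = np.asarray(f, dtype=float)
    return (f / f.sum()) * w
```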
The distance applied in the whole process is the L_1 norm. The justification for relying on this metric comes from the results reported in [28], where it outperformed normalization. The BoW vectors associated with the frames i and N are compared using the function:
s(i, N) = 1 - \frac{1}{2}\left\| \frac{v_i}{\|v_i\|} - \frac{v_N}{\|v_N\|} \right\|
where N represents the label of the current frame. In order to normalize this function, and given that the object of study is a sequence of images, the following variation is used as a similarity score:
\eta(i, N) = \frac{s(i, N)}{s(N - \gamma, N)}
where γ is an integer offset such that the frame N − γ occurs one second before the current frame N.
If s(N − γ, N) is less than 0.1, the frame is discarded; otherwise, the frame i* that maximizes η(i, N) is inspected. A time consistency check is carried out for this maximum, which consists of replicating these steps for the frames N − T_1, N − T_2, ..., N − T_m and validating that the corresponding maxima i*, i_1*, ..., i_m* are indeed close enough. Two thresholds α+ and α− are selected. If η(i, N) < α−, the frame is discarded. If η(i, N) > α+, the frame is accepted as a loop-closing one. However, if η(i, N) lies in the range (α−, α+), a geometric verification using RANSAC over the point clouds corresponding to the frames i and N is needed.
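Putting the similarity, the normalized score, and the thresholding together, a possible candidate-selection routine looks as follows; the iteration over previous frames, the return convention, and all names are our reading of the text rather than the released code.

```python
import numpy as np

def similarity(v_a, v_b):
    # s = 1 - 0.5 * || v_a/||v_a|| - v_b/||v_b|| ||, using the L1 norm throughout.
    a = v_a / np.linalg.norm(v_a, ord=1)
    b = v_b / np.linalg.norm(v_b, ord=1)
    return 1.0 - 0.5 * np.linalg.norm(a - b, ord=1)

def loop_candidate(tfidf, N, gamma, alpha_minus, alpha_plus, tau=0.1):
    """Candidate selection for frame N. `tfidf` holds one Tf-idf vector per
    frame. Returns None (discarded), ('accept', i*) or ('verify', i*); the
    'verify' case still requires RANSAC over the two point clouds, and an
    accepted candidate is still subject to the time consistency check."""
    if N - gamma <= 0:
        return None
    ref = similarity(tfidf[N - gamma], tfidf[N])
    if ref < tau:                            # normalizer too weak: discard
        return None
    scores = [similarity(tfidf[i], tfidf[N]) / ref for i in range(N - gamma)]
    i_star = int(np.argmax(scores))
    if scores[i_star] < alpha_minus:
        return None
    if scores[i_star] > alpha_plus:
        return ("accept", i_star)
    return ("verify", i_star)
```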

3.5. Experimental Setup

We conducted experiments on a dataset of indoor scenes captured using an RGB-D camera. The dataset contains several scenes with different illumination conditions as well as distinct object configurations. We compared the performance of our proposed method with that of traditional image classification methods such as SIFT and GIST [29].
In the context of scene classification, we trained two models: the first one relies on BOF for the feature extraction stage, whereas the second is based on SIFT. Both models use BoVW and SVM for feature transformation and classification, respectively. For the purpose of this paper, we call the first method BOF-BoVW and the second SIFT-BoVW.
For the experiments, we used the Microsoft 7-Scenes dataset [19], which consists of RGB-D sequences (recordings) in 7 different zones. Each zone has different sequences. The zones are Chess, Fire, Heads, Office, Pumpkin, RedKitchen, and Stairs.
Also, we performed tests using the SUN RGB-D dataset with the same train–test split as in the original publication [20]. The dataset consists of several thousand images distributed over 19 labeled scenes; the split was chosen carefully by the authors in order to avoid sparsity of the frames and allow correct generalization (Figure 4). Originally, this dataset was tested using a GIST descriptor linked to an SVM. Stacking the GIST descriptors applied to RGB and depth improved the results. The best results were achieved with the use of the Places-CNN descriptor and an RBF-SVM.
We were interested in comparing our model using this dataset because it is based on an SVM approach. This provided a direct metric to compare our results with the existing ones.
In order to prove the effectiveness of the scene classification under real conditions, we tested the BOF-BoVW method with our own robot platform, which has an RGB-D camera (RealSense model D455). For the training phase, we recorded 7 scenes in our laboratory: office_1, office_2, laboratory_1, corridor_1, corridor_2, corridor_3, and bathrooms. We recorded the depth and RGB images and collected them to create the OfficeBot TourPath (OBTP) dataset.
For the loop detection experiments, we concentrated on the chess sequences of the Microsoft 7-Scenes dataset. We followed the split for the training and testing sets as described in [30]. For the training set, we created a codebook of 1024 words based on the BOF descriptors extracted from the sequences; for testing, we used the third sequence. Then, we put each word in a Tf-idf representation and compared the similarity of the current frame with the one N frames behind, as stated in Section 3.4. After temporal verification, we fixed the thresholds α+ and α− as in [27] in order to determine whether a loop candidate is approved or discarded.
In the next list, we describe the parameters that modulate the behavior of the algorithm:
  • α + : Upper threshold that allows us to determine if a loop is accepted.
  • α − : Lower threshold that allows us to determine if a loop is discarded.
  • N: If the current keyframe is in position M, then the keyframe M − N is used to calculate the normalization factor η(M, M − N).
  • τ N : The threshold that the normalizer has to exceed in order to be accepted.
  • TC req: Number of keyframes adjacent to the current frame that are required to declare it as valid in the temporary consistency check.
  • TC: Number of keyframes in which the temporary consistency check runs.
  • τ_TC: Threshold that represents the maximum difference allowed between the indices i*, i_1*, ..., i_M* that maximize the normalized scores η of the frames adjacent to the current one.
  • keyframes: The number of frames that are considered in evaluation. It is the result of a homogeneous division of the number of total frames.
The next list contains the values returned as output by the algorithm:
  • Candidates: Number of keyframes that pass the upper α + threshold.
  • Approved: Number of candidates that pass the time consistency check.
  • Discarded: Number of keyframes that stay below the α threshold.
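For clarity, the parameters above can be grouped into a single configuration object. The sketch below is purely illustrative; the field names are ours, not the authors'.

```python
from dataclasses import dataclass

@dataclass
class LoopClosureConfig:
    """Tuning parameters listed above; field names are ours."""
    alpha_plus: float   # upper threshold: accept a loop candidate
    alpha_minus: float  # lower threshold: discard a keyframe
    n_back: int         # N: gap to the keyframe used as the normalizer
    tau_n: float        # minimum value the normalizer has to exceed
    tc_req: int         # adjacent keyframes required to validate a candidate
    tc: int             # number of keyframes the consistency check runs over
    tau_tc: int         # maximum spread allowed among the maximizing indices
    keyframes: int      # frames kept after the homogeneous subsampling
```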

4. Results

4.1. Results for Scene Classification on Microsoft 7-Scenes Datasets

We first evaluated BOF-BoVW and SIFT-BoVW using the hold-out method, with 75% training data and 25% test data, from a single sequence per class.
In the classification stage and using cross-validation, we found that the optimal classifier parameters are C = 3.58 with an RBF kernel for BOF-BoVW and C = 0.01 with a linear kernel for SIFT-BoVW. Figure 5 shows the confusion matrices resulting from the parameters mentioned. Table 2 shows that we observed an accuracy of 99% with our proposed method, almost reaching the accuracy of SIFT-BoVW, which has just one mismatching frame. This scenario has applications for a robot that navigates in the same building.
In the next stage, BOF-BoVW was evaluated using a sequence of frames different from the ones present in the training set as testing data. This scenario is applicable to robots that navigate in unknown buildings. In Figure 6, we show that our method decays to 34% accuracy, with the heads scene being the one with the best performance metrics. It can be observed that the three blocks in the central diagonal of Figure 6a are consistent. Conversely, SIFT-BoVW maintains high accuracy, where the decrease is explained by the unbalanced stairs class; accordingly, the diagonal in Figure 6b only fails in the last square. Table 3 shows an accuracy of 34% for BOF-BoVW and 85% for SIFT-BoVW.

4.2. Results for the SUN RGB-D Dataset

The SUN RGB-D dataset allows testing the generalization capabilities of SVM models. To achieve the best results with the BOF descriptor, we set the number of layers of the point cloud to 20. Each layer produces a binary image of 300 × 300 pixels, from which we obtain the contours. We required the area of each contour to be at least one percent of the total binary image area. With this configuration, 164,972 vectors were obtained, leading to 5285 BOF descriptors, one for each frame of the training set.
The SUN RGB-D dataset is known for presenting several challenges, a fact that is confirmed by the confusion matrices displayed in Figure 7. It can be observed that the matrices are dispersed, and only some squares of the diagonal are colored, indicating the difficulty of achieving low error. Along this line, the furniture store class is the one with the highest F1 score. In Table 4, we show that in some scenes, such as some from the study space class, the SIFT-BoVW model achieves better results, whereas in other classes, such as rest space, the BOF-BoVW model obtains the best results. In terms of expected accuracy, both methods offer similar results.
Originally, in [20], the SUN RGB-D dataset was evaluated with a configuration of the GIST descriptor and an SVM as the classifier. In addition, the color and depth information were included in the evaluation. In Table 5, we compare our implementations with the traditional approaches. Of particular interest is the observation that BOF-BoVW performs better than GIST with either RGB or depth information alone. From this, we conjecture that the use of both color and depth information is needed to improve the GIST performance.
The deep analysis of the performance of our model was based on the impact of the number of BOF descriptors per frame. We varied it from three to twenty in order to examine the changes in the classification metrics.

4.3. Results for Real Usage Conditions

We tested the BOF-BoVW approach with our mobile robot platform; we built our own OBTP dataset (Table 1). For the training phase, we considered seven scenes; a total of 31,000 BOF descriptors were extracted from 1570 depth images. In the testing phase, the robot was launched on a different day with the same illumination conditions, and 920 frames were evaluated. Figure 8 shows two different confusion matrices. We noticed that the corridors were similar scenes in terms of the absence of characteristic objects. Also, the office_2 scene had fewer training frames than the rest. So, in Figure 8b, we restrict our scenes to the most determinant ones, resulting in an improvement in accuracy of up to 86% (Table 6).
In order to check the efficiency and performance of the described method, an ROC curve was generated (Figure 9) on the OBTP dataset. It can be observed that most of the scenes are satisfactorily classified, except for the corridor_1 scene. The main reason for this discrepancy is the significant imbalance in the number of frames in that scene compared to the remaining ones. For the latter scenes, the area under the curve (AUC) is above 0.92.

4.4. Results for Time Performance

The main objective of using BOF over SIFT is to reduce the computational complexity associated with the whole process, which includes memory (hardware) and processing time, in order to enable real-time recognition on single-board computers. To quantify the consumption of computational resources, a comparison is made between the use of BOF and SIFT descriptors.
Our results are presented from two aspects: CPU usage time and a measure that we call “real time”. The CPU time combines user and kernel times and accounts for each core in multi-core processors. The real-time aspect refers to the total elapsed time from the start to the end of the process, not considering individual core times. In multi-core processors, these measurements can differ, especially if processes run in parallel, which can make the real time shorter than the CPU time.
The processes evaluated in Table 7 are
  • Extraction of descriptors from a frame; the average value obtained from 10 runs on the same frame is considered the relevant quantity.
  • Extraction of descriptors from multiple frames, where 1000 frames were processed.
  • Generation of a visual word vocabulary, consisting of 1024 words. For BOF-BoVW, the three-layer case was computed on 34,000 samples, the 20-layer case was computed on 190,000 samples, and the SIFT-BoVW case was computed on 150,000 samples.
  • Further transformation to a BoVW TF-IDF representation using the 1024-word dictionary.
  • Training of the model using pre-defined parameters. SVM was trained using the parameters previously mentioned.
  • Classification: quantification of the classification performance over 1625 samples using the SVM model trained in point 5.
  • Computing the total representation time. This is the sum of the results from points 2 and 4.
  • Computing the total offline phase, defined as the sum of the results from points 3 and 5.
  • Computing the total online phase, which consists of the sum of the results from points 2, 4, and 6.
Table 7. Comparison of time performance results.

Process No. | BOF-BoVW 3 Layers CPU (s) | Real (s) | BOF-BoVW 20 Layers CPU (s) | Real (s) | SIFT-BoVW CPU (s) | Real (s)
1 | 0.31 | 0.30 | 0.46 | 0.39 | 0.32 | 0.26
2 | 294.87 | 295.68 | 392.12 | 353.83 | 486.92 | 292.70
3 | 73.00 | 18.39 | 9710.73 | 2871.72 | 3962.71 | 1354.93
4 | 107.59 | 27.93 | 148.17 | 38.52 | 533.62 | 137.86
5 | 93.78 | 94.37 | 100.78 | 100.69 | 16.23 | 16.33
6 | 27.81 | 27.97 | 31.17 | 31.36 | 6.20 | 6.20
7 | 402.46 | 323.62 | 540.30 | 392.35 | 1020.54 | 430.55
8 | 166.78 | 112.76 | 9811.51 | 2972.42 | 3978.94 | 1371.25
9 | 430.27 | 351.59 | 571.47 | 423.71 | 1026.74 | 436.75
In order to better understand the comparison of BOF-BoVW and SIFT-BoVW, we present the percentage increases for the listed cases in Table 8. Increases are computed using the equation I = ((V_f − V_o) / V_o) × 100, where I is the percentage increase, V_f the final value, and V_o the initial value. Increase B-S 3 means the percentage increase using BOF-BoVW with 3 layers as the initial value and SIFT-BoVW as the final value. The same reasoning is followed for Increase B-S 20, but relying on BOF-BoVW with 20 layers as the initial value.
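As a quick worked check of this formula (values taken from Table 7, process 9, CPU time):

```python
def pct_increase(v_o, v_f):
    """I = ((V_f - V_o) / V_o) * 100, as defined above."""
    return (v_f - v_o) / v_o * 100.0

# Total online phase (process 9, CPU time): BOF-BoVW with 3 layers as the
# initial value and SIFT-BoVW as the final value.
print(round(pct_increase(430.27, 1026.74)))  # -> 139, as reported in Table 8
```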
In terms of memory usage, the results for the sequence 01 train split of the Microsoft 7-Scenes dataset are shown in Table 9. The most relevant result can be observed in the first row, where SIFT descriptors need 1.9 GB. However, BOF descriptors with three layers need 49.4 MB, which translates into an increase of 3746% in the storage needed. Using our heaviest 20-layer BOF representation leads to an increase of 593%. Maintaining descriptors over time is important if an implementation in a SLAM system is sought, due to the importance of reusing information from previously visited frames in order to speed up tasks such as loop closure detection. We observed that the BoW TF-IDF representations of both descriptors are almost identical in size, which can be explained by the fact that the model mainly depends on the codebook and the number of words in it. The other files that need to be stored are the codebook and the trained model, and these remain on the megabyte scale in both cases.

4.5. Results of Loop Detection

In Table 10, we display the results of the loop closure implementation. If we modify the parameters corresponding to the temporary consistency check (τ_TC, TC req, TC), the approved rate is doubled, as shown in Figure 10b, which is contrasted with what is displayed in Figure 10a. The change in the thresholds α+ and α− does not have a significant impact on the discarded rate, and just seven more loops are approved in Figure 10b,c.
Finally, we can also augment the gap between keyframes, which leads to a gain in processing speed, at the cost of reduced resolution. The lack of candidates and approved frames in Figure 10e is explained by the fact that we set the parameter keyframe to every two normal frames instead of one, and we did not adjust the remaining parameters to stay proportional with this new distribution of keyframes. This is displayed in Figure 10f. Despite having a lower value for the keyframes parameter, we achieved similar rates of approval and discard by means of tuning the relevant parameters.
The manner in which we implemented the loop-closing detection procedure derives from having a bag of visual words representation available from the scene classification phase. However, the fern approach in [30] seems to be adaptable to our descriptor in the following way: each BOF descriptor has 180 entries, so we can set 180 uniformly sampled thresholds τ_i and create a new binary vector, which contains a one if the corresponding BOF entry passes its threshold and a zero otherwise.
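A minimal sketch of this fern-style binarization, under the assumption of one uniformly sampled threshold per BOF entry; this is our illustration of the idea, not an implementation from [30].

```python
import numpy as np

def binarize_bof(bof, thresholds=None):
    """Fern-style binarization of a 180-entry BOF descriptor: one uniformly
    sampled threshold per entry, 1 where the entry exceeds its threshold."""
    bof = np.asarray(bof, dtype=float)
    if thresholds is None:
        # BOF entries are normalized by their maximum, so they lie in (0, 1].
        thresholds = np.linspace(0.0, 1.0, num=len(bof), endpoint=False)
    return (bof > thresholds).astype(np.uint8)
```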
In order to merge the results obtained in both parts, the classification and loop detection stages, a dataset needs to meet two requirements: to be divided into scenes and to contain a path that passes through those scenes. In this way, a semantic verification step immediately before the time consistency stage can be implemented in order to use this semantic information.

5. Conclusions

Scene recognition and classification are open problems in the robotics, vision, and pattern recognition fields. In this paper, we described a novel method able to cope with complex scenes while keeping computational complexity low. The recognition and classification model we developed achieves performance comparable to that of other relevant, more demanding models, with a significantly lower computing demand.
The main purpose of the BOF descriptor is to be lightweight, that is, to reduce computational complexity in both space (memory use and hardware resources) and processing time. Using a relatively shallow configuration of only three layers, the online processes (descriptor extraction, BoW representation, and classification) took 596 s less than with SIFT and were 2.38× faster, which is an important result given the calculations that the onboard machine of the robot must complete. Furthermore, the offline processes (codebook generation and model training) are also more than 20 times faster in CPU time with the three-layer configuration. This opens the possibility of considering the implementation of an onboard training phase to adjust the models trained offline.
The best scene recognition results were achieved with a configuration of 20 layers per frame. The results are comparable to those obtained with SIFT-based models, at least on the two datasets we considered here. Also, we implemented an efficient loop-closing module. Furthermore, our method was able to rely on semantic information derived from the scenes. A particularly relevant next step in our research is the implementation of this module in a lightweight semantic SLAM system.
We presented the results of our approach in several tables and figures in Section 4, and they are comparable to those obtained by more popular methods. At the same time, the significantly lower computational demand of our approach was demonstrated in the corresponding analyses. We consider this latter attribute to be one of the main contributions of our work.
An additional advantage of our method is that the number of descriptors and their size take up less space in the CPU’s RAM. While SIFT-BoVW uses 1.9 GB, BOF-BoVW (20 layers) requires only 274 MB (Table 9). On some small-form-factor computers, it would be challenging to load the operating system and run the algorithm with SIFT; however, using the BOF descriptor for scene classification overcomes this issue. Remember that the longer the autonomous navigation journey, the more descriptors are needed for both SIFT and BOF.

Future Work

A natural follow-up experiment involves testing the entire SLAM algorithm on the two datasets described in this paper. Moreover, our model can be embedded in a robot with omnidirectional wheels to confirm that the point cloud capture remains unaffected by potential camera warping. Given the robot’s primarily smooth horizontal movement and the camera’s fixed position, the point cloud is anticipated to maintain a consistent distance from the floor to the sensor without any tilt.
Currently, classification methods using deep learning are very competitive tools with extensive generalization capabilities. Therefore, we will seek to move away from classification using SVM and opt for a deep learning model that classifies the BOFs of each layer of each frame of each scene. Unlike the images usually classified with these algorithms, in this method, the input vectors are made up of 180 values. This enables reductions in the number of inputs in convolutional networks and in the number of parameters.
As a possible extension of our work, a different alternative is to use descriptors other than BOF in order to take into account the placement and sequence of each point in the depth matrix. This aims to bypass the projection of points onto the layers.

Author Contributions

Conceptualization, V.L.-B., M.S.-E. and G.H.-C.; methodology, V.L.-B., M.S.-E. and G.H.-C.; software, M.S.-E. and G.H.-C.; validation, V.L.-B., M.S.-E., G.H.-C. and A.N.; formal analysis, V.L.-B. and A.N.; investigation, V.L.-B., M.S.-E., G.H.-C. and A.N.; resources, V.L.-B.; data curation, M.S.-E., G.H.-C. and A.N.; writing—original draft preparation, V.L.-B., M.S.-E. and G.H.-C.; writing—review and editing, V.L.-B., M.S.-E., G.H.-C. and A.N.; visualization, M.S.-E. and G.H.-C.; supervision, V.L.-B. and A.N.; project administration, V.L.-B.; funding acquisition, V.L.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by PAPIIT and PAPIME, DGAPA, UNAM under grant numbers TA100721, TA101523, TA101323, and PE111223. The APC was funded by PAPIIT, DGAPA, UNAM grant number TA100721.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and pretrained models can be found at https://github.com/victorlomas/public (accessed on 2 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BOF     Boundary object function
BoVW    Bag of visual words
ORB     Oriented FAST and rotated BRIEF
RBF     Radial basis function
SIFT    Scale-invariant feature transform
SLAM    Simultaneous localization and mapping
SURF    Sped-up robust features
SVM     Support vector machine
Tf-idf  Term frequency–inverse document frequency

References

  1. Gupta, A.; Fernando, X. Simultaneous Localization and Mapping (SLAM) and Data Fusion in Unmanned Aerial Vehicles: Recent Advances and Challenges. Drones 2022, 6, 85. [Google Scholar] [CrossRef]
  2. Xu, X.; Zhang, L.; Yang, J.; Cao, C.; Wang, W.; Ran, Y.; Tan, Z.; Luo, M. A Review of Multi-Sensor Fusion SLAM Systems Based on 3D LIDAR. Remote Sens. 2022, 14, 2835. [Google Scholar] [CrossRef]
  3. Barros, A.; Michel, M.; Moline, Y.; Corre, G.; Carrel, F. A Comprehensive Survey of Visual SLAM Algorithms. Robotics 2022, 11, 24. [Google Scholar] [CrossRef]
  4. Eldemiry, A.; Zou, Y.; Li, Y.; Wen, C.Y.; Chen, W. Autonomous Exploration of Unknown Indoor Environments for High-Quality Mapping Using Feature-Based RGB-D SLAM. Sensors 2022, 22, 5117. [Google Scholar] [CrossRef] [PubMed]
  5. Lu, Q.; Pan, Y.; Hu, L.; He, J. A Method for Reconstructing Background from RGB-D SLAM in Indoor Dynamic Environments. Sensors 2023, 23, 3529. [Google Scholar] [CrossRef]
  6. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  7. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  8. Peña-Cabrera, M.; Lopez-Juarez, I.; Rios-Cabrera, R.; Corona-Castuera, J. Machine vision approach for robotic assembly. Assem. Autom. 2005, 25, 204–216. [Google Scholar] [CrossRef]
  9. Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image Matching from Handcrafted to Deep Features: A Survey. Int. J. Comput. Vis. 2021, 129, 23–79. [Google Scholar] [CrossRef]
  10. Wei, S.; Li, Z. An RGB-D SLAM algorithm based on adaptive semantic segmentation in dynamic environment. J. Real-Time Image Process. 2023, 20, 85. [Google Scholar] [CrossRef]
  11. Sathiyaprasad, B. Ontology-based video retrieval using modified classification technique by learning in smart surveillance applications. Int. J. Cogn. Comput. Eng. 2023, 4, 55–64. [Google Scholar] [CrossRef]
  12. Lomas-Barrie, V.; Silva-Flores, R.; Neme, A.; Pena-Cabrera, M. A Multiview Recognition Method of Predefined Objects for Robot Assembly Using Deep Learning and Its Implementation on an FPGA. Electronics 2022, 11, 696. [Google Scholar] [CrossRef]
  13. Lomas-Barrie, V.; Pena-Cabrera, M.; Lopez-Juarez, I.; Navarro-Gonzalez, J.L. Fuzzy ARTMAP-Based Fast Object Recognition for Robots Using FPGA. Electronics 2021, 10, 361. [Google Scholar] [CrossRef]
  14. Heikel, E.; Espinosa-Leal, L. Indoor Scene Recognition via Object Detection and TF-IDF. J. Imaging 2022, 8, 209. [Google Scholar] [CrossRef] [PubMed]
  15. Qi, L.; Gan, Z.; Hua, Z.; Du, D.; Jiang, W.; Sun, Y. Cleaning of object surfaces based on deep learning: A method for generating manipulator trajectories using RGB-D semantic segmentation. Neural Comput. Appl. 2023, 35, 8677–8692. [Google Scholar] [CrossRef]
  16. Xu, G.; Li, X.; Zhang, X.; Xing, G.; Pan, F. Loop Closure Detection in RGB-D SLAM by Utilizing Siamese ConvNet Features. Appl. Sci. 2021, 12, 62. [Google Scholar] [CrossRef]
  17. Yuan, Z.; Xu, K.; Zhou, X.; Deng, B.; Ma, Y. SVG-Loop: Semantic–Visual–Geometric Information-Based Loop Closure Detection. Remote Sens. 2021, 13, 3520. [Google Scholar] [CrossRef]
  18. Alshehri, A.; Owais, M.; Gyani, J.; Aljarbou, M.H.; Alsulamy, S. Residual Neural Networks for Origin-Destination Trip Matrix Estimation from Traffic Sensor Information. Sustainability 2023, 15, 9881. [Google Scholar] [CrossRef]
  19. Shotton, J.; Glocker, B.; Zach, C.; Izadi, S.; Criminisi, A.; Fitzgibbon, A. Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2930–2937. [Google Scholar] [CrossRef]
  20. Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar] [CrossRef]
  21. Tychola, K.A.; Tsimperidis, I.; Papakostas, G.A. On 3D Reconstruction Using RGB-D Cameras. Digital 2022, 2, 401–421. [Google Scholar] [CrossRef]
  22. Xie, L.; Lee, F.; Liu, L.; Kotani, K.; Chen, Q. Scene recognition: A comprehensive survey. Pattern Recognit. 2020, 102, 107205. [Google Scholar] [CrossRef]
  23. Garcia-Fidalgo, E.; Ortiz, A. iBoW-LCD: An Appearance-Based Loop-Closure Detection Approach Using Incremental Bags of Binary Words. IEEE Robot. Autom. Lett. 2018, 3, 3051–3057. [Google Scholar] [CrossRef]
  24. Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
  25. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  26. Gaspar, A.R.; Nunes, A.; Matos, A. Evaluation of Bags of Binary Words for Place Recognition in Challenging Scenarios. In Proceedings of the IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Santa Maria da Feira, Portugal, 28–29 April 2021; pp. 19–24. [Google Scholar] [CrossRef]
  27. Cadena, C.; Galvez-López, D.; Tardos, J.D.; Neira, J. Robust Place Recognition With Stereo Sequences. IEEE Trans. Robot. 2012, 28, 871–885. [Google Scholar] [CrossRef]
  28. Muja, M.; Lowe, D.G. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, 5–8 February 2009; pp. 331–340. [Google Scholar] [CrossRef]
  29. Oliva, A. CHAPTER 41—Gist of the Scene. In Neurobiology of Attention; Itti, L., Rees, G., Tsotsos, J.K., Eds.; Academic Press: Burlington, NJ, USA, 2005; pp. 251–256. [Google Scholar] [CrossRef]
  30. Glocker, B.; Izadi, S.; Shotton, J.; Criminisi, A. Real-time RGB-D camera relocalization. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Adelaide, Australia, 1–4 October 2013; pp. 173–179. [Google Scholar] [CrossRef]
Figure 1. Number of research articles by year with the keywords: “RGB-D AND SLAM” in Scopus and IEEE Xplore from 2011 until August 2023.
Figure 2. BOF feature extraction process. (a) RGB image, (b) depth image, (c) binary images representing various layers at different depths, and (d) object BOF (in red) and centroid (in blue) from several layers, (e) BOF descriptors per layer, and (f) BOF descriptors stacked and associated to the frame. Contours with at least 1% of the total area are indicated by green highlighting.
Figure 3. General scheme for image classification.
Figure 4. Train and test number of frames per scene for the SUN RGB-D dataset. The scene labels, from left to right, are study space, rest space, office, living room, library, lecture theatre, lab, kitchen, home office, furniture store, discussion area, dining room, dining area, corridor, conference room, computer room, classroom, bedroom, and bathroom. (a) Train split with 5285 frames. (b) Test split with 5550 frames.
Figure 5. Confusion matrix for BOF-BoVW (a) and SIFT-BoVW (b) using hold-out method with 25% test data.
Figure 6. Confusion matrix for BOF-BoVW (a) and SIFT-BoVW (b) using a different sequence as test data.
Figure 7. Confusion matrices of the test split for the SUN RGB-D dataset.
Figure 8. Confusion matrices of the test split on the OBTP dataset.
Figure 9. Receiver Operating Characteristic curve on the OBTP-DS.
Figure 10. Approved loop-closing candidates are shown in red for chess sequence 03. Each image corresponds to one row in Table 10, from left to right.
Table 1. Datasets used.

Characteristic | Microsoft 7-Scenes | SUN RGB-D | OBTP
Year | 2013 | 2015 | 2023
Camera | Kinect RGB-D | Intel RealSense, Asus Xtion, and Kinect v1/2 [21] | RealSense D455
Sensor type | Infrared camera and IR projector | Structured light and TOF | Active IR stereo
Depth resolution | 640 × 480 | 628 × 468, 640 × 480, 640 × 480, and 512 × 424 | 640 × 480
Color resolution | 640 × 480 | 1920 × 1080, 640 × 480, 640 × 480, and 1920 × 1080 | 640 × 480
Number of scenes | 7 | 19 | 7
Number of images per scene | 500 to 1000 | 80 to 600 | 200 to 800
Frame file formatting | Color (PNG), depth (PNG) image, and pose (txt) | RGB-D, depth, and segmentation maps | Color (PNG) and depth (PNG) image
Table 2. Classification results for BOF-BoVW and SIFT-BoVW using hold-out method with 25% test data.

(a) BOF-BoVW
Scene | Precision | Recall | F1 Score | Support
chess | 0.99 | 0.98 | 0.99 | 263
fire | 0.99 | 1.00 | 0.99 | 239
heads | 0.98 | 1.00 | 0.99 | 247
office | 0.99 | 0.98 | 0.99 | 262
pumpkin | 1.00 | 0.99 | 0.99 | 249
redkitchen | 1.00 | 0.99 | 0.99 | 245
stairs | 1.00 | 1.00 | 1.00 | 120
accuracy |  |  | 0.99 | 1625
macro avg | 0.99 | 0.99 | 0.99 | 1625
weighted avg | 0.99 | 0.99 | 0.99 | 1625

(b) SIFT-BoVW
Scene | Precision | Recall | F1 Score | Support
chess | 1.00 | 1.00 | 1.00 | 263
fire | 1.00 | 1.00 | 1.00 | 239
heads | 1.00 | 1.00 | 1.00 | 247
office | 1.00 | 1.00 | 1.00 | 262
pumpkin | 1.00 | 1.00 | 1.00 | 249
redkitchen | 1.00 | 1.00 | 1.00 | 245
stairs | 1.00 | 1.00 | 1.00 | 120
accuracy |  |  | 1.00 | 1625
macro avg | 1.00 | 1.00 | 1.00 | 1625
weighted avg | 1.00 | 1.00 | 1.00 | 1625
Table 3. Classification report for BOF-BoVW and SIFT-BoVW, using a different sequence as test data. Values rounded to 2 decimal places.

(a) BOF-BoVW
Scene | Precision | Recall | F1 Score | Support
chess | 0.27 | 0.38 | 0.32 | 1000
fire | 0.36 | 0.29 | 0.32 | 1000
heads | 0.46 | 0.61 | 0.52 | 1000
office | 0.28 | 0.45 | 0.34 | 1000
pumpkin | 0.56 | 0.37 | 0.45 | 1000
redkitchen | 0.15 | 0.09 | 0.11 | 1000
stairs | 0.05 | 0.01 | 0.01 | 500
accuracy |  |  | 0.34 | 6500
macro avg | 0.30 | 0.31 | 0.30 | 6500
weighted avg | 0.32 | 0.34 | 0.32 | 6500

(b) SIFT-BoVW
Scene | Precision | Recall | F1 Score | Support
chess | 0.95 | 0.87 | 0.91 | 1000
fire | 0.94 | 0.99 | 0.97 | 1000
heads | 0.59 | 0.93 | 0.72 | 1000
office | 0.86 | 0.92 | 0.89 | 1000
pumpkin | 0.99 | 0.98 | 0.99 | 1000
redkitchen | 0.94 | 0.84 | 0.89 | 1000
stairs | 0.09 | 0.00 | 0.00 | 500
accuracy |  |  | 0.85 | 6500
macro avg | 0.77 | 0.79 | 0.77 | 6500
weighted avg | 0.82 | 0.85 | 0.82 | 6500
Table 4. Classification report for the SUN RGB-D dataset rounded to two decimals. The test split contains 5050 frames.

(a) BOF-BoVW
Scene | Precision | Recall | F1 Score | Support
study space | 0.00 | 0.00 | 0.00 | 127
rest space | 0.16 | 0.32 | 0.22 | 533
office | 0.14 | 0.25 | 0.18 | 540
living room | 0.14 | 0.07 | 0.10 | 255
library | 0.35 | 0.03 | 0.05 | 221
lecture theatre | 0.00 | 0.00 | 0.00 | 43
lab | 0.00 | 0.00 | 0.00 | 223
kitchen | 0.22 | 0.08 | 0.11 | 276
home office | 0.00 | 0.00 | 0.00 | 128
furniture store | 0.31 | 0.55 | 0.39 | 380
discussion area | 0.00 | 0.00 | 0.00 | 117
dining room | 0.00 | 0.00 | 0.00 | 96
dining area | 0.15 | 0.04 | 0.06 | 237
corridor | 0.29 | 0.16 | 0.20 | 196
conference room | 0.00 | 0.00 | 0.00 | 207
computer room | 0.29 | 0.03 | 0.05 | 67
classroom | 0.19 | 0.25 | 0.21 | 520
bedroom | 0.24 | 0.40 | 0.30 | 578
bathroom | 0.30 | 0.23 | 0.26 | 306
accuracy |  |  | 0.21 | 5050
macro avg | 0.15 | 0.13 | 0.11 | 5050
weighted avg | 0.18 | 0.21 | 0.17 | 5050

(b) SIFT-BoVW
Scene | Precision | Recall | F1 Score | Support
study space | 0.02 | 0.01 | 0.01 | 127
rest space | 0.15 | 0.20 | 0.17 | 533
office | 0.22 | 0.34 | 0.27 | 540
living room | 0.10 | 0.11 | 0.10 | 255
library | 0.14 | 0.08 | 0.10 | 221
lecture theatre | 0.02 | 0.05 | 0.03 | 43
lab | 0.09 | 0.00 | 0.01 | 223
kitchen | 0.14 | 0.15 | 0.15 | 276
home office | 0.11 | 0.12 | 0.12 | 128
furniture store | 0.36 | 0.54 | 0.43 | 380
discussion area | 0.02 | 0.02 | 0.02 | 117
dining room | 0.07 | 0.07 | 0.07 | 96
dining area | 0.11 | 0.08 | 0.09 | 237
corridor | 0.10 | 0.10 | 0.10 | 196
conference room | 0.18 | 0.08 | 0.11 | 207
computer room | 0.10 | 0.15 | 0.12 | 67
classroom | 0.33 | 0.23 | 0.27 | 520
bedroom | 0.29 | 0.26 | 0.27 | 578
bathroom | 0.36 | 0.35 | 0.36 | 306
accuracy |  |  | 0.21 | 5050
macro avg | 0.15 | 0.15 | 0.15 | 5050
weighted avg | 0.20 | 0.21 | 0.20 | 5050
Table 5. Accuracy comparison of descriptors tested on the SUN RGB-D dataset. In this case, the values are truncated. The GIST results were extracted from [20].

 | BOF-BoVW | SIFT-BoVW | GIST RGB | GIST DEPTH | GIST RGB + DEPTH
Accuracy (%) | 20.53 | 20.87 | 19.7 | 20.1 | 23
Table 6. Classification report for the OBTP dataset rounded to two decimals.

(a) 7 scenes’ classification
Scene | Precision | Recall | F1 Score | Support
office_1 | 0.89 | 0.89 | 0.89 | 135
office_2 | 0.95 | 0.53 | 0.68 | 135
laboratory | 0.63 | 0.74 | 0.68 | 135
corridor_1 | 0.53 | 0.35 | 0.43 | 110
corridor_2 | 0.55 | 0.77 | 0.64 | 135
corridor_3 | 0.70 | 0.77 | 0.73 | 135
bathrooms | 0.66 | 0.70 | 0.68 | 135
accuracy | 0.69 | 0.69 | 0.69 | 920
macro avg | 0.70 | 0.68 | 0.68 | 920
weighted avg | 0.71 | 0.69 | 0.68 | 920

(b) 4 scenes’ classification
Scene | Precision | Recall | F1 Score | Support
office | 0.91 | 0.87 | 0.89 | 165
laboratory | 0.81 | 0.80 | 0.80 | 165
corridor | 0.94 | 0.98 | 0.96 | 165
bathrooms | 0.78 | 0.79 | 0.79 | 165
accuracy | 0.86 | 0.86 | 0.86 | 660
macro avg | 0.86 | 0.86 | 0.86 | 660
weighted avg | 0.86 | 0.86 | 0.86 | 660
Table 8. Percent increases in time consumption.

Process No. | Increase B-S 3 CPU (%) | Real (%) | Increase B-S 20 CPU (%) | Real (%)
1 | 1 | −13 | −31 | −33
2 | 65 | −1 | 24 | −17
3 | 5328 | 7266 | −59 | −53
4 | 396 | 394 | 260 | 258
5 | −83 | −83 | −84 | −84
6 | −78 | −78 | −80 | −80
7 | 154 | 33 | 89 | 10
8 | 2286 | 1116 | −59 | −54
9 | 139 | 24 | 80 | 3
Table 9. Comparison of storage usage.

File | BOF-BoVW 3 Layers | BOF-BoVW 20 Layers | SIFT-BoVW
Raw descriptors | 49.4 MB | 274 MB | 1.9 GB
BoVW TF-IDF representation | 19.9 MB | 20 MB | 20 MB
Codebook | 1.47 MB | 1.47 MB | 524 KB
Trained model | 32.9 MB | 35.4 MB | 15.8 MB
Table 10. Loop closure detection results for the chess sequence 03.
α + α N τ N TC ReqTC τ TC Key FramesCandidatesApprovedDiscarded
0.60.15 31 0.15316010003470
0.60.15 30 0.131560100035160
0.60.15 15 0.05315601000701917
0.50.3 15 0.05315601000732617
0.50.3 15 0.0531560500819
0.50.3 2 0.051320200241011
