Article

Intensity/Inertial Integration-Aided Feature Tracking on Event Cameras

1 College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China
2 Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou 310027, China
3 National Time Service Center, Chinese Academy of Sciences, Xi’an 710600, China
* Author to whom correspondence should be addressed.
Remote Sens. 2022, 14(8), 1773; https://doi.org/10.3390/rs14081773
Submission received: 29 January 2022 / Revised: 2 April 2022 / Accepted: 3 April 2022 / Published: 7 April 2022

Abstract: Achieving efficient and accurate feature tracking on event cameras is a fundamental step for practical high-level applications, such as simultaneous localization and mapping (SLAM), structure from motion (SfM), and visual odometry (VO) in GNSS (Global Navigation Satellite System)-denied environments. Although many asynchronous tracking methods purely using event flow have been proposed, they suffer from high computation demand and drift problems. In this paper, event information is still processed in the form of synthetic event frames to better adapt to practical demands. Weighted fusion of multiple hypothesis testing with batch processing (WF-MHT-BP) is proposed based on loose integration of event, intensity, and inertial information. More specifically, with inertial information acting as priors, multiple hypothesis testing with batch processing (MHT-BP) produces coarse feature-tracking solutions on event frames in a batch-processing way. With a time-related stochastic model, a weighted fusion mechanism fuses feature-tracking solutions from event and intensity frames. Compared with other state-of-the-art feature-tracking methods on event cameras, evaluation on public datasets shows significant improvements in accuracy and efficiency and comparable performance in terms of feature-tracking length.

Graphical Abstract

1. Introduction

Event cameras, as a kind of bio-inspired sensor, trigger events on each pixel independently and asynchronously according to changes in scene brightness. Compared with standard cameras, they output event flow, which is formed by the coordinates on the image plane and the time at which an event happens, namely x, y, and t (at the µs or ns level). Another characteristic of event flow is polarity, which indicates whether brightness increased or decreased. Whether polarity information is provided depends on the manufacturer. For example, the DAVIS 240C from Inivation provides polarity information, while the IMX636 from Sony/Prophesee does not.
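For illustration, a single event can be represented as a compact record; the following minimal sketch uses our own field names rather than any specific camera driver's API.

```python
from typing import NamedTuple

class Event(NamedTuple):
    """One asynchronous event from an event camera."""
    x: int         # pixel column on the image plane
    y: int         # pixel row on the image plane
    t: float       # timestamp in seconds (sensor resolution is at the µs/ns level)
    polarity: int  # +1 brightness increase, -1 decrease; absent on some sensors

# Example: an ON event at pixel (120, 64) triggered at t = 0.0153 s
e = Event(x=120, y=64, t=0.0153, polarity=+1)
```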
Event cameras have the characteristics of low latency, high dynamic range, and low power consumption. Because their characteristics differ from those of traditional cameras, event cameras open up a new paradigm for a series of tasks such as VO (Visual Odometry), SLAM (Simultaneous Localization and Mapping), and SfM (Structure from Motion). Feature tracking on event cameras is a fundamental step toward the maturity of these practical applications and has attracted the interest of a wide range of researchers [1,2].
Although many asynchronous feature-tracking methods have been proposed, template matching on event frames or event patches is still a major way to process event information, especially for high-level tasks [3]. By accumulating a certain number of events or calculating the significance of the incoming event, the contours of objects in the scene are formed in an image-like frame, which is closely related to the movement of the carrier.
Practical feature tracking on event frames still faces efficiency and accuracy challenges. Efficiency is related to high event rates, which depend on carrier motion, scene content (e.g., dynamic objects), texture, etc. Event rates may vary significantly, and thus the frequency of event frames may be very high, increasing the computational difficulty of keeping a sufficient number of features tracked.
Event flow is sparse, asynchronous, noisy, and only represents brightness changes; thus, event-to-frame transformation has the problem of low signal-to-noise ratio and “texture” loss. Therefore, feature tracking purely relying on event flow may have drift problems, deteriorating the accuracy and robustness of high-level tasks.
Purely relying on event flow information will lead to one problem: when the carrier is moving slowly, the time interval between event frames may be larger than that between images. The frequency of positioning or mapping solutions would not meet the requirements of users. For example, when a drone is performing autonomous exploration slowly in an unknown environment, timely positioning results are still needed for motion and path planning. Therefore, intensity images and inertial information can still provide timely updates as the basic support for high-level tasks.
From the perspective of bionics, research on how animals process external information shows that different parts of the brain handle different senses with different attention, and these jointly support action and judgment. Visual information is no exception; it is highly related to events, since events are only triggered when brightness changes. Both global views of the scene and inertial information are still sensed and processed by the brain latently. Therefore, the fusion of event, intensity, and inertial information has support from bionic research [4,5].
Nowadays, several event cameras, such as the DAVIS 346 and CeleX-4, provide normal intensity images as well as angular velocity and acceleration from an embedded IMU (Inertial Measurement Unit), supporting the feasibility of using complementary information for feature tracking.
The advantages of multi-sensor fusion bring the potential to overcome the accuracy and efficiency problems of purely event-information-based methods. Therefore, a new feature-tracking method on event frames is proposed in this paper. Its novelty can be summarized as follows:
  • An event-frame-based feature tracker using multiple hypothesis testing with batch processing (MHT-BP) is proposed to provide initial tracking solutions. In MHT-BP, a four-parameter affine transformation is proposed to improve the motion coverage of template matching, and batch processing is proposed to improve tracking efficiency.
  • Together with inertial-information-based prediction, a time-related stochastic model and a constant-velocity model are proposed to loosely integrate the tracking solutions from intensity images with the initial tracking solutions from event frames, which improves tracking accuracy.
  • A comparison with other state-of-the-art methods is conducted on publicly available event-camera datasets in terms of tracking efficiency, accuracy, and length. The results show that the proposed method achieved significantly higher accuracy and efficiency and comparable feature-tracking lengths.
The rest of the paper is organized as follows: Section 2 reviews feature-detection and feature-tracking methods on event cameras. Section 3 first illustrates the data stream that the proposed method deals with and gives a brief description of WF-MHT-BP; Section 3.1 then presents inertial-based rotation prediction, which acts as the prior for the subsequent feature-tracking steps. The method used to generate event frames is introduced in Section 3.2. After that, MHT-BP, the tracking method purely relying on event frames, is introduced in Section 3.3. Section 3.4 presents the weighted integration of feature-tracking solutions from event frames and intensity frames. In Section 4, WF-MHT-BP is compared with two methods implemented in MATLAB and with EKLT (Event-based Lucas–Kanade Tracker), implemented in C++, in terms of accuracy and efficiency, and with EKLT in terms of feature-tracking length. Section 5 concludes the paper and gives directions for future work.

2. Related Work

Feature tracking is an active research field, where a number of algorithms have been proposed. Traditional feature tracking on intensity images can be divided into feature-matching-based methods and template-based tracking methods [6]; two representative methods are SIFT (Scale-Invariant Feature Transform) [7] and KLT [8,9], respectively. Recently, many deep-learning-based algorithms, such as SuperPoint [10] and D2Net [11], have been proposed to improve the number and robustness of feature matches. However, the efficiency problem of deep-learning-based methods is an obstacle for practical applications, especially on mobile devices.
Due to the different characteristics of event cameras, feature tracking on event flow follows different paradigms. A practical way is to convert event flow into event frames. Usually, events are collected in a temporal window to form event frames, and then traditional feature-tracking paradigms can be applied [12,13]. To improve efficiency and accuracy, different event-to-frame transformations and feature-tracking methods have been proposed [14].
Event-to-frame transformation is the first step of event-frame-based feature tracking. The time surface (TS) is a kind of global 2D surface that uses an exponential decay kernel [15] to emphasize recently triggered events. Another global method is the Event Map, proposed by Zhu et al. [12], which projects the events in a selected spatio-temporal window directly onto frames. The surface of active events (SAE) is a local way of processing the 3D spatio-temporal domain that pays attention to the most recent event at each pixel [16]. Normally, feature detection on TS or SAE is more accurate than on direct projections, as the response of recently triggered events is larger. However, the computational complexity of direct methods is much lower than that of TS or SAE. Besides, TS or SAE needs more memory, as event frames must store at least floating-point values.
A number of event-camera-based feature-detection and feature-tracking algorithms focusing on improving accuracy and efficiency have been proposed. Li et al. [17] proposed the SAE-based FA-Harris corner detection algorithm, which operates directly on asynchronous events instead of event frames. Alzugaray et al. [18] proposed the Arc* detector based on modified SAE filtering and subsequently proposed HASTE (multi-Hypothesis Asynchronous Speeded-up Tracking of Events), which tracks features purely on asynchronous patches using multiple hypotheses [19]. Tedaldi et al. [13] detected Harris features in intensity images and then used the ICP (Iterative Closest Point) method to establish correspondences. Zhu et al. [12] proposed an affine-transformation-based Expectation-Maximization (EM) algorithm to align two patches in consecutive event frames.
The fusion of event frames and intensity images provides benefits for feature tracking. Gehrig et al. [20] proposed an event-camera-based tracker that optimizes brightness-increment differences from intensity images and event flow. Dong et al. [21] proposed a template-based feature-tracking method to improve robustness; they predicted feature-tracking solutions with events and used intensity to correct them. The computational burdens of these methods cannot be ignored due to their high algorithmic complexity. It is observed that only a few features are tracked in these algorithms to ensure real-time performance, which limits high-level applications.
Some high-level applications achieve feature tracking implicitly by reconstructing 3D geometry. Usually, the 3D coordinates of features act as prior information for feature tracking, as they can be projected onto the image plane with predicted poses. Zhou et al. [3,22] tracked the pose of a stereo event camera and reconstructed the 3D environment by minimizing a spatio-temporal energy. Liu et al. [23] proposed a spatio-temporal registration algorithm as part of an event-camera-based pose-estimation method. However, feature-tracking performance and the quality of the high-level tasks (e.g., pose estimation, scene reconstruction) are closely coupled, so multiple factors affect feature-tracking accuracy, and the computational burden increases with higher algorithmic complexity.
In summary, feature tracking on event frames has efficiency and accuracy problems: (1) Tracking efficiency is low due to the characteristics of event cameras and of the designed algorithms; the number of trackable feature points is small, which affects the stability of high-level tasks. (2) Purely tracking features on events easily causes accuracy problems. Although multi-sensor-fusion-based feature tracking has been proposed, the efficiency and accuracy problems still need to be further explored with all available information.

3. Methodology

The incoming data stream for an event-camera-based localization or mapping system is illustrated in Figure 1 as the basic input assumption of the proposed method. Event flow, intensity images, and inertial information are received asynchronously. Note that polarity information of the event flow is not required by the proposed method. Normally, intensity images and inertial information arrive at equal time intervals. If no dynamic objects are in the scene, event flow mainly reflects carrier motion, and the “frequency” of event frames is uneven if a constant number of events is collected per frame. Therefore, the proposed method cannot predict the source of the next input.
Another characteristic of the data stream is that features are only detected on intensity images. Since each intensity frame will be attached with an event frame (see Section 3.2), the features to be tracked are directly projected to event frames with the same position.
The implementation of the proposed method is illustrated in Figure 2. Firstly, the IMU provides angular velocity w_t for feature rotation prediction, which acts as the prior for KLT tracking and MHT-BP. Secondly, event flow is accumulated to generate the event frame F_t. The feature-detection module provides the positions of features for initialization, and features are re-detected if the number of features or their distribution score [24] falls below a threshold. Thirdly, with the assistance of the rotation prediction from the IMU, the method applies MHT-BP to the event frame F_t and KLT tracking to the intensity image I_t; the weighted fusion mechanism then fuses their solutions. The tracking solution is output by MHT-BP or by the weighted fusion module, depending on the application. After the feature-tracking process, the method automatically adds newly detected features if the number of features or their distribution score [24] is not large enough. The Shi-Tomasi corner [25] is chosen as the feature to be tracked. Note that the proposed method is not limited to this feature type and can be extended to other features, such as FAST [26].

3.1. Inertial-Aided Rotation Prediction

According to the work of [27,28], the relationship between tracked features in two consecutive images can be approximated by Equation (1), assuming that the translation is small relative to the scene depth.
u_pred = K · R · K^(−1) · u_last    (1)
where K is the intrinsic matrix, and u_last and u_pred are the 2D (homogeneous) positions of the tracked feature in the last frame and its predicted position, respectively. The rotation matrix R can be integrated from the angular velocity w_t [27], as shown in Figure 2. u_pred acts as the predicted feature position for tracking.
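For concreteness, a minimal sketch of this prediction step is given below, assuming feature positions are stored as pixel coordinates and that R has already been integrated from the gyroscope measurements between the two frames; the function name and example values are illustrative, not the authors' implementation.

```python
import numpy as np

def predict_rotated_features(u_last: np.ndarray, K: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Predict feature positions under pure rotation: u_pred = K R K^-1 u_last (Equation (1)).

    u_last: (N, 2) array of pixel coordinates in the previous frame.
    Returns an (N, 2) array of predicted pixel coordinates.
    """
    ones = np.ones((u_last.shape[0], 1))
    u_h = np.hstack([u_last, ones])            # homogeneous pixel coordinates, shape (N, 3)
    H = K @ R @ np.linalg.inv(K)               # infinite homography induced by the rotation
    u_pred_h = (H @ u_h.T).T                   # apply to every feature
    return u_pred_h[:, :2] / u_pred_h[:, 2:3]  # back to inhomogeneous pixel coordinates

# Hypothetical intrinsics and an identity rotation: prediction equals the last position
K = np.array([[200.0, 0.0, 120.0], [0.0, 200.0, 90.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
print(predict_rotated_features(np.array([[50.0, 60.0]]), K, R))  # -> [[50. 60.]]
```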

3.2. Event-to-Frame Transformation

In this method, the generation of event frames follows the Event Map approach instead of TS or SAE. Since TS- or SAE-based event-frame generation involves exponential computations, it is not used here, which reduces the computational complexity.
Event-to-frame transformation follows two modes, as shown in Equation (2). The first is collecting a constant number τ_FixNum of events. The events are projected onto the image plane to form a binary image, where the absence and presence of events within the temporal window are expressed by 0 and 1, respectively.
The second mode is driven by the timestamps of intensity images. When an intensity image is received, the events between the timestamp of the last frame (intensity frame or event frame) and the current timestamp, a window denoted by τ_FixTime, are collected. That is, each intensity image has an attached event frame. This causes one potential problem: if the number of events is too small, the generated event frame will be too sparse for reliable feature tracking. However, this situation is alleviated by the subsequent integration with the intensity image:
Π(x, y) = 1, if an event (x, y, t) falls within the τ_FixNum or τ_FixTime window; 0, otherwise    (2)
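A minimal sketch of the fixed-number accumulation mode of Equation (2) is given below, assuming events arrive as (x, y, t) tuples and using the 240 × 180 DAVIS 240C resolution from the experiments; names and defaults are illustrative.

```python
import numpy as np

def accumulate_event_frame(events, width=240, height=180, tau_fix_num=3000):
    """Project the first tau_fix_num events onto a binary frame (Equation (2)).

    events: iterable of (x, y, t); polarity is ignored, matching the method's assumption.
    Returns a uint8 image where 1 marks pixels that fired at least once in the window.
    """
    frame = np.zeros((height, width), dtype=np.uint8)
    for i, (x, y, t) in enumerate(events):
        if i >= tau_fix_num:        # constant-number window is full
            break
        frame[int(y), int(x)] = 1   # presence of an event -> 1, absence stays 0
    return frame
```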

3.3. Affine Transformation-Based Multiple Hypothesis Testing for Batch Processing (MHT-BP)

The work of [19] proposed five hypotheses purely based on event flow, which are Null, East, North, West, and South hypotheses. Firstly, a template and a model are generated from a time window on event flow. Then the alignment score that quantifies their differences is calculated to guide the selection of the above-mentioned five hypotheses.
Inspired by the work of [19], a batch-processing-based multiple hypothesis testing method using a four-parameter affine transformation model (MHT-BP) is proposed to improve feature-tracking efficiency. Compared with the five hypotheses in [19], the four-parameter affine transformation model explores more hypotheses to improve matching accuracy. Moreover, batch processing is conducted to improve efficiency.
With the above-mentioned data stream, the minimum possible “frequency” of event frames is equal to that of intensity frames, so the difference between consecutive frames is small enough for patch comparison. The small feature motions bring three benefits for efficiency and accuracy: (1) neighboring areas near features provide supporting regions for forming multiple hypotheses; (2) a set of features can share the same affine transformation, and therefore batch processing can be conducted to improve efficiency; and (3) the search range of the four parameters in the affine transformation model can be small.
The four-parameter affine transformation model is applied to generate multiple hypotheses. α, θ, Δx, and Δy represent the variation in scale, rotation, and translation along X and Y, respectively. The four-parameter affine transformation is shown in Equation (3).
T_Aff = [ α·cosθ   −α·sinθ   Δx
          α·sinθ    α·cosθ   Δy
            0          0      1 ]    (3)
An illustrative example of MHT-BP is shown in Figure 3, given two event frames F_t and F_{t+1}. A patch on F_t containing a set of features is selected. The patch is transformed by the four-parameter affine transformation model to generate multiple hypotheses; that is, each hypothesis corresponds to one affine transformation. After Gaussian blurring, the differences between the transformed patch and the corresponding patch on F_{t+1} are measured by the sum of absolute differences (SAD). The affine transformation with the minimum distance is chosen to establish correspondences between the two patches. This process continues until all features have been processed.
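The hypothesis-testing core can be sketched as follows for a single patch (the batch-processing mechanism, which shares one transformation across a set of features, is omitted); the parameter grids follow the search ranges later listed in Table 1, and the OpenCV-based implementation is ours, not the authors'.

```python
import itertools
import cv2
import numpy as np

def best_affine_hypothesis(patch_t, patch_t1,
                           alphas=(0.95, 1.0, 1.05),
                           thetas_deg=(-5, 0, 5),
                           shifts=range(-5, 6)):
    """Exhaustively test four-parameter affine hypotheses (scale, rotation, tx, ty)
    and keep the one minimizing the SAD between Gaussian-blurred patches."""
    h, w = patch_t.shape
    ref = cv2.GaussianBlur(patch_t1.astype(np.float32), (5, 5), 1.0)
    best, best_sad = None, np.inf
    for a, th, dx, dy in itertools.product(alphas, thetas_deg, shifts, shifts):
        # similarity transform about the patch center, plus translation (dx, dy)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), th, a)
        M[:, 2] += (dx, dy)
        warped = cv2.warpAffine(patch_t.astype(np.float32), M, (w, h))
        warped = cv2.GaussianBlur(warped, (5, 5), 1.0)
        sad = float(np.abs(warped - ref).sum())   # sum of absolute differences
        if sad < best_sad:
            best, best_sad = (a, th, dx, dy), sad
    return best, best_sad
```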

3.4. Weighted Fusion Using Event/Intensity Information

MHT-BP is essentially a template-matching method that purely uses event information and therefore suffers from drift. Because of the high “frequency” of event frames, the mature KLT tracking results between intensity frames can guide the establishment of correspondences between event frames. A constant-velocity model is used to re-predict the tracking solutions, and a weighted fusion method is proposed to correct the drift and thus improve accuracy. This provides two options, for real-time processing and post-processing: for real-time processing, features can be obtained from MHT-BP; for post-processing tasks such as bundle adjustment (BA) and SfM, poses and structure calculated from the tracked features of event frames can still be obtained in a delayed manner.
As illustrated in Figure 4, the first row is an event-frame sequence, and the second row is an intensity-image sequence. The black dots and blue dots are detected features that share the same positions. MHT-BP provides tracking solutions on each event frame. Due to the blurring effects of event frames, tracking solutions from MHT-BP suffer from drift, and the drift, represented by the uncertainties shown in Figure 4, increases with time. However, since KLT tracking on the intensity sequence uses texture information, its drift grows more slowly than that of MHT-BP in our tests. Therefore, the weighted fusion mechanism can reduce the drift error.
Firstly, the stochastic models of MHT-BP and KLT are expressed by σ_F and σ_I, respectively, as in Equations (4) and (5). The uncertainty is related to time and tracking quality.
σ_F^2 = β·Δt    (4)
σ_I^2 = δ·Δt    (5)
where Δt is the time elapsed since the timestamp of the first intensity image, and β and δ are drift rates in pixels per second, which adjust the weights for fusing the solutions. These parameters need to be defined reasonably and empirically.
Secondly, the velocity of the optical flow from KLT tracking is assumed to be constant due to the small-motion assumption. If event frames exist between two intensity images, the feature-tracking solution can be linearly estimated on virtual intensity frames, such as I_{t−Δt1} and I_{t−Δt2} (see the dotted borders in the second row of Figure 4), using Equations (6) and (7).
x_KLT(t_i) = Δx · t_i / Δt    (6)
y_KLT(t_i) = Δy · t_i / Δt    (7)
where Δx and Δy represent the feature displacements between I_{t−1} and I_t along the x and y axes, respectively, Δt is the time period between I_{t−1} and I_t, and t_i is the time elapsed since t−1.
Thirdly, with the above-mentioned stochastic model and constant-velocity model, the weighted fusion is conducted using Equations (8) and (9). As shown in Figure 4, the red dots in the last row show the result of the weighted fusion.
x_Fuse(t_i) = (σ_F^2 · x_Event(t_i) + σ_I^2 · x_KLT(t_i)) / (σ_I^2 + σ_F^2)    (8)
y_Fuse(t_i) = (σ_F^2 · y_Event(t_i) + σ_I^2 · y_KLT(t_i)) / (σ_I^2 + σ_F^2)    (9)
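A minimal sketch of the constant-velocity interpolation and weighted fusion of Equations (4)–(9) is given below; variable names are ours, the drift rates default to the Table 1 values, and, as a simplifying assumption, both variances are evaluated at the elapsed time t_i.

```python
def fuse_tracks(x_event, y_event, t_i, dt, dx_klt, dy_klt, beta=5.0, delta=2.0):
    """Fuse an MHT-BP (event-frame) solution with a KLT solution interpolated
    under the constant-velocity model, following Equations (4)-(9) as written.

    t_i:  time elapsed since the last intensity image (s)
    dt:   time between the two bracketing intensity images (s)
    dx_klt, dy_klt: KLT feature displacement between those intensity images (pixels)
    """
    sigma_f2 = beta * t_i    # Eq. (4): uncertainty of the event-frame track
    sigma_i2 = delta * t_i   # Eq. (5): uncertainty of the interpolated KLT track
    x_klt = dx_klt * t_i / dt   # Eq. (6): constant-velocity interpolation (x)
    y_klt = dy_klt * t_i / dt   # Eq. (7): constant-velocity interpolation (y)
    denom = sigma_i2 + sigma_f2
    x_fuse = (sigma_f2 * x_event + sigma_i2 * x_klt) / denom   # Eq. (8)
    y_fuse = (sigma_f2 * y_event + sigma_i2 * y_klt) / denom   # Eq. (9)
    return x_fuse, y_fuse
```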
The weighted fusion of the MHT-BP and KLT solutions is named WF-MHT-BP. Like a normal feature-tracking algorithm, it also detects new features after processing an event frame when the number of tracked features or the distribution score falls below the threshold.
The proposed WF-MHT-BP is summarized in Algorithm 1. Given the data stream structure shown in Figure 1, the inertial information is first used to predict feature rotation. Then, if an event frame arrives, MHT-BP purely uses event information to generate tracking solutions; if an intensity image arrives, the proposed algorithm (WF-MHT-BP) provides tracking solutions by fusing the tracking solutions of KLT and MHT-BP. It should be noted that both MHT-BP and the full algorithm can output tracking solutions, depending on when tracking solutions are needed.
Algorithm 1: Feature-tracking method based on the integration of event, intensity, and inertial information (WF-MHT-BP)
Input: events {x, y, t}; τ_FixNum; τ_FixTime; search ranges of α, θ, Δx, Δy; β, δ; intensity image I; angular velocity w; feature-detection thresholds Θ and Θ_sco
Output: tracking solutions on the current intensity or event frame
1: Predict feature rotation using Equation (1) with angular velocity w
2: Generate event frame F_t using Equation (2) with τ_FixNum and τ_FixTime
3: if an event frame F_t is received then
4:     Track features using MHT-BP within the search ranges of α, θ, Δx, Δy
5: if tracking solutions are needed then
6:     Output tracking solutions on event frame F_t
7: if an intensity frame I_t is received then
8:     Perform KLT tracking between I_t and I_{t−1}
9:     Perform weighted fusion of the tracking solutions from KLT and MHT-BP
10:    if NumOfTrackedFeatures < Θ or DistributionScore < Θ_sco then
11:        Perform Shi-Tomasi detection on I_t
12:    if tracking solutions are needed then
13:        Output tracking solutions on intensity frame I_t
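Algorithm 1 can be read as a dispatch loop over the asynchronous stream of Figure 1; the following schematic sketch uses placeholder handler names (not the authors' code) to make the control flow explicit.

```python
def wf_mht_bp_loop(stream, tracker):
    """Schematic main loop of WF-MHT-BP over an asynchronous sensor stream.

    `stream` yields ('imu', w), ('event_frame', F), or ('intensity', I) items;
    `tracker` is assumed to expose the placeholder handlers used below."""
    for kind, data in stream:
        if kind == 'imu':
            tracker.predict_rotation(data)       # step 1: Equation (1)
        elif kind == 'event_frame':
            tracker.track_mht_bp(data)           # steps 3-4: MHT-BP on F_t
        elif kind == 'intensity':
            tracker.track_klt(data)              # step 8: KLT between I_t and I_{t-1}
            tracker.weighted_fusion()            # step 9: Equations (8)-(9)
            if tracker.num_features() < tracker.theta or \
               tracker.distribution_score() < tracker.theta_sco:
                tracker.detect_shi_tomasi(data)  # steps 10-11: re-detection
```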

4. Experiments

The experiments use 16 sequences from the publicly available event-camera dataset provided by the University of Zurich [29], which was recorded with a DAVIS 240C from Inivation, Zurich, Switzerland. It provides event flow, intensity images, and IMU measurements; the resolution of the intensity images is 240 × 180. WF-MHT-BP runs on the MATLAB platform on a computer with an Intel i5-10400F CPU and 16 GB of memory.
The input parameters are summarized in Table 1. τ_FixNum is set to 3000, which means that an event frame is formed once 3000 events have arrived. α, θ, Δx, and Δy define the search range of the affine transformation, as illustrated in Equation (3). β and δ are the drift rates in pixels per second used in Equations (4) and (5). Θ and Θ_sco are set to 40 and 0.15, respectively, which means that if the number or distribution score of tracked features falls below 40 or 0.15, Shi-Tomasi detection is conducted to add new features.
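For reference, the Table 1 settings can be grouped into a single configuration object; the field names below are our own.

```python
WF_MHT_BP_PARAMS = {
    "tau_fix_num": 3000,          # events per event frame
    "alpha_range": (0.95, 1.05),  # scale search range
    "theta_range_deg": (-5, 5),   # rotation search range (degrees)
    "dx_range_px": (-5, 5),       # translation search range along x (pixels)
    "dy_range_px": (-5, 5),       # translation search range along y (pixels)
    "beta_px_per_s": 5,           # event-frame drift rate, Eq. (4)
    "delta_px_per_s": 2,          # intensity-frame drift rate, Eq. (5)
    "theta_num_features": 40,     # re-detection threshold on feature count
    "theta_sco": 0.15,            # re-detection threshold on distribution score
}
```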
The state-of-the-art baselines are chosen from open-source event-frame-based feature-tracking methods. One is the probabilistic data-association-based tracking (PDAT) method proposed by Zhu et al. [12], for which the publicly available implementation is used. Another is an ICP-based feature-tracking algorithm (High Temporal Resolution Tracking, HTRT) based on the work of Tedaldi et al. [13]. Since no original implementation is provided by the authors, a third-party implementation is used (https://github.com/thomasjlew/davis_tracker, accessed on 29 February 2022). Modifications are made to provide better performance for the comparison: the maximum number of ICP iterations is changed to 3, and the feature type is changed to Shi-Tomasi, the same as in WF-MHT-BP.
In contrast to the above-mentioned methods, which purely use event information, EKLT (implemented in C++), which integrates event and intensity information, is also compared in terms of accuracy, efficiency, and feature-tracking length. Since the proposed algorithm is implemented in MATLAB, which is normally slower than C++, the performance of EKLT is listed as a reference.
Firstly, MHT-BP, the internal component of WF-MHT-BP, is compared with an open-source template-matching method to show its improved efficiency and comparable accuracy. Then, WF-MHT-BP is compared with PDAT, HTRT, and EKLT in terms of tracking accuracy and efficiency. Finally, feature-tracking length is compared between EKLT and WF-MHT-BP.

4.1. Feature Tracking Accuracy and Comparison between MHT-BP and FasT-Match

The goal of feature matching on event frames is to find the affine transformation, within small ranges, between consecutive frames. FasT-Match [30] was chosen as the baseline for comparing efficiency and patch-matching error, as it pursues the same goal as MHT-BP; moreover, it is a template-patch-matching method with a control flow similar to that of MHT-BP. The differences between MHT-BP and FasT-Match are: (1) MHT-BP uses a simplified affine transformation model, whereas FasT-Match uses a more complex transformation model with six parameters. (2) Batch processing is used in MHT-BP, whereas FasT-Match does not have this mechanism. (3) FasT-Match uses a branch-and-bound search strategy to find the parameters of the transformation model, whereas MHT-BP does not use strict termination conditions, in order to improve efficiency. FasT-Match is implemented in MATLAB, the same as MHT-BP. The "shapes_rotation" sequence was chosen for the efficiency and template-matching-error comparison.
A smaller average normalized patch error (NPE) means higher similarity between two patches. Figure 5 shows the cumulative distribution functions (CDF) of the NPE and of the consumed time. The NPE curves of MHT-BP and FasT-Match are very similar: the average NPE of FasT-Match and MHT-BP is 0.041 and 0.049, respectively, which means that both matching errors are very small and their difference can be ignored compared with the error range (0, 1). The mean consumed time of MHT-BP and FasT-Match is 3.53 ms and 80.40 ms, respectively; the consumed time of MHT-BP is much lower than that of FasT-Match, corresponding to roughly a 10×–20× speed-up over most of the distribution. The acceptable error and reduced computational complexity come from the batch-processing mechanism and the four-parameter affine transformation in MHT-BP. The majority of normalized patch errors of FasT-Match are slightly lower than those of MHT-BP. Although the matching error of MHT-BP is around 19% larger than that of FasT-Match (0.049 vs. 0.041), this error is corrected in a timely manner by KLT tracking in the subsequent WF-MHT-BP step. Therefore, MHT-BP is chosen as the internal component of WF-MHT-BP.

4.2. Feature-Tracking Accuracy Comparison

Figure 6 shows feature-tracking solutions from the proposed method. The traces of tracked features are projected on the first intensity image to show the matching results. The 16 datasets with different scenarios of lighting, objects, and motion are compared in feature-tracking accuracy.
Both MHT-BP and WF-MHT-BP generate tracking solutions. The solutions between frames are filtered by fundamental-matrix-based RANSAC (RANdom SAmple Consensus); the Sampson distance threshold for the fundamental-matrix estimation is set to 0.1. The inlier ratio is used as the indicator of tracking accuracy. PDAT and HTRT are evaluated in the same way. However, EKLT corrects the tracking solutions from event information when an intensity frame arrives, so for EKLT the inlier ratio between intensity frames is used.
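The inlier-ratio metric can be approximated with OpenCV's fundamental-matrix RANSAC, as sketched below; note that OpenCV thresholds the point-to-epipolar-line distance rather than the exact Sampson distance, so this is only an approximation of the described setup.

```python
import cv2
import numpy as np

def inlier_ratio(pts_prev: np.ndarray, pts_curr: np.ndarray, thresh: float = 0.1) -> float:
    """Estimate a fundamental matrix with RANSAC and return the inlier ratio.

    pts_prev, pts_curr: (N, 2) float32 arrays of matched feature positions.
    """
    F, mask = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC,
                                     ransacReprojThreshold=thresh, confidence=0.99)
    if mask is None:   # estimation failed (too few points or a degenerate configuration)
        return 0.0
    return float(mask.sum()) / len(mask)
```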
For each dataset, the average inlier ratio is calculated, as shown in Table 2. Note that the inliers from RANSAC are not fed back into the subsequent feature-tracking process. For each method, scenes with different lighting, object, and motion settings yield different average inlier ratios. It is interesting that the inlier ratio of HTRT is around 50% on all datasets. The inlier ratio of PDAT ranges from 38.19% in the "hdr_poster" sequence to 80.80% in the "shapes_translation" sequence. In the "boxes"-, "hdr"-, and "poster"-related sequences, the inlier ratio of PDAT drops rapidly, showing the difficulty caused by environmental factors.
The inlier ratio of EKLT is much higher than those of PDAT and HTRT, reaching 88.26% on average. WF-MHT-BP achieves the highest inlier ratio on all datasets. The main differences between EKLT and WF-MHT-BP are: (1) EKLT uses gradients to track features on event frames without batch processing, whereas MHT-BP in the proposed method uses the four-parameter affine transformation for feature tracking in a batch-processing way. (2) WF-MHT-BP uses inertial information to predict feature positions, while EKLT does not. (3) EKLT only corrects drift when an intensity image arrives, whereas WF-MHT-BP uses a simple and efficient fusion mechanism to correct the tracking solutions from MHT-BP and the current positions of the tracked features. Overall, WF-MHT-BP integrates all available information for feature tracking, making the tracking solutions more accurate.

4.3. Feature-Tracking Efficiency Comparison

The efficiency of WF-MHT-BP is compared with that of PDAT, HTRT, and EKLT, which is meaningful for real applications, as lower time complexity leaves more time for frame processing. Since the methods track different numbers of features, efficiency is quantified by the consumed time per tracked feature, T_PerFea, calculated as:
T_PerFea = T_PerFrm / N_NumFea
where T_PerFrm is the consumed time per frame, and N_NumFea is the number of tracked features in each frame. Note that the visualization components of all algorithms are disabled to ensure accurate timing statistics.
As shown in Table 3, HTRT has the highest computational cost, reaching 1062 ms per tracked feature in the "dynamic_rotation" sequence, which is not practical for real-time high-level tasks. EKLT still needs tens of milliseconds to track one feature between two intensity images. PDAT consumes less time per tracked feature than EKLT, but in the "dynamic_translation" and "outdoors_walking" sequences it reaches 56 ms and 52 ms, respectively. WF-MHT-BP achieves the best efficiency among the four feature-tracking methods, generally improving efficiency by roughly two orders of magnitude compared with PDAT and EKLT and by three orders of magnitude compared with HTRT.
The main reason for the high computational cost of PDAT and HTRT is that the registration between two patches is performed feature by feature. Moreover, the EM (Expectation-Maximization) and ICP algorithms used in these methods are not suitable for real-time processing. Another reason for the high time consumption of HTRT is that the related event information has to be searched from external memory for every event frame; its time consumption is at approximately the same level as EKLT at the beginning, but as feature tracking proceeds, the per-frame time grows, resulting in a larger overall time consumption.
The EKLT algorithm involves complex optimization of an objective function for the tracking error and also lacks a batch-processing mechanism; therefore, its consumed time is much larger than that of WF-MHT-BP. In WF-MHT-BP, the efficiency problem is addressed by the batch-processing mechanism and the simple loose integration with the KLT tracking solution, giving WF-MHT-BP the lowest computational cost.

4.4. Feature-Tracking Length Comparison

Feature-tracking length is another important indicator for feature tracking, especially in VO or SLAM, since improved feature continuity provides more constraints for estimating poses or constructing maps. Without re-detecting features, feature-tracking length is quantified by the time between the timestamp of the first frame and the time when the number of tracked features decreases to 10% of the initial count. Due to its high computational cost, HTRT is not included in the feature-tracking-length comparison.
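A sketch of this metric is given below, assuming the per-frame timestamps and counts of surviving tracks are available; the names are ours.

```python
def tracking_length(timestamps, num_tracked, survival_fraction=0.1):
    """Time from the first frame until the number of tracked features
    drops to `survival_fraction` of the initial count (without re-detection)."""
    n0 = num_tracked[0]
    for t, n in zip(timestamps, num_tracked):
        if n <= survival_fraction * n0:
            return t - timestamps[0]
    return timestamps[-1] - timestamps[0]   # never dropped below the threshold

# Example: features fall to 10% of the initial 40 at t = 3.2 s
print(tracking_length([0.0, 1.0, 2.0, 3.2], [40, 30, 12, 4]))  # -> 3.2
```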
The results are shown in Table 4. PDAT shows limited tracking length: on average, the number of tracked features drops to 10% of the initial count after only 0.35 s. Its main difference from the other two methods is that it purely uses event information for feature tracking, which makes it easy to lose tracked features.
WF-MHT-BP and EKLT show similar feature-tracking-length statistics, demonstrating the benefit of integrating measurements from the two sensors. What the two methods have in common is that intensity information is used to correct the feature-tracking drift from event information, which helps to keep features tracked.

5. Concluding Remarks

This paper proposes a loosely integrated feature-tracking method on event frames that uses event, intensity, and inertial information to address the accuracy and efficiency problems. MHT-BP, which involves a four-parameter affine transformation and batch processing, is proposed to achieve fast, short-term feature matching. A weighted fusion algorithm involving a constant-velocity model and a stochastic drift model is then proposed to reduce drift; the drift is corrected by weighted fusion in a post-processing manner, which is still meaningful for event-camera-based applications such as SfM and SLAM.
MHT-BP is compared with FasT-Match, showing better efficiency at the expense of a slight accuracy decline. In comparison with three state-of-the-art methods, including two purely event-information-based methods (PDAT and HTRT) and one multi-sensor-fusion-based method (EKLT), WF-MHT-BP shows significant superiority in accuracy and efficiency and comparable feature-tracking lengths to EKLT.
In the future, the work can be refined and extended in the following aspects: First, feature detection is still conducted on intensity frames in the initialization and re-detection stage. If more features are needed, but intensity images have not arrived, the accuracy of high-level tasks, such as VO and SLAM, may be affected. Therefore, feature detection on event frames still needs to be further explored. Second, the final goal of event-frame-based feature tracking is to improve the robustness in challenging environments or motions. Thus, future work will focus on the application of WF-MHT-BP on high-level tasks, such as event-camera-based localization and mapping applications.

Author Contributions

Conceptualization, X.L. and Z.L.; methodology, Z.L.; software and validation, Z.L.; writing—original draft preparation, Z.L., F.Z., and Y.L.; writing—review and editing, F.Z. and Y.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition Z.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by project ZR2021QD148 supported by Shandong Provincial Natural Science Foundation and the Open Research Project (grant number ICT2021B17) of the State Key Laboratory of Industrial Control Technology, Zhejiang University, China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available at URL: https://rpg.ifi.uzh.ch/davis_data.html (accessed on 29 January 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ma, Y.-Y.; Ye, Z.-H.; Liu, K.-H.; Chen, L. Event-based visual localization and mapping algorithms: A survey. Acta Autom. Sin. 2021, 47, 1484–1494. [Google Scholar]
  2. Vidal, A.R.; Rebecq, H.; Horstschaefer, T.; Scaramuzza, D. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robot. Autom. Lett. 2018, 3, 994–1001. [Google Scholar] [CrossRef] [Green Version]
  3. Zhou, Y.; Gallego, G.; Shen, S. Event-based stereo visual odometry. IEEE Trans. Robot. 2021, 37, 1433–1450. [Google Scholar] [CrossRef]
  4. Franceschini, N. Small brains, smart machines: From fly vision to robot vision and back again. Proc. IEEE 2014, 102, 751–781. [Google Scholar] [CrossRef]
  5. Shen, C. Brain-Like Navigation Technology Based on Inertial/Vision System. In Intelligent Information Processing for Inertial-Based Navigation Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 95–114. [Google Scholar]
  6. Zhai, M.; Xiang, X.; Lv, N.; Kong, X. Optical flow and scene flow estimation: A survey. Pattern Recognit. 2021, 114, 107861. [Google Scholar] [CrossRef]
  7. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Corfu, Greece, 20–27 September 1999; pp. 1150–1157. [Google Scholar]
  8. Lucas, B.D.; Kanade, T. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, BC, Canada, 24–28 August 1981; pp. 674–679. [Google Scholar]
  9. Tomasi, C.; Kanade, T. Detection and Tracking of Point Features. Int. J. Comput. Vis. 1991, 9, 137–154. [Google Scholar] [CrossRef]
  10. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 224–236. [Google Scholar]
  11. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable CNN for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 8092–8101. [Google Scholar]
  12. Zhu, A.Z.; Atanasov, N.; Daniilidis, K. Event-based feature tracking with probabilistic data association. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4465–4470. [Google Scholar]
  13. Tedaldi, D.; Gallego, G.; Mueggler, E.; Scaramuzza, D. Feature detection and tracking with the dynamic and active-pixel vision sensor (DAVIS). In Proceedings of the 2016 Second International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Krakow, Poland, 13–15 June 2016; pp. 1–7. [Google Scholar]
  14. Gallego, G.; Delbruck, T.; Orchard, G.M.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.; Conradt, J.; Daniilidis, K.; et al. Event-based Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020. [Google Scholar] [CrossRef] [PubMed]
  15. Delbruck, T. Frame-free dynamic digital vision. In Proceedings of the International Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, Tokyo, Japan, 6–7 March 2008; pp. 21–26. [Google Scholar]
  16. Mueggler, E.; Forster, C.; Baumli, N.; Gallego, G.; Scaramuzza, D. Lifetime estimation of events from Dynamic Vision Sensors. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Istanbul, Turkey, 26–30 May 2015; pp. 4874–4881. [Google Scholar]
  17. Li, R.; Shi, D.; Zhang, Y.; Li, K.; Li, R. FA-Harris: A fast and asynchronous corner detector for event cameras. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Venetian Macao, Macau, 3–8 November 2019; pp. 6223–6229. [Google Scholar]
  18. Alzugaray, I.; Chli, M. Asynchronous corner detection and tracking for event cameras in real time. IEEE Robot. Autom. Lett. 2018, 3, 3177–3184. [Google Scholar] [CrossRef] [Green Version]
  19. Alzugaray Lopez, I.; Chli, M. HASTE: Multi-Hypothesis Asynchronous Speeded-up Tracking of Events. In Proceedings of the 31st British Machine Vision Virtual Conference (BMVC 2020), Virtual Event, 7–10 September 2020; p. 744. [Google Scholar]
  20. Gehrig, D.; Rebecq, H.; Gallego, G.; Scaramuzza, D. EKLT: Asynchronous photometric feature tracking using events and frames. Int. J. Comput. Vis. 2020, 128, 601–618. [Google Scholar] [CrossRef]
  21. Dong, Y.; Zhang, T. Standard and Event Cameras Fusion for Feature Tracking. In Proceedings of the 2021 International Conference on Machine Vision and Applications, Singapore, 25–27 July 2021; pp. 55–60. [Google Scholar]
  22. Zhou, Y.; Gallego, G.; Rebecq, H.; Kneip, L.; Li, H.; Scaramuzza, D. Semi-dense 3D reconstruction with a stereo event camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 235–251. [Google Scholar]
  23. Liu, D.; Parra, A.; Chin, T.-J. Spatiotemporal Registration for Event-based Visual Odometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4937–4946. [Google Scholar]
  24. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef] [Green Version]
  25. Shi, J. Good features to track. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 21–23 June 1994; pp. 593–600. [Google Scholar]
  26. Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 430–443. [Google Scholar]
  27. Hwangbo, M.; Kim, J.-S.; Kanade, T. Inertial-aided KLT feature tracking for a moving camera. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 1909–1916. [Google Scholar]
  28. Wei, H.; Tang, F.; Zhang, C.; Wu, Y. Highly Efficient Line Segment Tracking with an IMU-KLT Prediction and a Convex Geometric Distance Minimization. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [Google Scholar]
  29. Mueggler, E.; Rebecq, H.; Gallego, G.; Delbruck, T.; Scaramuzza, D. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robot. Res. 2017, 36, 142–149. [Google Scholar] [CrossRef]
  30. Korman, S.; Reichman, D.; Tsur, G.; Avidan, S. FasT-Match: Fast Affine Template Matching. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2331–2338. [Google Scholar]
Figure 1. Different frequency of intensity images, event frames, and inertial updates.
Figure 2. Flowchart of the data stream and the proposed method WF-MHT-BP.
Figure 3. Illustrative example of multiple hypothesis testing with the four-parameter affine transformation model.
Figure 4. Illustration of the weighted fusion method for MHT-BP and KLT.
Figure 5. CDF of normalized patch error and consumed time for MHT-BP and FasT-Match.
Figure 6. Illustration of feature-tracking solutions from WF-MHT-BP (each subgraph shows the trajectories of tracked features projected onto the first intensity image).
Table 1. Input parameters for the proposed method.
Input Parameter | Value
τ_FixNum | 3000
α | (0.95, 1.05)
θ | (−5°, 5°)
Δx (pixel) | (−5, 5)
Δy (pixel) | (−5, 5)
β (pixel/s) | 5
δ (pixel/s) | 2
Θ | 40
Θ_sco | 0.15
Table 2. Inlier ratios of the PDAT/HTRT/EKLT/WF-MHT-BP methods.
Dataset | PDAT | HTRT | EKLT | WF-MHT-BP
shapes_rotation | 80.11% | 52.94% | 85.64% | 97.16%
shapes_translation | 80.80% | 52.38% | 89.83% | 93.54%
shapes_6dof | 80.36% | 52.17% | 84.74% | 94.41%
boxes_rotation | 41.97% | 51.61% | 85.99% | 99.97%
boxes_translation | 43.94% | 51.52% | 85.27% | 99.98%
boxes_6dof | 42.49% | 51.85% | 84.96% | 99.85%
outdoors_running | 55.06% | 51.72% | 98.86% | 98.04%
outdoors_walking | 77.92% | 53.57% | 95.54% | 99.18%
dynamic_rotation | 65.32% | 50.84% | 87.53% | 99.95%
dynamic_translation | 63.57% | 50.62% | 84.14% | 99.93%
dynamic_6dof | 70.66% | 50.42% | 84.74% | 99.78%
poster_rotation | 43.73% | 51.22% | 91.35% | 99.38%
poster_translation | 49.34% | 50.84% | 91.04% | 99.96%
poster_6dof | 41.65% | 50.74% | 87.20% | 99.87%
hdr_poster | 38.19% | 50.57% | 84.43% | 99.94%
hdr_boxes | 46.31% | 50.70% | 90.95% | 99.86%
Average | 57.59% | 51.48% | 88.26% | 98.80%
Table 3. The consumed time per tracked feature of the PDAT/HTRT/EKLT/WF-MHT-BP methods (unit: ms).
Dataset | PDAT | HTRT | EKLT | WF-MHT-BP
shapes_rotation | 47 | 469 | 28 | 1.00
shapes_translation | 48 | 417 | 29 | 1.20
shapes_6dof | 56 | 356 | 25 | 0.62
boxes_rotation | 14 | 405 | 62 | 0.22
boxes_translation | 13 | 390 | 68 | 0.15
boxes_6dof | 14 | 637 | 66 | 0.36
outdoors_running | 23 | 473 | 52 | 0.45
outdoors_walking | 52 | 151 | 15 | 0.47
dynamic_rotation | 24 | 1062 | 75 | 0.25
dynamic_translation | 56 | 759 | 54 | 0.33
dynamic_6dof | 23 | 566 | 66 | 0.19
poster_rotation | 16 | 440 | 47 | 0.59
poster_translation | 18 | 290 | 58 | 0.28
poster_6dof | 12 | 777 | 69 | 0.13
hdr_poster | 12 | 367 | 85 | 0.42
hdr_boxes | 12 | 214 | 62 | 0.18
Average | 27.8 | 485.8 | 53.8 | 0.43
Table 4. Average duration of one track for the PDAT/EKLT/WF-MHT-BP methods (unit: s).
Dataset | PDAT | EKLT | WF-MHT-BP
shapes_rotation | 0.69 | 3.30 | 1.92
shapes_translation | 1.58 | 6.82 | 6.93
shapes_6dof | 1.21 | 6.73 | 5.70
boxes_rotation | 0.01 | 3.02 | 2.70
boxes_translation | 0.22 | 3.65 | 3.93
boxes_6dof | 0.14 | 4.05 | 4.41
outdoors_running | 0.16 | 1.58 | 4.52
outdoors_walking | 0.46 | 3.45 | 4.70
dynamic_rotation | 0.55 | 2.39 | 3.24
dynamic_translation | 0.15 | 2.91 | 4.48
dynamic_6dof | 0.31 | 2.75 | 4.96
poster_rotation | 0.01 | 2.76 | 0.69
poster_translation | 0.03 | 3.38 | 2.62
poster_6dof | 0.03 | 5.89 | 3.89
hdr_poster | 0.05 | 4.84 | 2.57
hdr_boxes | 0.05 | 1.89 | 3.35
Average | 0.35 | 3.71 | 3.78