Article

Spatio-Temporal Context, Correlation Filter and Measurement Estimation Collaboration Based Visual Object Tracking

1
Department of Electrical Engineering, International Islamic University, Islamabad 44000, Pakistan
2
Department of Software Engineering, Bahria University, Islamabad 44000, Pakistan
3
School of Electrical Engineering, Southeast University, Nanjing 210096, China
4
Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia
*
Authors to whom correspondence should be addressed.
Sensors 2021, 21(8), 2841; https://doi.org/10.3390/s21082841
Submission received: 8 March 2021 / Revised: 1 April 2021 / Accepted: 15 April 2021 / Published: 17 April 2021
(This article belongs to the Special Issue Object Tracking and Motion Analysis)

Abstract
Despite eminent progress in recent years, object tracking algorithms still face challenges such as scale variations, partial or full occlusions, background clutter, and illumination variations, which must be resolved with improved estimation for real-time applications. This paper proposes a robust and fast object tracking algorithm based on spatio-temporal context (STC). A pyramid-representation-based scale correlation filter is incorporated to overcome STC's inability to handle rapid changes in target scale; it learns the appearance induced by variations in the target scale from samples drawn at a set of different scales. During occlusion, most correlation filter trackers start drifting due to wrong sample updates. To prevent the target model from drifting, an occlusion detection and handling mechanism is incorporated. Occlusion is detected from the peak correlation score of the response map. After occlusion is detected, an extended Kalman filter continuously predicts the target location and passes it to the STC tracking model. This decreases the chance of tracking failure, as the Kalman filter continuously updates itself and the tracking model. The model is further improved by fusing an average peak to correlation energy (APCE) criterion, which automatically updates the target model to deal with environmental changes. Extensive experiments on benchmark datasets indicate the efficacy of the proposed tracking method compared with the state of the art.

1. Introduction

Visual object tracking (VOT) has emerged as a dynamic research area due to its utilization in a wide range of applications such as human action recognition [1,2,3], traffic monitoring [4,5], pellet ore phase analysis [6], smart cities [7], embedded systems [8], surveillance [9,10,11], and medical diagnosis [12,13]. While significant progress has been made in recent years, accurately tracking an object in a video sequence remains challenging due to factors such as scale variations, occlusion, deformation, and background clutter, to name a few [14,15,16]. Target tracking methods, classified as generative [17] and discriminative [18], are widely referred to in the literature with prominent applications. Generative tracking methods learn an appearance model of the target and search for the highest matching score; these methods achieve good tracking results at the expense of computational cost. Discriminative tracking methods treat tracking as a binary classification problem and achieve favorable results; however, their performance may degrade when the training data are limited.

1.1. Related Work

Rich literature is available on visual object tracking, dealing with target appearance models and model updating. In this section, three types of tracking algorithms closely related to our tracking method are introduced: tracking by spatio-temporal context (STC), tracking by correlation filter, and tracking by Kalman filter (KF).

1.1.1. Tracking by STC

A tracking algorithm based on STC utilizes the fast Fourier transform (FFT) to accelerate its computations. Zhang et al. [19] proposed a spatio-temporal context (STC) based tracking model by formulating the temporal relation between the target and its context in a Bayesian framework. The model's confidence map is then maximized to determine the target location, after which the tracking model and scale are updated. Based on [19], Zhang et al. proposed an adaptive STC model for online tracking by incorporating histogram of oriented gradients (HoG) and color naming (CN) features into the STC framework; they also used the average difference between adjacent frames to adjust the learning rate when the model is updated [20]. To further improve tracking performance in the STC framework, Wang et al. [21] proposed an improved tracking model that combines STC with a convolutional neural network (CNN) to extract deep CNN features online without training. In [22,23], a motion vector-based mechanism for predicting the target position under motion was incorporated into the STC framework to improve its scale estimation; a scale correlation filter was also combined with STC to extract samples at different scales around the target, using the HoG operator to form a pyramid of scale features.

1.1.2. Tracking by Correlation Filter

Correlation filters have been broadly applied in object tracking [24,25,26,27,28]. To solve scale estimation in correlation filtering, Danelljan et al. [29] proposed a tracker based on separate correlation filters for translation and scale using an image scale pyramid representation. The implementation in [29] was optimized in [30] with various strategies for reducing computational cost. Zhang et al. [31] used [30] as their base tracker and proposed a motion-aware correlation filters (MACF) based tracking method by incorporating a joint motion estimation Kalman filter into discriminative correlation filters, using the confidence of squared response map (CSRM) criterion for model update and occlusion detection. Ma et al. [32] used the implementation in [30] and proposed a fast and accurate scale estimation method by incorporating the average peak to correlation energy (APCE) in a multiresolution translation filter. Li et al. [33] introduced a scale adaptive tracking method in the KCF framework; they addressed the issue of fixed template size in KCF and incorporated HoG and CN features.

1.1.3. Tracking by Kalman Filter

Kalman filters are widely utilized for occlusion handling in various trackers [34,35,36,37,38]. Yang et al. [39] proposed an improved STC algorithm that combines the Kalman filter with STC, making it more robust, and used Euclidean distance to detect occlusion. Mehmood et al. [40] proposed a tracking algorithm similar to [39]; in their implementation, they incorporated a context-aware formulation and combined a Kalman filter with the STC framework, using the maximum value of the response map for occlusion detection. Khan et al. [41] proposed an improved tracker based on long-term correlation tracking (LCT); they incorporated a Kalman filter into the LCT framework for occlusion handling and the peak-to-sidelobe ratio (PSR) of the response map for occlusion detection.
Based on the presented literature, it can be concluded that significant modifications have been made to the STC algorithm. These modifications concern occlusion detection and handling mechanisms, target model update mechanisms, scale update schemes, the fusion of various cues and features, deep learning techniques, and adaptive learning rate mechanisms. All tracking results of the proposed method are available via a Google Drive link.

1.2. Our Contributions

Based on the related work, this article proposes an object tracking algorithm that enhances STC under scale variations, background clutter, occlusion, illumination variation, and deformation. The contributions of this article are as follows:
(1)
We propose a scale correlation filter-based pyramid representation mechanism to accurately extract the target without accumulating the scale model’s error. We use a combination of spatio-temporal context and scale correlation filter to achieve accurate object tracking.
(2)
We introduce an effective method in which the object can be tracked accurately by utilizing extended Kalman filter (EKF) prediction for nonlinear target motion. We also use the peak value of the response map to measure the reliability of the current estimated position. If the tracking result is unreliable, this method can regain the target position to continue tracking.
(3)
We propose an adaptive learning rate mechanism based on the average peak to correlation energy (APCE) of the response map to update the target appearance model. This method effectively prevents the tracking model from being corrupted by a wrong appearance update.
(4)
Experimental results have been presented on de facto standard videos to show the efficacy of the proposed method over STC [19], DCFCA [26], Modified KCF [28], MACF [31], Modified STC [40], and AFAM-PEC [41].
A correlation filter-based discriminative scale mechanism is incorporated into spatio-temporal learning in the proposed work, making it robust and effective in scenarios such as cluttered backgrounds, illumination variation, scale variations, and fast motion. The adaptive learning rate mechanism is based on the APCE between consecutive frames and is fused into this framework so that the tracking model can be updated according to the target's shape and motion. If the model is updated at a fixed learning rate, it cannot cope with changes in the target's shape and may lose the target in subsequent frames.
The extended Kalman filter in the current study is utilized when the target undergoes occlusion. The condition that decides whether the target is occluded is based on the maximum value of the response map. In the proposed tracker, not only is an extended Kalman filter applied, but a mechanism is also devised for its activation in the STC framework, making the tracker better both qualitatively and quantitatively than various existing trackers.
The current study focuses on addressing limitations in the spatio-temporal context framework by incorporating efficient scale-space formulation, occlusion detection and handling, and adaptive learning rate modules.

1.3. Paper Outline

This paper is organized as follows: a brief explanation of spatio-temporal context tracking is given in Section 2. Section 3 explains the proposed method for online tracking, defining the scale correlation filter, extended Kalman filter, occlusion detection method, and adaptive learning rate mechanism. Experimental parameters are discussed in Section 4. Performance analysis is presented in Section 5. Section 6 includes a discussion, while Section 7 concludes the paper.

2. Spatio-Temporal Context Tracking

The STC tracking algorithm is based on a Bayesian framework for finding the target location by utilizing context information. In every frame, the confidence map is maximized to compute the target center. The feature set around the target location in each frame is defined as X^c = \{ n(o) = (I(o), o) \mid o \in \Omega_c(x^*) \}, where I(o) is the image grayscale value at location o and \Omega_c(x^*) is the context region around the target center x^*. It is shown in Figure 1.
To formulate the tracking problem, the confidence map is computed for estimation of the likelihood of target location:
n(x) = P(x \mid j) = \sum_{n(o) \in X^c} P(x, n(o) \mid j) = \sum_{n(o) \in X^c} P(x \mid n(o), j)\, P(n(o) \mid j)    (1)
where x denotes the target coordinates and j denotes the target. P(n(o) \mid j) is the context prior model that represents the appearance of the context. P(x \mid n(o), j) is the spatial context model that formulates the spatial relation between the target position and its context; it helps identify and resolve ambiguities between different image measurements. The confidence map function n(x) is defined as follows:
n(x) = P(x \mid j) = m\, e^{-\left| \frac{x - x^*}{\theta} \right|^{\xi}}    (2)
where m is a normalization constant, ξ is a shape parameter, and θ is a scale parameter. Appropriate selection of the shape parameter helps the spatial context model learn. Setting ξ > 1 results in oversmoothing of the confidence map near the center, whereas ξ < 1 generates a sharp peak response while learning the spatial context. Due to these issues, STC uses ξ = 1. The context prior model needs to be calculated before learning the spatial context model. The context prior is modeled by the image intensity and a Gaussian weighting function, as given in (3) and (4).
P(n(o) \mid j) = I(o)\, \omega_{\gamma}(o - x^*)    (3)
\omega_{\gamma}(x - x^*) = c\, e^{-\frac{|x - x^*|^2}{\sigma^2}}    (4)
where σ is the scale parameter and c is a normalization constant that restricts (4) to values between 0 and 1. The closer the context location o is to the current target location x^*, the larger the weight that should be set for predicting the target location in the next frame. (5) defines the spatial context model.
P(x \mid n(o), j) = h^{sc}(x - o)    (5)
Substituting (3) and (5) into (1), the confidence map can be written in terms of the spatial context model:
n(x) = \sum_{o \in \Omega_c(x^*)} h^{sc}(x - o)\, I(o)\, \omega_{\gamma}(o - x^*)    (6)
= h^{sc}(x) \otimes \left( I(x)\, \omega_{\gamma}(x - x^*) \right)    (7)
where ⊗ denotes the convolution operation. The fast Fourier transform (FFT) is used to speed up the computation, so (7) can be evaluated in the frequency domain as follows:
\mathcal{F}(n(x)) = \mathcal{F}(h^{sc}(x)) \odot \mathcal{F}\left( I(x)\, \omega_{\gamma}(x - x^*) \right)    (8)
where ⊙ denotes element-wise multiplication. Solving (8) for the spatial context model gives (9).
h^{sc}(x) = \mathcal{F}^{-1}\left( \frac{\mathcal{F}\left( m\, e^{-\left| \frac{x - x^*}{\theta} \right|^{\xi}} \right)}{\mathcal{F}\left( I(x)\, \omega_{\gamma}(x - x^*) \right)} \right)    (9)
where \mathcal{F}^{-1} denotes the inverse FFT in (9). In the STC model, the target position is initialized at the first frame. The spatial context model h^{sc} learns the relative spatial relations between different pixels in the Bayesian framework. For subsequent frames, the STC model H_{t+1}^{stc}(x) is updated using the spatial context model h_t^{sc}(x). By maximizing the confidence map, the target center position x_{t+1}^* at frame (t + 1) can be obtained as given in (10).
x_{t+1}^* = \arg\max_{x \in \Omega_c(x_t^*)} n_{t+1}(x)    (10)
Similarly, a confidence map can be calculated from (11).
n_{t+1}(x) = \mathcal{F}^{-1}\left( \mathcal{F}\left( H_{t+1}^{stc}(x) \right) \odot \mathcal{F}\left( I_{t+1}(x)\, \omega_{\gamma}(x - x_t^*) \right) \right)    (11)
The STC model is updated with learning rate ρ, as given in (12).
H_{t+1}^{stc} = (1 - \rho)\, H_t^{stc} + \rho\, h_t^{sc}    (12)
where ρ is the learning rate and h_t^{sc} is the spatial context model computed in (9).
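For illustration, a minimal NumPy sketch of these learning and update steps is given below. It computes the context prior of (3) and (4), learns the spatial context model via (9), and applies the model update of (12). The grayscale input frame, the omitted boundary handling, the small epsilon guarding against division by zero, and the default parameter values are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def learn_spatial_context(frame, center, ctx_size, sigma, theta=2.25, xi=1.0, m=1.0):
    """Learn the spatial context model h^sc of (9) from one grayscale frame (sketch)."""
    cy, cx = center                      # target center (row, col)
    h, w = ctx_size                      # context region size
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    X, Y = np.meshgrid(xs, ys)
    dist = np.sqrt(X ** 2 + Y ** 2)      # |x - x*| over the context region

    # Desired confidence map of (2): m * exp(-|(x - x*) / theta|^xi)
    conf = m * np.exp(-np.abs(dist / theta) ** xi)

    # Context prior of (3)-(4): intensity weighted by a Gaussian window
    patch = frame[cy - h // 2:cy - h // 2 + h, cx - w // 2:cx - w // 2 + w].astype(float)
    prior = patch * np.exp(-(dist ** 2) / (sigma ** 2))

    # Spatial context model of (9), computed in the frequency domain
    hsc = np.real(np.fft.ifft2(np.fft.fft2(conf) / (np.fft.fft2(prior) + 1e-8)))
    return hsc

def update_stc_model(H_stc, h_sc, rho=0.075):
    """Spatio-temporal context model update of (12)."""
    return (1.0 - rho) * H_stc + rho * h_sc
```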

3. Proposed Tracker

In this section, the proposed tracker is discussed. First, a correlation filter-based adaptive scale scheme is presented. Second, an extended Kalman filter-based occlusion handling mechanism is investigated. Third, an adaptive learning rate scheme is described. The execution scheme of the proposed tracker is shown in Figure 2. In each image sequence, the location of the target of interest is initialized manually in the first frame from the given ground truth. Afterward, the target confidence map is calculated, and sample patches at a set of different scales are estimated from the STC confidence map. Then, the maximum value of the response map is calculated. If this value is less than a fixed threshold, the extended Kalman filter is activated; the Kalman filter predicts the location in the next frame and updates the tracking model during this period. Once the maximum value of the response map exceeds the fixed threshold, the Kalman filter is deactivated. Afterward, the learning rate is updated, and the entire tracking model is updated based on the calculated position.
Different variables and notations used in the following sections are presented in Table 1.

3.1. Scale Space Tracking

Discriminative correlation filters are widely used in visual object tracking. For estimating the target scale, a scale correlation filter-based tracking model is used. It first extracts samples at different scales around the target position; then, a HoG feature pyramid is built from these samples. To find the optimal correlation filter, the cost function given in (13) needs to be minimized.
\varepsilon = \left\| \sum_{l=1}^{d} h^l \star f^l - g \right\|^2 + \lambda \sum_{l=1}^{d} \left\| h^l \right\|^2    (13)
where g is the desired Gaussian-shaped correlation output, λ is the regularization term, ⋆ denotes circular correlation, f^l is the l-th dimension of the HoG features extracted from the sample, d is the total number of HoG feature dimensions, and h^l is the correlation filter to be learned. The solution of (13) in the frequency domain is given in (14).
H^l = \frac{\bar{G}\, F^l}{\sum_{k=1}^{d} \bar{F^k}\, F^k + \lambda}    (14)
By minimizing the output error over all training patches, an optimal filter can be obtained. However, this is not suitable for online tracking because of its computational cost. For efficient tracking, the numerator and denominator of the correlation filter H^l are updated separately, as given in (15) and (16).
A_t^l = (1 - \gamma)\, A_{t-1}^l + \gamma\, \bar{G}_t\, F_t^l    (15)
B_t = (1 - \gamma)\, B_{t-1} + \gamma \sum_{k=1}^{d} \bar{F_t^k}\, F_t^k    (16)
where γ is the learning rate. By maximizing the correlation score, the target state can be determined as given in (17).
y = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} \bar{A^l}\, Z^l}{B + \lambda} \right\}    (17)
where Z^l denotes the HoG features extracted at the predicted target location.
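As a sketch of (14)-(17): assuming the HoG features of the S scale samples are stacked into a d × S array, the numerator/denominator updates of (15) and (16) and the response of (17) can be written in NumPy as below. This illustrates a DSST-style scale filter; the learning rate and regularization values are placeholders, not the authors' code.

```python
import numpy as np

def init_scale_filter(feat, g):
    """feat: d x S HoG features of S scale samples; g: desired 1-D Gaussian over scales.
    Returns the numerator A and denominator B of (14)."""
    F = np.fft.fft(feat, axis=1)
    G = np.fft.fft(g)
    A = np.conj(G)[None, :] * F                      # numerator, per feature dimension
    B = np.sum(np.conj(F) * F, axis=0).real          # denominator, summed over dimensions
    return A, B

def update_scale_filter(A, B, feat, g, gamma=0.025):
    """Running update of the numerator and denominator, Eqs. (15)-(16)."""
    F = np.fft.fft(feat, axis=1)
    G = np.fft.fft(g)
    A = (1 - gamma) * A + gamma * np.conj(G)[None, :] * F
    B = (1 - gamma) * B + gamma * np.sum(np.conj(F) * F, axis=0).real
    return A, B

def scale_response(A, B, z, lam=0.01):
    """Correlation score over candidate scales, Eq. (17); the best scale is its argmax."""
    Z = np.fft.fft(z, axis=1)
    y = np.fft.ifft(np.sum(np.conj(A) * Z, axis=0) / (B + lam)).real
    return y
```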

3.2. Extended Kalman Filter

Within the visual object tracking research area, the EKF is widely used for state estimation. The target localization problem can be viewed as an estimation problem, since the filter provides measurement-based prediction. Around the current estimate, the EKF linearizes the nonlinear equations and then applies the Kalman filter to the linearized model [42]. The EKF involves two steps, prediction and correction. During prediction, the state and covariance estimates are computed for the current frame using (18) and (19).
x_t^- = A\, \hat{x}_{t-1} + B u_t    (18)
P_t^- = A\, P_{t-1}\, A^T + Q    (19)
where x_t^- is the predicted state vector, Bu_t is the process noise/control term, A is the process Jacobian, Q is the process noise covariance, and P_t^- is the predicted error covariance. During correction, the Kalman gain K_t is calculated; it balances the prior estimation uncertainty against the measurement noise, as given in (20).
K_t = P_t^- J_H^T \left( J_H P_t^- J_H^T + R \right)^{-1}    (20)
where J_H is the measurement Jacobian and R is the measurement noise covariance. The state estimate is updated using the prior estimate and the error between the measurement and the predicted measurement, as given in (21).
\hat{x}_t = x_t^- + K_t \left( z_t - J_H x_t^- \right)    (21)
The difference (z_t - J_H x_t^-) is called the measurement innovation or residual; it reflects the discrepancy between the predicted measurement J_H x_t^- and the actual measurement z_t.
The a posteriori estimate of the error covariance is given in (22).
P_t = \left( I - K_t J_H \right) P_t^-    (22)
where P_t is the updated error covariance, J_H is the matrix relating the state to the measurement, and K_t is the Kalman gain computed in (20).
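To make the prediction/correction cycle of (18)-(22) concrete, the sketch below implements the filter for the target center with a constant-velocity motion model; under this linear model the Jacobians A and J_H reduce to constant matrices, and the noise covariances Q and R use illustrative values rather than the paper's settings.

```python
import numpy as np

class TargetEKF:
    """Kalman prediction/correction for the target center, following (18)-(22)."""

    def __init__(self, x0, y0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0])            # state: [x, y, vx, vy]
        self.P = np.eye(4)                                # error covariance
        self.A = np.array([[1., 0., 1., 0.],              # process Jacobian (constant velocity)
                           [0., 1., 0., 1.],
                           [0., 0., 1., 0.],
                           [0., 0., 0., 1.]])
        self.JH = np.array([[1., 0., 0., 0.],             # measurement Jacobian (position only)
                            [0., 1., 0., 0.]])
        self.Q = q * np.eye(4)                            # process noise covariance
        self.R = r * np.eye(2)                            # measurement noise covariance

    def predict(self):
        # (18)-(19): project the state and error covariance forward
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:2]                                 # predicted target center

    def correct(self, z):
        # (20): Kalman gain
        S = self.JH @ self.P @ self.JH.T + self.R
        K = self.P @ self.JH.T @ np.linalg.inv(S)
        # (21): update the state with the measurement innovation
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.JH @ self.x)
        # (22): update the error covariance
        self.P = (np.eye(4) - K @ self.JH) @ self.P
        return self.x[:2]
```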

3.3. Occlusion Detection

When the target undergoes occlusion, the STC model is updated incorrectly, and the target is eventually lost. The maximum value of the confidence map is therefore used to detect occlusion, since it changes with the state of the target: if the target is occluded, the peak of the response map is small, and when the target reappears, its value increases again. The value of the response map determines whether the target is tracked by the improved STC or by the EKF. For a given input image sequence, the confidence map is first computed in the frequency domain. If the target is severely occluded, the EKF predicts the position and updates the improved STC through a feedback loop for the next frame.
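A compact sketch of this switching logic is given below; the threshold value is illustrative, not the paper's setting, and ekf refers to the filter sketched in Section 3.2.

```python
def is_occluded(conf_map, threshold=0.15):
    """Flag occlusion when the peak of the confidence map falls below a fixed threshold."""
    return conf_map.max() < threshold

# During tracking (sketch): the peak decides which module supplies the position.
# if is_occluded(conf_map):
#     position = ekf.predict()        # occlusion: trust the EKF prediction
# else:
#     position = peak_location(conf_map)   # hypothetical helper returning the argmax
#     ekf.correct(position)           # feed the STC estimate back as a measurement
```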

3.4. Adaptive Learning Rate

The model is updated adaptively by using the average peak to correlation energy (APCE) [43], defined in (23).
\mathrm{APCE}_t = \frac{\left| f_{\max} - f_{\min} \right|^2}{\mathrm{mean}\left( \sum_{w,h} \left( f_{w,h} - f_{\min} \right)^2 \right)}    (23)
where f_max is the maximum response value, f_min is the minimum response value, and f_{w,h} is the response value at row w and column h of the response map. APCE quantifies the degree of fluctuation of the response map and the reliability of the detected target. (24) gives the expression for the model update.
bz_t = \frac{\mathrm{APCE}_t}{\mathrm{APCE}_0}, \quad \gamma_t = \begin{cases} \gamma_0, & bz_t > bz_0 \\ \gamma_0 \cdot bz_t, & \text{otherwise} \end{cases}    (24)
where APCE_t is the value at the t-th frame, APCE_0 is the value at the initial frame, and bz_0 is the threshold that decides the learning rate.
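The APCE computation of (23) and the learning-rate schedule of (24) can be sketched as follows; the threshold bz_0 = 0.5 is an illustrative value, not the paper's setting.

```python
import numpy as np

def apce(response):
    """Average peak-to-correlation energy of a response map, Eq. (23)."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)

def adaptive_learning_rate(apce_t, apce_0, gamma_0, bz_0=0.5):
    """Learning-rate schedule of Eq. (24)."""
    bz_t = apce_t / apce_0
    return gamma_0 if bz_t > bz_0 else gamma_0 * bz_t
```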
Algorithm 1 is presented below.
Algorithm 1: Proposed Tracker at time step t
Input: Image Sequence of n Frames. Position of Target at First Frame.
Output: Target Position for each frame in Image Sequence.
for frame 1 to n frames.
(1)
Calculate context prior model using (3).
(2)
Calculate confidence map using (11).
(3)
Calculate target center.
(4)
Calculate translation correlation using (17).
(5)
Calculate the maximum value of response map.
(6)
if response map < threshold
(7)
new position = Kalman prediction
(8)
end
(9)
Calculate Kalman gain using (20).
(10)
Estimate position for next frame using (21).
(11)
Estimate error covariance using (22).
(12)
Calculate APCE using (23).
(13)
Update model using (24).
(14)
Calculate scale correlation using (17).
(15)
Update translation and scale model using (15) and (16).
(16)
Update context prior model using (3).
(17)
Update spatial context model using (9).
(18)
Update spatio-temporal context model using (12).
(19)
Calculate the target position for each frame.
(20)
Draw a rectangle on the target in each frame.
End

4. Experiments

To evaluate the performance of the proposed tracker both qualitatively and quantitatively, extensive experiments were conducted on image sequences selected from the Temple Color (TC)-128 [44], OTB2013 [45], OTB2015 [46], and UAV123 [47] datasets. The challenging factors associated with these sequences include scale variations, deformation, partial or full occlusions, background clutter, illumination variations, and fast motion.

4.1. Evaluation Criteria

The proposed tracker was compared quantitatively with existing tracking methods based on distance precision rate (DPR) and center location error (CLE). CLE is defined as the Euclidean distance between the target center estimated by the tracker and the ground truth. The calculation formula is given in (25).
\mathrm{CLE} = \sqrt{\left( x_i - x_{gt} \right)^2 + \left( y_i - y_{gt} \right)^2}    (25)
where (x_i, y_i) is the position calculated by the tracking algorithm and (x_{gt}, y_{gt}) is the ground truth position. DPR is the percentage of frames in which the estimated CLE is smaller than a given distance threshold.
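As an illustration, both metrics can be computed per sequence as follows (a minimal sketch; predicted and ground-truth centers are assumed to be given as (x, y) pairs).

```python
import numpy as np

def center_location_error(pred, gt):
    """Center location error of Eq. (25) for a single frame."""
    return float(np.hypot(pred[0] - gt[0], pred[1] - gt[1]))

def distance_precision_rate(preds, gts, threshold=20):
    """Fraction of frames whose CLE is within the distance threshold (20 pixels here)."""
    errors = np.array([center_location_error(p, g) for p, g in zip(preds, gts)])
    return float(np.mean(errors <= threshold))
```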

4.2. Parameter Settings

We set the same parameter values as in [19,29]. The map function parameters, denoted α and β in [19] (the scale parameter θ and shape parameter ξ in (2)), were set to 2.25 and 1, respectively. The regularization weight λ was set to 0.01. The standard deviation of the desired scale filter output was 0.25, the number of scales was 33, and the scale factor was 1.02 [29]. These values turned out to be the best setting in our implementation; changing them led to inferior tracker performance. The threshold for DPR is 20 pixels.
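For reference, these settings can be gathered into a single configuration; the key names below are illustrative, not the authors' variable names.

```python
# Parameter values reported in Section 4.2 (names are illustrative).
PARAMS = {
    "theta": 2.25,          # map-function scale parameter [19]
    "xi": 1,                # map-function shape parameter [19]
    "lambda_reg": 0.01,     # regularization weight in (13)
    "scale_sigma": 0.25,    # std. dev. of the desired scale-filter output [29]
    "num_scales": 33,       # number of scale samples [29]
    "scale_step": 1.02,     # factor between adjacent scales [29]
    "dpr_threshold": 20,    # distance precision threshold (pixels)
}
```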

5. Performance Analysis

5.1. Quantitative Analysis

The DPR comparison is given in Table 2. In the sequences (Baby_ce, Car9, Carchasing_ce4, Crossing, Jogging2, Ring_ce, Singer1, Tennis_ce2, and Tennis_ce3), the proposed tracker outperforms Modified KCF, STC, MACF, and DCFCA. In the sequences (Building3, Carchasing_ce3, Cardark, Cup, Juice, Man, Plate_ce2, and Sunshade), all tracking methods have similar performance. In the sequences (Bike3, Busstation_ce2, Car4, Girl2, Guitar_ce2, Human3, Jogging1, Skating2, and Walking2), the proposed tracker has a slightly lower precision value. However, the proposed tracker has a higher mean precision than the other tracking methods.
The average center location error comparison is given in Table 3. In the sequences (Baby_ce, Car4, Carchasing_ce4, Cardark, Crossing, Plate_ce2, Singer1, Tennis_ce2, and Tennis_ce3), the proposed tracker outperforms Modified KCF, STC, MACF, and DCFCA. In the sequences (Bike3, Building3, Busstation_ce2, Carchasing_ce3, Cup, Girl2, Guitar_ce2, Human3, Jogging1, Jogging2, Juice, Man, Ring_ce, Skating2, Sunshade, and Walking2), the proposed tracker has a slightly higher error value. However, the proposed tracker has the lowest mean error compared with the other tracking methods.
The frames per second (FPS) comparison is given in Table 4. Although the proposed tracker outperforms Modified KCF, STC, MACF, AFAM-PEC, Modified STC, and DCFCA in terms of accuracy in the sequences (Baby_ce, Building3, Car4, Carchasing_ce3, Carchasing_ce4, Cardark, Crossing, Cup, Jogging2, Man, Plate_ce2, Tennis_ce2, and Tennis_ce3), its frame rate is lower than that of the other tracking methods.
The precision plots are shown in Figure 3. Table 2 provides the mean precision value of each tracker over an entire image sequence; however, a tracker may drift for a few frames and then recover. Therefore, these plots are presented to review tracker performance over the whole image sequence. Various challenges are present in the sequences, such as occlusion, scale variations, and deformation. In the sequences (Baby_ce, Carchasing_ce3, Car4, Cardark, Carchasing_ce4, Crossing, Cup, Jogging1, Jogging2, Guitar_ce2, Man, Plate_ce2, Ring_ce, Singer1, Sunshade, Tennis_ce2, and Tennis_ce3), the proposed tracker has the highest precision over the entire sequence. In the sequences (Bike3, Building3, Busstation_ce2, Car9, Girl2, Human3, Juice, Skating2, and Walking2), the proposed tracker has slightly lower precision.
The center location error plots are shown in Figure 4. In Table 3, the average center location error is calculated for each image sequence. It gives an idea of tracker performance, but it does not capture all the information necessary to review a tracker: a tracker might drift for a few frames in a sequence, resulting in a high error for those frames, and once it recovers and tracks the target accurately again the error becomes low, yet the average remains high. Therefore, these plots are presented to review tracker performance on each frame. The proposed tracker performs consistently over the entire duration for the sequences (Baby_ce, Car4, Car9, Cardark, Crossing, Carchasing_ce3, Cup, Guitar_ce2, Juice, Jogging2, Ring_ce, and Tennis_ce3). In the sequences (Girl2, Human3, Skating2, and Walking2), the tracker drifts for some frames but recovers after a few frames; for the majority of frames in these sequences the proposed tracker accurately tracks the target, yet the accumulated error is high because of the drift. In the sequences (Bike3, Building3, Busstation_ce2, Jogging1, Man, Plate_ce2, Singer1, Sunshade, and Tennis_ce2), the proposed method performs similarly to the compared trackers.

5.2. Qualitative Analysis

Figure 5 depicts the qualitative results of the proposed tracker against four state-of-the-art trackers over 26 image sequences involving various challenges such as partial or full occlusions, scale variations, and background clutter. MACF contains tracking components similar to our approach, i.e., a scale correlation filter and a Kalman filter. Even though MACF performs favorably in sequences involving scale variations, it does not deal effectively with sequences involving occlusions (Girl2, Human3, Jogging1, Jogging2, and Skating2). STC uses intensity features and the response of a single translation filter to estimate scale. This makes STC a comparatively fast tracker; however, it has no occlusion detection or handling mechanism, so its results suffer in the sequences (Busstation_ce2, Girl2, Human3, Jogging1, and Jogging2). Moreover, because it relies on only one translation filter, its results are also affected in (Car9, Crossing, and Tennis_ce3). DCFCA combines correlation filtering with a context-aware formulation; however, it is not robust to occlusions, scale variations, and deformation, and therefore does not perform well in the sequences (Car9, Carchasing_ce4, Girl2, Human3, Jogging1, Jogging2, Skating2, and Tennis_ce3). Modified KCF performs significantly well in sequences involving occlusions, but it does not perform well in the scale variation sequences (Baby_ce, Car9, Carchasing_ce4, Guitar_ce2, Ring_ce, Singer1, Tennis_ce2, and Tennis_ce3).
It can be seen that the proposed tracking method outperforms the other trackers in these sequences. In the sequences (Baby_ce, Car4, Carchasing_ce4, Crossing, Cup, Jogging1, Jogging2, Guitar_ce2, Plate_ce2, Ring_ce, Singer1, Tennis_ce2, and Tennis_ce3), the proposed method can accurately track the target for the entire image sequence. In the sequences (Bike3, Busstation_ce2, Girl2, Human3, Skating2, and Walking2), the tracker cannot accurately track the target for the entire sequence. In the sequences (Building3, Carchasing_ce3, Cardark, Juice, Man, and Sunshade), all trackers have similar performance.

6. Discussion

It can be seen from Figure 5 that the proposed tracking method outperforms the other trackers in these sequences. We discuss several observations from the performance analysis. This performance can be attributed to three reasons. First, the scale correlation filter incorporated in the STC framework handles scale changes more effectively than the original STC scale update. This scale filter learns the target appearance at different scales, enabling the tracker to follow the target accurately under scale variation. It can be seen in the sequences (Baby_ce, Car4, Car9, Carchasing_ce3, Carchasing_ce4, Plate_ce2, and Ring_ce) that the proposed tracker deals better with scale variation of the target. Second, the incorporation of an extended Kalman filter makes the tracker robust to occlusions: when the target undergoes partial or full occlusion, the EKF predicts the target state and updates the tracking model. It can be seen in the sequences (Girl2, Jogging1, and Jogging2) that the proposed method effectively handles occlusion of the target. Third, the fusion of an APCE-based adaptive learning rate further improves tracking performance under illumination variations, motion blur, and background clutter. It can be seen in the sequences (Building3, Cardark, Crossing, Cup, Guitar_ce2, Juice, Man, Singer1, Sunshade, Tennis_ce2, and Tennis_ce3) that the tracker accurately follows the target, because its appearance model copes with changes in the environment by utilizing information in each frame.
Even though the proposed tracker performs significantly better than various trackers, there are a few sequences (Bike3, Busstation_ce2, Human3, Skating2, and Walking2) in which it does not track the target accurately. In Bike3 the tracker fails due to fast movement combined with scale variation. In Skating2 the tracker fails due to deformation of the target. In (Busstation_ce2, Human3, and Walking2) the tracker fails due to occlusions, fast motion, and motion blur. These limitations can be addressed in a few directions, such as developing a better occlusion detection and handling mechanism, extending aspect ratio adaptability, and incorporating a context-aware formulation.

7. Conclusions

This article gives insight into a robust tracking algorithm based on STC, incorporating a pyramid-representation-based scale correlation filter for adaptive scale estimation, an extended Kalman filter for occlusion handling, and an APCE criterion for the adaptive learning rate of the tracking model. Experimental results indicate that the proposed tracking algorithm performs better than various state-of-the-art trackers both qualitatively and quantitatively. The tracker achieves the desired performance, but the target may still be lost in some cases such as occlusion, motion blur, and fast motion. To address these limitations, our future work includes extending the current framework with context-aware and target adaptation formulations, developing better occlusion judgment criteria, incorporating more features to learn the target appearance, and extending aspect ratio adaptability.

Author Contributions

Writing—original draft presentation, K.M.; conceptualization, K.M., A.J. and A.A.; supervision, A.A. and A.J.; writing—review and editing, K.M., A.A., A.J., B.K., M.M. and K.M.C.; data analysis and interpretation, K.M., B.K. and M.M.; investigation, K.M., A.A. and B.K.; methodology, K.M., A.J. and A.A.; software, B.K., M.M., K.M.C. and A.H.M.; visualization, K.M., M.M. and B.K.; resources, A.J., K.M.C. and A.H.M.; project administration, A.A. and A.J.; funding acquisition, K.M.C. and A.H.M. All authors have read and agreed to the published version of the manuscript.

Funding

No funding available.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yao, L.; Liu, Y.; Huang, S. Spatio-temporal information for human action recognition. Eurasip J. Image Video Process. 2016, 39, 1–9. [Google Scholar] [CrossRef] [Green Version]
  2. Wang, X.; Chen, D.; Yang, T.; Hu, B.; Zhang, J. Action recognition based on object tracking and dense trajectories. In Proceedings of the International Conference on Automatica (ICA-ACCA), Curico, Chile, 19–21 October 2016; pp. 1–5. [Google Scholar]
  3. Aggarwal, J.K.; Xia, L. Human activity recognition from 3d data: A review. Pattern Recognit. Lett. 2014, 48, 70–80. [Google Scholar] [CrossRef]
  4. Hui, Z.; Yaohua, X.; Lu, M.; Jiansheng, F. Vision-based real-time traffic accident detection. In Proceedings of the 11th World Congress on Intelligent Control and Automation (WCICA), Shenyang, China, 29 June–4 July 2014; pp. 1035–1038. [Google Scholar]
  5. Tian, B.; Yao, Q.; Gu, Y.; Wang, K.; Li, Y. Video processing techniques for traffic flow monitoring: A survey. In Proceedings of the 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), Washington, DC, USA, 5–7 October 2011; pp. 1103–1108. [Google Scholar]
  6. Li, J.; Zie, J.; Hu, W.; Wang, L.; Yang, A. Research on the improvement of vision target tracking algorithm for Internet of things technology and Simple extended application in pellet ore phase. Future Gener. Comput. Syst. 2020, 110, 233–242. [Google Scholar] [CrossRef]
  7. Zhang, H.; Zhang, Z.; Zhang, L.; Yang, Y.; Kang, Q.; Sun, D. Object Tracking for a Smart City Using IoT and Edge Computing. Sensors 2019, 19, 1987. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  8. Gong, X.; Le, Z.; Wang, H.; Wu, Y. Study on the Moving Target Tracking Based on Vision DSP. Sensors 2020, 20, 6494. [Google Scholar] [CrossRef]
  9. Oh, S.H.; Javed, S.; Jung, S.K. Foreground Object Detection and Tracking for Visual Surveillance System: A Hybrid Approach. In Proceedings of the 11th International Conference on Frontiers of Information Technology, Islamabad, Pakistan, 16–18 December 2013; pp. 13–18. [Google Scholar]
  10. Staniszewski, M.; Foszner, P.; Kostorz, K.; Michalczuk, A.; Wereszczyński, K.; Cogiel, M.; Golba, D.; Wojciechowski, K.; Polański, A. Application of Crowd Simulations in the Evaluation of Tracking Algorithms. Sensors 2020, 20, 4960. [Google Scholar] [CrossRef] [PubMed]
  11. Ali, A.; Kausar, H.; Muhammad, I.K. Automatic visual tracking and firing system for anti-aircraft machine gun. In Proceedings of the 6th International Bhurban Conference on Applied Sciences & Technology (IBCAST), Islamabad, Pakistan, 19–22 January 2009; pp. 253–257. [Google Scholar]
  12. Vasconcelos, M.J.M.; Ventura, S.M.R.; Freitas, D.R.S.; Tavares, J.M.R.S. Towards the automatic study of the vocal tract from magnetic resonance images. J. Voice Off. J. Voice Found. 2011, 25, 732–742. [Google Scholar] [CrossRef]
  13. Zhou, W.; Wu, C.; Yu, X.; Gao, Y.; Du, W. Automatic fovea center localization in retinal images using saliency-guided object discovery and feature extraction. J. Med. Imaging Health Inf. 2017, 7, 1070–1077. [Google Scholar] [CrossRef]
  14. Ali, A.; Jalil, A.; Niu, J.; Zhao, X.; Rathore, S.; Ahmed, J.; Iftikhar, M.A. Visual object tracking—classical and contemporary approaches. Front. Comput. Sci. 2016, 10, 167–188. [Google Scholar] [CrossRef]
  15. Yoon, G.-J.; Hwang, H.J.; Yoon, S.M. Visual Object Tracking Using Structured Sparse PCA-Based Appearance Representation and Online Learning. Sensors 2018, 18, 3513. [Google Scholar] [CrossRef] [Green Version]
  16. Fiaz, M.; Mahmood, A.; Javed, S.; Jung, S.K. Handcrafted and deep trackers: Recent visual object tracking approaches and trends. ACM Comput. Surv. 2019, 52, 1–44. [Google Scholar] [CrossRef]
  17. Mei, X.; Ling, H. Robust Visual Tracking and Vehicle Classification via Sparse Representation. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2259–2272. [Google Scholar]
  18. Hare, S.; Saffari, A.; Torr, P.H.S. Struck: Structured output tracking with kernels. In Proceedings of the International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 263–270. [Google Scholar]
  19. Zhang, K.; Zhang, L.; Liu, Q.; Zhang, D.; Yang, M.H. Fast visual tracking via dense spatio-temporal context learning. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–7 September 2014; pp. 127–141. [Google Scholar]
  20. Zhang, Y.; Wang, L.; Qin, J. Adaptive spatio-temporal context learning for visual tracking. Imaging Sci. J. 2019, 67, 136–147. [Google Scholar] [CrossRef]
  21. Wang, H.; Liu, P.; Du, Y.; Liu, X. Online convolution network tracking via spatio-temporal context. Multimed. Tools Appl. 2019, 78, 257–270. [Google Scholar] [CrossRef]
  22. Wan, H.; Li, W.; Ye, G. An improved spatio-temporal context tracking algorithm. In Proceedings of the 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), Wuhan, China, 31 May–2 June 2018; pp. 1320–1325. [Google Scholar]
  23. Li, W.G.; Wan, H. An improved spatio-temporal context tracking algorithm based on scale correlation filter. Adv. Mech. Eng. 2019, 11, 1–11. [Google Scholar] [CrossRef] [Green Version]
  24. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 583–596. [Google Scholar] [CrossRef] [Green Version]
  25. Ahmed, J.; Ali, A.; Khan, A. Stabilized active camera tracking system. J. Real-Time Image Proc. 2016, 11, 315–334. [Google Scholar] [CrossRef]
  26. Mueller, M.; Smith, N.; Ghanem, B. Context-Aware Correlation Filter Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1387–1395. [Google Scholar]
  27. Ali, A.; Jalil, A.; Ahmed, J. A new template updating method for correlation tracking. In Proceedings of the International Conference on Image and Vision Computing (IVCNZ), Palmerston North, New Zealand, 21–22 November 2016; pp. 1–6. [Google Scholar]
  28. Shin, J.; Kim, H.; Kim, D.; Paik, J. Fast and Robust Object Tracking Using Tracking Failure Detection in Kernelized Correlation Filter. Appl. Sci. 2020, 10, 713. [Google Scholar] [CrossRef] [Green Version]
  29. Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Accurate scale estimation for robust visual tracking. In Proceedings of the British Machine Vision Conference (BMVC), Nottingham, UK, 9 February 2014; pp. 1–11. [Google Scholar]
  30. Danelljan, M.; Hager, G.; Khan, F.S.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1561–1575. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  31. Zhang, Y.; Yang, Y.; Zhou, W.; Shi, L.; Li, D. Motion-Aware Correlation Filters for Online Visual Tracking. Sensors 2018, 18, 3937. [Google Scholar] [CrossRef] [Green Version]
  32. Ma, H.; Lin, Z.; Acton, S.T. FAST: Fast and Accurate Scale Estimation for Tracking. IEEE Signal Process. Lett. 2020, 27, 161–165. [Google Scholar] [CrossRef]
  33. Li, Y.; Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–7 September 2014; pp. 254–265. [Google Scholar]
  34. Panqiao, C.; Mengzhao, Y. STC Tracking Algorithm Based on Kalman Filter. In Proceedings of the 4th International Conference on Machinery, Materials and Computing Technology, Hangzhou, China, 23–24 January 2016; pp. 1916–1920. [Google Scholar]
  35. Munir, F.; Minhas, F.; Jalil, A.; Jeon, M. Real time eye tracking using Kalman extended spatio-temporal context learning. In Proceedings of the Second International Workshop on Pattern Recognition, Singapore, 1–3 May 2017; p. 104431. [Google Scholar]
  36. Zhang, J.; Liu, Y.; Liu, H.; Wang, J. Learning Local–Global Multiple Correlation Filters for Robust Visual Tracking with Kalman Filter Redetection. Sensors 2021, 21, 1129. [Google Scholar]
  37. Khalkhali, M.; Vahedian, A.; Yazdi, H.S. Vehicle tracking with Kalman filter using online situation assessment. Robot. Auton. Syst. 2020, 131, 103596. [Google Scholar] [CrossRef]
  38. Ali, A.; Jalil, A.; Ahmed, J.; Iftikhar, M.A.; Hussain, M. Correlation, Kalman filter and adaptive fast mean shift based heuristic approach for robust visual tracking. Signal Image Video Process. 2015, 9, 1567–1585. [Google Scholar] [CrossRef]
  39. Yang, H.; Wang, J.; Miao, Y.; Yang, Y.; Zhao, Z.; Wang, Z.; Sun, Q.; Wu, D.O. Combining Spatio-Temporal Context and Kalman Filtering for Visual Tracking. Mathematics 2019, 7, 1059. [Google Scholar] [CrossRef] [Green Version]
  40. Mehmood, K.; Jalil, A.; Ali, A.; Khan, B.; Murad, M.; Khan, W.U.; He, Y. Context-Aware and Occlusion Handling Mechanism for Online Visual Object Tracking. Electronics 2021, 10, 43. [Google Scholar] [CrossRef]
  41. Khan, B.; Ali, A.; Jalil, A.; Mehmood, K.; Murad, M.; Awan, H. AFAM-PEC: Adaptive Failure Avoidance Tracking Mechanism Using Prediction-Estimation Collaboration. IEEE Access. 2020, 8, 149077–149092. [Google Scholar] [CrossRef]
  42. Zekavat, R.; Buehrer, R.M. An Introduction to Kalman Filtering Implementation for Localization and Tracking Applications. In Handbook of Position Location: Theory, Practice, and Advances, 2nd ed.; Wiley Online Library: Hoboken, NJ, USA, 2018; pp. 143–195. [Google Scholar]
  43. Wang, M.; Liu, Y.; Huang, Z. Large Margin Object Tracking with Circulant Feature Maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4800–4808. [Google Scholar]
  44. Liang, P.; Blasch, E.; Ling, H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. 2015, 24, 5630–5644. [Google Scholar] [CrossRef] [PubMed]
  45. Wu, Y.; Lim, J.; Yang, M.H. Online object tracking: A benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2411–2418. [Google Scholar]
  46. Wu, Y.; Lim, J.; Yang, M.H. Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef] [Green Version]
  47. Mueller, M.; Smith, N.; Ghanem, B. A Benchmark and Simulator for UAV Tracking. In Computer Vision—ECCV 2016. Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9905. [Google Scholar]
Figure 1. Spatial relation between an object and its context.
Figure 2. Flowchart of the proposed tracking model.
Figure 3. Precision plots comparison on TC-128, OTB2013, OTB2015, and UAV123 datasets.
Figure 4. Center location error (in pixels) comparison on TC-128, OTB2013, OTB2015 and UAV123 datasets.
Figure 5. Qualitative comparison on TC-128, OTB2013, OTB2015, and UAV123 datasets.
Table 1. List of notations/variables.
Symbol: Note
H^l: Scale correlation filter (frequency domain) for the l-th feature dimension
F_t^l: HoG feature sample (frequency domain) at the t-th frame
A_t^l: Numerator of the correlation filter at the t-th frame
B_t: Denominator of the correlation filter at the t-th frame
G_t: Desired Gaussian-shaped output (frequency domain)
y: Response map of the correlation filter
x_t: Predicted state at the t-th frame
P_t: Updated error covariance at the t-th frame
Table 2. Distance precision rate at the threshold of 20 pixels.
SequenceProposedModified KCFModified STCSTCAFAM-PECMACFDCFCA
Baby_ce10.5910.5910.6990.45610.997
Bike30.2060.1660.2960.2750.1240.2750.262
Building31111111
Busstation_ce20.2380.8890.8780.1940.92710.886
Car40.9970.9980.4520.9910.9810.998
Car90.9880.3620.9760.2010.9170.9880.424
Carchasing_ce31111111
Carchasing_ce410.4000.5560.9950.19910.717
Cardark1111111
Crossing110.5750.533111
Cup1111111
Girl20.8300.5910.3720.2620.9400.0970.071
Guitar_ce20.5680.5050.1080.5240.5240.5240.581
Human30.3020.0060.0180.0880.7950.0050.006
Jogging 10.8790.9930.9960.2280.9730.2310.231
Jogging 20.9800.9450.2280.1850.9900.1660.160
Juice1111111
Man1111111
Plate_ce21111111
Ring_ce10.9050.1291111
Singer110.81511110.843
Skating20.0740.3020.0760.02300.0140.423
Sunshade110.2281111
Tennis_ce210.6560.1010.652111
Tennis_ce310.0980.1860.6910.1080.1070.108
Walking20.6940.4080.9340.4420.72210.564
Mean Precision0.8370.7170.6040.6530.7940.7460.703
Table 3. Average center location error.
SequenceProposedModified KCFModified STCSTCAFAM-PECMACFDCFCA
Baby_ce3.9381.3837.2512.0240.424.898.67
Bike3131.33123.7086.3781.9775.2083.5887.73
Building32.021.961.791.501.973.762.02
Busstation_ce279.3310.7410.8678.256.23.589.71
Car42.832.93229.814.285.043.082.66
Car95.94210.3613.69205.73.483.08255.42
Carchasing_ce32.683.043.903.552.522.393.05
Carchasing_ce42.0626.73112.992.921402.2416.68
Cardark1.036.043.212.833.351.675.11
Crossing1.232.2427.0534.064.711.642.20
Cup2.824.024.634.842.483.113.85
Girl230.7998.93101.77200.59.32137.62356.78
Guitar_ce219.2059.91168.9329.7219.1219.0316.06
Human366.31249.60348.38210.815.2308.41257.99
Jogging 118.343.728.4050103.8794.9389.44
Jogging 25.464.7443.04104.025.09148.98148.33
Juice2.161.964.635.082.420.911.92
Man2.002.361.321.492.201.732.23
Plate_ce21.231.792.582.341.211.621.83
Ring_ce1.565.2169.551.301.711.801.68
Singer12.5012.846.585.767.223.3412.65
Skating2142.1878.6769.79106.33200.4277.6046.90
Sunshade4.914.5468.684.994.574.204.84
Tennis_ce25.5131.21133.6916.925.665.745.69
Tennis_ce35.7897.2979.5840.7391.190.9590.72
Walking245.1532.0911.9413.8346.344.8122.33
Average Error22.6344.5463.48237.9126.9546.7256.02
Table 4. Frames per second (FPS).
SequenceProposedModified KCFModified STCSTCAFAM-PECMACFDCFCA
Baby_ce14.35101.8734.6893.2027.9859.2100.77
Bike315.12194.8015.8322.8333.14102.0229.93
Building313.8099.7913.2422.8914.6073.872.16
Busstation_ce29.1456.5710.0738.6730.0832.447.50
Car47.1397.9111.1523.1613.0158.676.33
Car92.4015.774.5929.2310.4414.119.39
Carchasing_ce326.47165.0054.62139.4736.8878.3125.51
Carchasing_ce47.334.7711.5160.6211.5332.616.40
Cardark24.13144.2749.02140.6924.3757.092.73
Crossing19.0172.6447.80112.1039.6934.538.44
Cup10.2333.6722.2476.7715.9847.616.57
Girl25.0618.7011.7712.5015.6223.715.31
Guitar_ce20.923.701.8312.105.164.6810.60
Human310.4914.5419.48131.0221.1838.646.84
Jogging 110.3012.2828.2834.3727.8925.849.30
Jogging 27.3223.3116.8138.2322.5130.014.97
Juice9.9639.3619.9999.7020.2438.638.78
Man24.3754.0641.99127.5415.5545.663.92
Plate_ce212.91166.836.0725.0922.9069.0179.17
Ring_ce19.18126.0433.4567.6815.9965.1125.46
Singer11.5716.403.1220.6213.479.5615.59
Skating21.1310.942.6958.0015.288.0516.18
Sunshade10.8141.0230.9775.3925.9840.136.15
Tennis_ce23.5714.738.7425.1113.9428.918.60
Tennis_ce36.7019.8112.4428.7222.9958.248.47
Walking26.6924.4520.7534.6824.6544.022.94
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
