Article

Weakly-Supervised Video Anomaly Detection with MTDA-Net

1 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
2 School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(22), 4623; https://doi.org/10.3390/electronics12224623
Submission received: 8 October 2023 / Revised: 1 November 2023 / Accepted: 10 November 2023 / Published: 12 November 2023

Abstract:
Weakly supervised anomalous behavior detection is currently a popular research area. Compared with semi-supervised anomalous behavior detection, weakly supervised learning eliminates the need to crop videos and avoids semi-supervised learning's difficulty in handling long videos. Previous work has used graph convolution or self-attention mechanisms to model temporal relationships. However, these methods tend to model temporal relationships at a single scale and give little consideration to how different temporal relationships should be aggregated. In this paper, we propose a weakly supervised anomaly detection framework, MTDA-Net, with an emphasis on modeling different temporal relationships and enhancing semantic discrimination. To this end, we construct a new plug-and-play module, MTDA, which uses three branches, Multi-headed Attention (MHA), Temporal Shift (TS), and Dilated Aggregation (DA), to extract different temporal sequences. Specifically, the MHA branch models the video information globally and projects the features into different semantic spaces to enhance the expressiveness and discrimination of the features. The DA branch extracts temporal information at different scales via dilated convolution and captures the temporal features of local regions in the video. The TS branch fuses the features of adjacent frames on a local scale and enhances the information flow. MTDA-Net can learn the temporal relationships between video segments on different branches and learn powerful video representations based on these relationships. Experimental results on the XD-Violence dataset show that MTDA-Net significantly improves the detection accuracy of abnormal behaviors.

1. Introduction

Detecting and stopping anomalous behavior through video surveillance is highly valuable work. In the past, dedicated staff were responsible for watching the monitoring feeds and spotting anomalous behavior. However, with the spread of surveillance systems and the sharp increase in the number of monitoring devices, manual monitoring can no longer keep up with the growing demands of practical applications. With the great progress of deep neural networks in video behavior recognition [1,2,3], using neural networks to detect anomalous behavior efficiently and quickly has become an attractive approach.
At present, more and more researchers are studying anomaly detection methods [4,5,6]. The traditional approach treats anomaly detection as a subtask of behavior detection and addresses it with supervised learning [7,8]. However, unlike general behavior detection, anomaly detection often suffers from an extreme imbalance between positive and negative samples due to the sensitivity and scarcity of anomaly data, which makes supervised learning unsatisfactory. Therefore, semi-supervised anomaly detection methods were proposed [9], in which the model learns from a large number of positive (normal) samples and treats samples whose features deviate strongly from the positive samples as negative. However, semi-supervised datasets require cropping out the segments in which the anomalous behavior occurs. These clips are often only 2–3 s long, so this approach not only handles long videos poorly but also struggles to localize where the anomalous behavior occurs. In addition, because positive samples can never be exhaustively covered, this approach has a high false positive rate.
In view of the shortcomings of semi-supervised learning, one solution is to annotate each frame of the video and construct a frame-level dataset, such as the Violent Scene Detection (VSD) dataset [10]. However, annotating every frame is a time-consuming and laborious process, and it is difficult to construct large-scale datasets this way. To address this problem, weakly supervised anomaly detection has received attention [11,12,13]. Weakly supervised Video Anomaly Detection (VAD) assumes that only video-level normal or anomaly labels are given during training, without annotating where the anomalous behavior occurs. This greatly reduces the annotation cost and difficulty and improves data utilization and scalability. Compared with supervised learning, weakly supervised learning has the following advantages: (1) The training data are uncropped and contain both normal and anomalous behaviors, allowing the model to learn more fine-grained features. (2) The training data only require video-level labels, yet the model can ultimately localize the anomalous behavior. (3) Since the training data do not require frame-level labels, this setting is better suited to constructing large-scale datasets.
A popular method for weakly supervised learning is multiple instance learning, which treats each video as a bag, each video frame as an instance, and assumes that a bag contains the anomaly behavior if and only if at least one instance is an anomaly. In this paper, we aim to study the weakly supervised anomaly detection method based on multiple instance learning (MIL).
In anomaly detection, it is essential to model the temporal relations of video clips to capture contextual information. Mainstream methods can be roughly divided into three categories. The first is CNN-based methods, which use dilated convolution to enlarge the receptive field and improve temporal modeling [14,15]. The second is RNN-based methods, which exploit the memory and recurrence of RNNs to extract temporal information from video frames, such as actions, scenes, and semantics [16]. The third is attention-based methods, which use the selectivity and flexibility of attention to extract the most important and relevant information from video frames, such as key frames, key regions, and key objects [17]. However, CNN-based methods often have large computational costs and unsatisfactory results, RNN-based methods are prone to gradient vanishing and long-term dependency problems, and attention-based methods, although flexible, are easily disturbed by noise when modeling globally.
To address these issues, we propose a novel temporal relation aggregation network MTDA-Net, which can explicitly leverage the relations between different segments and learn powerful representations based on these relations. We construct a new MTDA module, which consists of three branches: MHA, TS, and DA, to extract different temporal information. Specifically, we capture the global features of the video on the MHA branch and map the features to different semantic spaces. On the DA branch, we obtain local temporal features via dilated convolutions of different scales. To further facilitate the information flow, we introduce a lightweight adjacent frame feature fusion branch TS, which combines the features of adjacent levels on a local scale. Finally, the features of the three branches are concatenated along the channel dimension and passed through a fully connected layer to extract more informative features.
The main contributions of this paper are summarized as follows:
  • We propose a simple and effective anomaly detection network MTDA-Net, which can explicitly leverage the relations between different video frames and learn discriminative features based on these relations.
  • We construct a new plug-and-play MTDA module, which consists of three branches that complement each other in terms of features and achieves fine-grained context enhancement via feature fusion.
  • We achieve competitive results on the XD-Violence dataset with our model and conduct detailed ablation experiments to explore the effects of different parts of the model.

2. Related Work

2.1. Supervised and Semi-Supervised Anomaly Detection

The traditional approach treats anomaly detection as a subtask of action detection. Smeureanu et al. [8] fed features extracted by pre-trained convolutional neural networks into support vector machines for classification. Wang et al. [7] explored hybrid autoencoder architectures combining an LSTM encoder with a convolutional autoencoder for anomaly detection. Lee et al. [18] detected anomalies using a multi-scale aggregation network. However, because positive and negative samples of real-life anomalies are imbalanced, such methods can perform poorly in practical deployment. One way to address this is to use semi-supervised anomaly detection methods [19,20,21,22], which use only normal samples in the training phase. Akcay et al. [9] proposed a semi-supervised GAN-based anomaly detection algorithm, which detects anomalies using conditional generative adversarial networks. Ruff et al. [23] proposed Deep SAD, an end-to-end semi-supervised detection method for discovering anomalies. Semi-supervised methods learn the behavior patterns of a large number of positive samples and treat samples that deviate from them as negative. However, because positive samples can never be exhaustively enumerated, these methods suffer from high false positive rates.

2.2. Weakly Supervised Anomaly Detection

In recent years, some studies have focused on weakly supervised anomaly detection methods that use video-level annotations during training and frame-level annotations during validation [11,12,13,24,25,26,27]. This setting not only saves a great deal of annotation time but also produces frame-level predictions during inference, which makes it easy to localize the anomalous behavior. Sultani et al. [6] proposed a deep multi-instance learning model that uses a margin ranking loss to maximize the anomaly score gap between positive and negative samples. Zhu et al. [28] proposed a temporal enhancement network that performs global temporal modeling via an attention mechanism. Zhong et al. [29] proposed using a graph convolutional neural network to clean the noisy predictions obtained from video-level labels. Wu et al. [11] introduced a new anomaly detection dataset and constructed HL-Net, which extracts local and global features with graph convolution and detects anomalies by fusing them. Pu et al. [30] proposed the DDL method, which models relations between adjacent segments within the MIL framework. Inspired by their work, we propose a novel temporal relation aggregation network for video anomaly detection. The difference between our method and theirs is that we construct a new temporal modeling module, MTDA, in which different temporal modeling methods complement one another so that the fused features carry more information. In addition, we explicitly consider multi-modal input and use a multi-head attention mechanism to map different modalities into different semantic spaces.

3. Methods

The proposed temporal relation aggregation network is trained with weakly labeled videos to distinguish anomalous segments from normal ones. The core of the network is the MTDA module with three branches: the first is the multi-head attention branch (MHA), which maps video and audio features into different semantic spaces; the second is the temporal pyramid branch (DA), which extracts multi-scale information along the temporal dimension via dilated convolution blocks with different dilation rates; the third is the temporal shift branch (TS), which shifts part of the information of different frames along the temporal dimension to achieve temporal modeling. The overall structure is shown in Figure 1.

3.1. Notations and Preliminaries

We use V and y to denote the untrimmed video and its corresponding label. For convenience, we use the I3D network [31] and the VGG network [32] to extract the video and audio features, respectively. We use $X_V$ and $X_A$ to denote the extracted video and audio features, where $X_V \in \mathbb{R}^{T \times d_V}$ and $X_A \in \mathbb{R}^{T \times d_A}$; T denotes the number of segments extracted from the video; $d_V$ and $d_A$ denote the feature lengths of each video and audio segment, respectively; and $x_j^V$ and $x_j^A$ denote the j-th video and audio segment features, respectively. In addition, we use $X_S$ to denote the mixed-modal feature, $X_S \in \mathbb{R}^{T \times (d_V + d_A)}$. Following previous weakly supervised multiple instance detection methods [6,28,33,34], the sliding window for each video is set to Q frames; in our experiments, we set Q to 16. For training videos, only video-level labels are available. If any segment in a video contains anomalous behavior, the video is labeled as anomalous with y = 1; only when all segments are normal is the video labeled as normal with y = 0. For validation videos, not only is the video labeled as anomalous or not, but the specific location of the anomaly is also labeled. Therefore, in the validation phase, the goal of this task is to generate frame-level labels that determine the start and end positions of anomalous behaviors.
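As a minimal, illustrative sketch of this setup (the tensor sizes and names below are our own placeholders; d_V = 1024 and d_A = 128 are typical I3D and VGGish feature sizes rather than values stated in this section), the mixed-modal feature can be formed by concatenating the per-segment video and audio features along the feature dimension:

```python
import torch

# Illustrative shapes only: d_V = 1024 (I3D) and d_A = 128 (VGGish) are typical sizes,
# and T = 64 segments is an arbitrary example.
T, d_V, d_A = 64, 1024, 128
X_V = torch.randn(T, d_V)            # video features, one row per Q = 16-frame segment
X_A = torch.randn(T, d_A)            # audio features aligned to the same segments
X_S = torch.cat([X_V, X_A], dim=-1)  # mixed-modal feature, shape (T, d_V + d_A)
y = torch.tensor(1.0)                # video-level label: 1 if any segment is anomalous, else 0
```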

3.2. MTDA Module

In anomalous behavior detection, image information is the most intuitive cue. We can often tell whether a video is anomalous from the image alone; for example, if there are explosions or shootings in the video, it is certainly anomalous behavior. However, sound is also important supplementary information. For example, a gas leak makes a hissing sound, but such a leak cannot be detected visually. Therefore, we use both the image and audio modalities of the data.
MHA branch. Pu et al. captured global and local dependencies of segments via self-attention mechanisms and masks to model complete temporal relationships [27]. Inspired by their use of attention for global modeling, we also adopt an attention mechanism for global modeling. Unlike Pu et al., who used only RGB features, we use features from both the image and audio modalities; hence, we adopt the Multi-head Attention mechanism (MHA) [35] for global modeling. Multi-head attention was originally applied in natural language processing, where it projects the input into several different subspaces and concatenates the outputs of the individual heads. We use it as our MHA branch, through which we extract the global information of the video and map the mixed-modal features into different semantic spaces. This avoids over-focusing attention on a position's own features and gives our model an explicit way to exploit the multi-modal features. We use Q, K, and V to denote the query, key, and value vectors, respectively, which are computed as shown in Equation (1):
$$Q_i = X_S W_i^Q, \qquad K_i = X_S W_i^K, \qquad V_i = X_S W_i^V$$
where $i$ indexes the attention head, and $W_i^Q$, $W_i^K$, and $W_i^V$ are the corresponding weight matrices. We use Attention to denote the attention function, which is given by
$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
where we set $d_k$ as the number of attention heads. Finally, we use M to denote the output of the whole MHA branch, which is given as follows:
$$M = \mathrm{Concat}\big(\mathrm{Attention}(Q_1, K_1, V_1),\, \mathrm{Attention}(Q_2, K_2, V_2),\, \ldots,\, \mathrm{Attention}(Q_i, K_i, V_i)\big)\, W^O$$
where $W^O$ is the output weight matrix.
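A minimal PyTorch sketch of this branch, assuming the mixed-modal features have already been projected to the 512-dimensional hidden size mentioned in Section 4.2 and relying on torch.nn.MultiheadAttention to realize the equations above; the class and variable names are ours, not the authors' code:

```python
import torch
import torch.nn as nn

class MHABranch(nn.Module):
    """Global temporal modeling over the mixed-modal sequence with multi-head self-attention."""
    def __init__(self, dim=512, num_heads=2):   # 2 heads performed best in the ablation (Section 4.4)
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)

    def forward(self, x):              # x: (B, T, dim) projected mixed-modal segment features
        # Q, K, V are all linear projections of x (Equation (1)); each head attends globally over T.
        m, _ = self.attn(x, x, x)      # per-head outputs are concatenated and multiplied by W^O internally
        return m                       # (B, T, dim)
```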
DA branch. The MHA branch extracts the global information of the video and maps the mixed-modal features into different semantic spaces, but it ignores the extraction of local information. To extract the local feature information of the video, we design the DA branch. Yu et al. extracted temporal features at different scales via dilated convolutions with different dilation rates and showed that fusing multi-scale temporal features gives better results than a single scale [36]. Motivated by their work, we use two dilated convolutions with different dilation rates to extract temporal information at different scales. The original features are first padded by 4 and 12, respectively; then the two dilated convolutions with dilation rates of 4 and 12 are applied; finally, their outputs are fused by a weighted average as follows:
$$D = 0.5\,\mathrm{Norm}\big(\text{D-Conv1}(X_S)\big) + 0.5\,\mathrm{Norm}\big(\text{D-Conv2}(X_S)\big)$$
where Norm represents LayerNorm, D-Conv1 represents the dilated convolution with a dilation rate of 4, and D-Conv2 represents the dilated convolution with a dilation rate of 12.
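A hedged sketch of the DA branch under the stated dilation rates (4 and 12) and paddings (4 and 12); the kernel size of 3 and the 512-dimensional feature size are our assumptions, chosen so that the paddings preserve the sequence length:

```python
import torch
import torch.nn as nn

class DABranch(nn.Module):
    """Local multi-scale temporal features via two dilated 1D convolutions, fused by a weighted average."""
    def __init__(self, dim=512, kernel_size=3):   # kernel_size=3 is our assumption; the paper states only the dilations
        super().__init__()
        self.dconv1 = nn.Conv1d(dim, dim, kernel_size, dilation=4,  padding=4)   # smaller-scale temporal context
        self.dconv2 = nn.Conv1d(dim, dim, kernel_size, dilation=12, padding=12)  # larger-scale temporal context
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (B, T, dim)
        z = x.transpose(1, 2)                       # Conv1d expects (B, dim, T)
        d1 = self.norm1(self.dconv1(z).transpose(1, 2))
        d2 = self.norm2(self.dconv2(z).transpose(1, 2))
        return 0.5 * d1 + 0.5 * d2                  # equal weighting, the best setting in Table 3
```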
TS branch. After extracting the local and global features of the video, we introduce a lightweight adjacent-frame fusion branch to further promote information flow; it combines the features of adjacent frames on a local scale. The TSM module [1] was originally proposed to obtain the temporal modeling ability of 3D convolution at the cost of 2D convolution: it shifts part of the channels along the time dimension, which facilitates information exchange between neighboring frames and can be inserted into a 2D CNN to achieve temporal modeling with zero extra computation and zero extra parameters. Inspired by this work, we implement a TS branch suitable for 1D sequence features by shifting the data along the time axis so that the features of adjacent frames are combined. The shift operation is as follows:
$$X_j^{-1} = X_{j-1}, \qquad X_j^{0} = X_j, \qquad X_j^{+1} = X_{j+1}$$
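A minimal sketch of such a 1D temporal shift; the choice to shift 1/8 of the channels in each direction is our reading of the shift ratio of 8 reported in Section 4.2, not an exact reproduction of the authors' code:

```python
import torch

def temporal_shift(x, fold_div=8):
    """Shift part of the channels along the time axis so each segment mixes with its neighbors.

    x: (B, T, C) segment features. fold_div=8 corresponds to the shift ratio of 8 in Section 4.2
    (interpreted here as shifting 1/8 of the channels in each direction); as in TSM, the operation
    adds zero parameters.
    """
    B, T, C = x.shape
    fold = C // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # this channel group takes features from segment j-1
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # this group takes features from segment j+1
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels stay in place
    return out
```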
In the classifier, we concatenate the features extracted by the three branches along the channel dimension. We also expected residual connections to improve feature expressiveness, but in our experiments, adding a residual connection only to the MHA branch gave the best results. We use T to denote the output of the TS branch. The output of the classifier is defined as follows:
$$\mathrm{Out} = \mathrm{Linear}\big(\mathrm{Norm}\big(\mathrm{Cat}(X_S + M,\, D,\, T)\big)\big)$$
where Cat represents the concatenation along the channel dimension, Norm represents LayerNorm, and Linear represents the linear layer.
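Putting the three branch outputs together, a sketch of this fusion step might look as follows; it assumes all inputs share the same hidden size so that the residual connection on the MHA branch is well defined, and the module name and output width are illustrative:

```python
import torch
import torch.nn as nn

class MTDAFusion(nn.Module):
    """Fuse the three branch outputs: Linear(Norm(Cat(X_S + M, D, T)))."""
    def __init__(self, dim=512, out_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(3 * dim)
        self.fc = nn.Linear(3 * dim, out_dim)

    def forward(self, x_s, m, d, t):                    # each tensor: (B, T, dim)
        fused = torch.cat([x_s + m, d, t], dim=-1)      # residual connection applied only to the MHA branch
        return self.fc(self.norm(fused))                # (B, T, out_dim)
```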

3.3. Training Based on MIL

Following the MIL principle [6,37], we use the average of the K largest activations along the temporal dimension, instead of all activations, to compute the prediction, where $K = \lfloor T/Q + 1 \rfloor$. We then map the prediction to the (0, 1) interval with the Sigmoid function and compute the loss between the prediction and the true label using the binary cross-entropy loss.
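A compact sketch of this MIL objective (function and variable names are ours, not the authors'):

```python
import torch
import torch.nn.functional as F

def mil_topk_bce_loss(logits, video_labels, Q=16):
    """MIL objective: average the K largest per-segment scores, then apply BCE against the video label.

    logits: (B, T) raw per-segment anomaly scores; video_labels: (B,) in {0, 1}.
    K = floor(T / Q) + 1, with the sliding-window size Q = 16 used in the paper.
    """
    B, T = logits.shape
    k = T // Q + 1
    topk = torch.topk(logits, k=k, dim=1).values   # K largest activations per video
    video_logit = topk.mean(dim=1)                 # video-level prediction
    prob = torch.sigmoid(video_logit)              # map to the (0, 1) interval
    return F.binary_cross_entropy(prob, video_labels.float())
```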

4. Experiments

4.1. Data Sets and Evaluation Measure

MTDA-Net is evaluated on the XD-Violence dataset [11], which was created for weakly supervised video anomaly detection. The dataset is collected from movies and YouTube (in-the-wild scenes). There are 91 movies in total; violent movies are used to collect both violent and non-violent events, while non-violent movies are used only to collect non-violent events. To prevent a violence detection system from judging violent events based on scene background rather than the event itself, the authors specifically collected a large number of non-violent videos whose backgrounds are consistent with those of the violent videos. The total duration of the dataset is over 217 h, and it contains 4754 untrimmed videos, with video-level labels in the training set and frame-level labels in the test set. It is the largest public video anomaly detection dataset to date. Following previous methods [11,13], we use the precision–recall curve (PRC) and the area under it (average precision, AP) [38] as the evaluation metrics for XD-Violence.

4.2. Implementation Details

Our hardware is an NVIDIA RTX 4090 GPU with 24 GB of memory; we use the PyTorch deep learning framework and Python 3.8. For all experiments, we set the learning rate to 0.0001 and the batch size to 128. In the TS branch, we set the shift ratio to 8. In the MHA branch, we set the hidden layer size to 512. In the DA branch, the dilation rates of the two dilated convolutions, one small and one large, are set to 4 and 12, respectively. In the classifier, each linear layer is followed by a GELU activation function and Dropout, with the Dropout ratio set to 0.1.
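For illustration, the classifier blocks described here can be sketched as follows; only the GELU activation and the Dropout ratio of 0.1 come from the text, while the layer widths are placeholders:

```python
import torch.nn as nn

# Minimal sketch of the classifier blocks described above: each hidden linear layer is
# followed by GELU and Dropout(0.1). The widths (512 -> 128 -> 1) are placeholders.
classifier = nn.Sequential(
    nn.Linear(512, 128),
    nn.GELU(),
    nn.Dropout(p=0.1),
    nn.Linear(128, 1),   # per-segment anomaly logit fed to the MIL objective in Section 3.3
)
```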

4.3. Results on XD-Violence

This paper compares MTDA-Net to the following baselines:
  • Multiple instance learning ranking (MIL-Rank) [6] framework by leveraging weakly labeled training videos.
  • Holistic and localized network (HL-Net) [11] that explicitly exploits relations of snippets and learns powerful representations.
  • Robust Temporal Feature Magnitude learning (RTFM) [13] trains a feature magnitude learning function to effectively recognize the positive instances, substantially improving the robustness of the MIL approach to the negative instances from abnormal videos.
  • Causal Temporal Relation and Feature Discrimination (CRFD) [26] consists of four modules to leverage the effect of the temporal cue and feature discrimination.
  • Normality Guided Multiple Instance Learning (NG-MIL) [39] framework encodes diverse normal patterns from noise-free normal videos into prototypes for constructing a similarity-based classifier.
  • Self-supervised sparse representation (S3R) [40] framework models the concept of the anomaly at the feature level by exploring the synergy between dictionary-based representation and self-supervised learning.
  • Discriminative Dynamics Learning (DDL) [30] method has two objective functions, i.e., a dynamics ranking loss and a dynamics alignment loss.
  • Uncertainty Regulated Dual Memory Units (UR-DMU) [25] model can learn both the representations of normal data and discriminative features of abnormal data.
  • Modality-aware contrastive instance learning with self-distillation (MACIL-SD) [12] focuses on modality heterogeneity.
  • Contrastive Attention Video Anomaly Detection (CA-VAD) [41] fully utilizes enough normal videos to train a classifier with a good discriminative ability for normal videos.
  • Multi-Sequence Learning (MSL) [42] uses a sequence composed of multiple snippets as an optimization unit.
As shown in Table 1, compared with the unsupervised methods, our method is 33.66% higher than the SVM baseline on AP. Compared with the weakly supervised methods, the proposed method is 2.78% higher than the RGB-only method UR-DMU [25]. Using the same multimodal data, our method is 1.04% higher than MACIL-SD [12], which uses a dual-stream network to generate audio and visual bags, and 3.01% higher than the method that uses a multi-head classification module [43]. For MIL-based anomalous behavior classification, our multi-module feature fusion greatly improves the effectiveness of network learning. The PRC curve is shown in Figure 2.

4.4. Ablation Studies

We performed extensive ablation studies to validate the contribution of each component of MTDA-Net.
(1) The contribution of each component: Table 2 shows the contribution of each component, where "Baseline" refers to the MLP classifier without any of the three branches. It is noteworthy that the DA branch brings a 5.36% improvement, which is higher than the 4.79% improvement brought by the MHA branch and the 3.22% improvement brought by the TS branch. We attribute this to the multi-scale information inherent in the DA branch. In addition, we find that fusing different branches brings further accuracy improvements, which shows that the information from the different branches is complementary. Moreover, the fusion of the MHA branch and the DA branch performs much better than any other two-branch combination, reaching 83.43%. We think this is because the global features extracted by the MHA branch and the multi-scale local features extracted by the DA branch differ the most, so they complement each other best. In contrast, the fusion of the TS branch and the DA branch only reaches 80.86%, because both branches focus on local feature extraction and ignore global features. This also shows that global and local features are equally important for anomalous behavior detection.
(2) The contribution of the MHA branch: Figure 3 shows the contribution of the MHA branch under different numbers of attention heads. With two attention heads we achieve the best performance, and the AP is 1.41% higher than with a single head. With more than two heads, additional heads do not bring further improvement and the AP flattens out. We think this is because image and sound information are input together and mapped to different semantic spaces, which brings higher discriminability; more attention heads disrupt this relationship, leading to a decrease in AP.
(3) The contribution of the DA branch: unlike the MHA branch, which extracts global features, the DA branch fuses local features at different scales and achieves the best effect among the single branches, which demonstrates the effectiveness of multi-scale local feature fusion. To further explore the fusion within the DA branch, we set different fusion ratios for the two dilated convolutions; the experimental results are shown in Table 3. The results show that the best effect is achieved when the two features are weighted equally. At the same time, we find that when the ratio of the smaller-scale dilated convolution D-Conv1 is 0.9, the AP is 0.95% higher than when its ratio is 0.1. We think the smaller-scale convolution contributes more in the DA branch because global features are already provided by the MHA branch, so when the branches are combined, the smaller-scale dilated convolution plays a more important role. In addition, we tried setting the ratio coefficient as a learnable parameter, but this did not achieve better results.
(4) The contribution of the TS branch: the TS branch performs temporal modeling by shifting data along the time axis. We conducted detailed ablation experiments on the shift direction, where Right means shifting toward the right side of the feature and Left means shifting toward the left side. The experimental results are shown in Table 4. Shifting only to the right and shifting only to the left give similar results, while shifting in both directions gives the best result.

4.5. Qualitative Results

To further demonstrate the effectiveness of MTDA-Net, we visualize the anomaly scores on the validation set of XD-Violence. As shown in Figure 4, we randomly sampled four videos (Bad.Boys.1995 #01-33, Black.Hawk.Down2001 #01-13, Brick.Mansions2014 #00-16, and Black.Hawk.Down2001 #01-32). MTDA-Net assigns small scores to normal events and large scores to anomalous events. In addition, as shown for the videos Black.Hawk.Down2001 #01-32 and Black.Hawk.Down2001 #01-13, MTDA-Net predicts the anomaly scores of multi-segment and long-term anomalous events in XD-Violence with reasonable accuracy. Furthermore, as can be seen from the figure, the anomaly score tends to rise along a certain curve, and even within the labeled anomalous regions there is a significant portion where the anomaly score is not very high; by watching the original videos, we found that a portion of normal footage indeed exists even within the anomalous regions labeled in the dataset.

5. Conclusions

In this paper, we propose a novel multi-modal temporal relation aggregation network, MTDA-Net, for weakly supervised anomalous behavior detection, which achieves an AP of 84.44% on the XD-Violence dataset. We construct a new MTDA module consisting of three complementary branches, MHA, DA, and TS; the module is plug-and-play, works well with multi-modal data, and significantly improves the accuracy of anomaly localization. Furthermore, we conduct detailed ablation experiments on the XD-Violence dataset, which fully validate the effectiveness of the proposed MTDA module. The experimental results demonstrate the validity of the proposed anomalous behavior detection method.
Based on the good results of the proposed MTDA-Net model, we believe that weakly supervised learning has great potential for further development in abnormal behavior detection. In MTDA-Net, the synergistic effect of multiple branches brings higher AP, but at the same time, it also brings more parameters. Simplifying MTDA-Net and using knowledge distillation to obtain lighter models is a direction of our future work.

Author Contributions

H.W.: Conceptualization, Methodology, and Writing—review and editing. M.Y.: Investigation, Data collection, Data interpretation, Software, and Writing—original draft. F.W.: Supervision and Writing—review and editing. G.S.: Validation and Visualization. W.J.: Validation and Writing—reviewing. Y.Q.: Funding acquisition and Resources. H.D.: Investigation and Software. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62272163), the Key Research Projects of Henan Higher Education Institutions (No. 23A520031), the Open Foundation of the Henan Key Laboratory of Cyberspace Situation Awareness (No. HNTS2022005), and the Key Research Projects of Henan Higher Education Institutions (No. 24A520020).

Data Availability Statement

The data used in this study are openly available at https://roc-ng.github.io/XD-Violence (accessed on 1 October 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
  2. Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
  3. Fu, W.; An, Z.; Huang, W.; Sun, H.; Gong, W.; Gonzàlez, J. A Spatio-Temporal Spotting Network with Sliding Windows for Micro-Expression Detection. Electronics 2023, 12, 3947. [Google Scholar] [CrossRef]
  4. Al-Dhamari, A.; Sudirman, R.; Mahmood, N.H.; Khamis, N.H.; Yahya, A. Online video-based abnormal detection using highly motion techniques and statistical measures. TELKOMNIKA (Telecommun. Comput. Electron. Control) 2019, 17, 2039–2047. [Google Scholar] [CrossRef]
  5. Antić, B.; Ommer, B. Video parsing for abnormality detection. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2415–2422. [Google Scholar]
  6. Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]
  7. Wang, L.; Zhou, F.; Li, Z.; Zuo, W.; Tan, H. Abnormal event detection in videos using hybrid spatio-temporal autoencoder. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2276–2280. [Google Scholar]
  8. Smeureanu, S.; Ionescu, R.T.; Popescu, M.; Alexe, B. Deep appearance features for abnormal behavior detection in video. In Proceedings of the Image Analysis and Processing-ICIAP 2017: 19th International Conference, Catania, Italy, 11–15 September 2017; Part II 19. Springer: Berlin/Heidelberg, Germany, 2017; pp. 779–789. [Google Scholar]
  9. Akcay, S.; Atapour-Abarghouei, A.; Breckon, T.P. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Proceedings of the Computer Vision—ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part III 14. Springer: Berlin/Heidelberg, Germany, 2018; pp. 622–637. [Google Scholar]
  10. Demarty, C.H.; Penet, C.; Soleymani, M.; Gravier, G. VSD, a public dataset for the detection of violent scenes in movies: Design, annotation, analysis and evaluation. Multimed. Tools Appl. 2015, 74, 7379–7404. [Google Scholar] [CrossRef]
  11. Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 322–339. [Google Scholar]
  12. Yu, J.; Liu, J.; Cheng, Y.; Feng, R.; Zhang, Y. Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6278–6287. [Google Scholar]
  13. Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4975–4986. [Google Scholar]
  14. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 156–165. [Google Scholar]
  15. Farha, Y.A.; Gall, J. Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3575–3584. [Google Scholar]
  16. Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access 2017, 6, 1155–1166. [Google Scholar] [CrossRef]
  17. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
  18. Lee, S.; Kim, H.G.; Ro, Y.M. BMAN: Bidirectional multi-scale aggregation networks for abnormal event detection. IEEE Trans. Image Process. 2019, 29, 2395–2408. [Google Scholar] [CrossRef] [PubMed]
  19. Mehran, R.; Oyama, A.; Shah, M. Abnormal crowd behavior detection using social force model. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 935–942. [Google Scholar]
  20. Zhao, B.; Fei-Fei, L.; Xing, E.P. Online detection of unusual events in videos via dynamic sparse coding. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3313–3320. [Google Scholar]
  21. Lu, C.; Shi, J.; Jia, J. Abnormal event detection at 150 fps in matlab. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 2720–2727. [Google Scholar]
  22. Li, W.; Mahadevan, V.; Vasconcelos, N. Anomaly detection and localization in crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 18–32. [Google Scholar]
  23. Ruff, L.; Vandermeulen, R.A.; Görnitz, N.; Binder, A.; Müller, E.; Müller, K.R.; Kloft, M. Deep semi-supervised anomaly detection. arXiv 2019, arXiv:1906.02694. [Google Scholar]
  24. Pu, Y.; Wu, X. Audio-guided attention network for weakly supervised violence detection. In Proceedings of the 2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 14–16 January 2022; pp. 219–223. [Google Scholar]
  25. Zhou, H.; Yu, J.; Yang, W. Dual Memory Units with Uncertainty Regulation for Weakly Supervised Video Anomaly Detection. arXiv 2023, arXiv:2302.05160. [Google Scholar] [CrossRef]
  26. Wu, P.; Liu, J. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Trans. Image Process. 2021, 30, 3513–3527. [Google Scholar] [CrossRef] [PubMed]
  27. Pu, Y.; Wu, X.; Wang, S. Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection. arXiv 2023, arXiv:2306.14451. [Google Scholar]
  28. Zhu, Y.; Newsam, S. Motion-aware feature for improved video anomaly detection. arXiv 2019, arXiv:1907.10211. [Google Scholar]
  29. Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1237–1246. [Google Scholar]
  30. Pu, Y.; Wu, X. Locality-Aware Attention Network with Discriminative Dynamics Learning for Weakly Supervised Anomaly Detection. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  31. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  32. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  33. Lv, H.; Zhou, C.; Cui, Z.; Xu, C.; Li, Y.; Yang, J. Localizing anomalies from weakly-labeled videos. IEEE Trans. Image Process. 2021, 30, 4505–4515. [Google Scholar] [CrossRef] [PubMed]
  34. Zhang, J.; Qing, L.; Miao, J. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar]
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  36. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  37. Paul, S.; Roy, S.; Roy-Chowdhury, A.K. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 563–579. [Google Scholar]
  38. Perez, M.; Kot, A.C.; Rocha, A. Detection of real-world fights in surveillance videos. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2662–2666. [Google Scholar]
  39. Park, S.; Kim, H.; Kim, M.; Kim, D.; Sohn, K. Normality Guided Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2665–2674. [Google Scholar]
  40. Wu, J.C.; Hsieh, H.Y.; Chen, D.J.; Fuh, C.S.; Liu, T.L. Self-supervised sparse representation for video anomaly detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 729–745. [Google Scholar]
  41. Chang, S.; Li, Y.; Shen, S.; Feng, J.; Zhou, Z. Contrastive attention for video anomaly detection. IEEE Trans. Multimed. 2021, 24, 4067–4076. [Google Scholar] [CrossRef]
  42. Li, S.; Liu, F.; Jiao, L. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 1395–1403. [Google Scholar]
  43. Zhang, C.; Li, G.; Qi, Y.; Wang, S.; Qing, L.; Huang, Q.; Yang, M.H. Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16271–16280. [Google Scholar]
  44. Schölkopf, B.; Williamson, R.C.; Smola, A.; Shawe-Taylor, J.; Platt, J. Support vector method for novelty detection. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999. [Google Scholar]
  45. Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar]
Figure 1. The overall structure of the MTDA-Net model, where the white blocks represent Padding and the other colored blocks represent normal features. D-Conv1 represents a dilated convolution with a dilation rate of 4, D-Conv2 represents a dilated convolution with a dilation rate of 12, and + represents weighted average.
Figure 2. PRC curve of MTDA-Net on XD-Violence.
Figure 3. Contribution of the MHA branch.
Figure 4. Qualitative results of MTDA-Net on randomly sampled videos from the XD-Violence validation set. In each sample, the green and red boxes in the first row represent normal and anomalous key frames, respectively; in the second row, the x-axis is the video frame number, the y-axis is the anomaly score, the blue line shows the predicted anomaly score, and the pink window indicates the labeled anomalous region.
Table 1. Performance comparison on the XD-Violence dataset.
Supervision        | Method             | Feature     | AP (%)
Unsupervised       | SVM baseline       | -           | 50.78
Unsupervised       | OCSVM [44]         | -           | 27.25
Unsupervised       | Hasan et al. [45]  | -           | 30.77
Weakly Supervised  | MIL-Rank [6]       | C3D RGB     | 73.20
Weakly Supervised  | HL-Net [11]        | I3D RGB     | 75.44
Weakly Supervised  | CA-VAD [41]        | I3D RGB     | 76.90
Weakly Supervised  | RTFM [13]          | I3D RGB     | 77.81
Weakly Supervised  | CRFD [26]          | I3D RGB     | 75.90
Weakly Supervised  | MSL [42]           | I3D RGB     | 78.28
Weakly Supervised  | NG-MIL [39]        | I3D RGB     | 78.51
Weakly Supervised  | S3R [40]           | I3D RGB     | 80.26
Weakly Supervised  | DDL [30]           | I3D RGB     | 80.72
Weakly Supervised  | Zhang et al. [43]  | I3D+VGGish  | 81.43
Weakly Supervised  | UR-DMU [25]        | I3D RGB     | 81.66
Weakly Supervised  | MACIL-SD [12]      | I3D+VGGish  | 83.40
Weakly Supervised  | Ours               | I3D+VGGish  | 84.44
Table 2. Contributions of different branches of MHA, TS, and DA. ✘ indicates that the branch is not used in the method and ✔ indicates that the branch is used in the method.
Baseline | MHA | TS | DA | XD-Violence AP (%)
✔        | ✘   | ✘  | ✘  | 74.84
✔        | ✔   | ✘  | ✘  | 79.63
✔        | ✘   | ✔  | ✘  | 78.06
✔        | ✘   | ✘  | ✔  | 80.20
✔        | ✔   | ✔  | ✘  | 82.58
✔        | ✔   | ✘  | ✔  | 83.43
✔        | ✘   | ✔  | ✔  | 80.86
✔        | ✔   | ✔  | ✔  | 84.44
Table 3. Contributions of the DA branch, where α denotes learnable parameters.
D-Conv1 | D-Conv2 | XD-Violence AP (%)
α       | 1 − α   | 83.66
0.1     | 0.9     | 82.85
0.3     | 0.7     | 82.15
0.7     | 0.3     | 83.79
0.9     | 0.1     | 83.80
0.5     | 0.5     | 84.44
Table 4. Contributions of the TS branch. ✘ indicates not used in the method and ✔ indicates used in the method.
Right | Left | XD-Violence AP (%)
✔     | ✘    | 83.30
✘     | ✔    | 83.74
✔     | ✔    | 84.44
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, H.; Yang, M.; Wei, F.; Shi, G.; Jiang, W.; Qiao, Y.; Dong, H. Weakly-Supervised Video Anomaly Detection with MTDA-Net. Electronics 2023, 12, 4623. https://doi.org/10.3390/electronics12224623

AMA Style

Wu H, Yang M, Wei F, Shi G, Jiang W, Qiao Y, Dong H. Weakly-Supervised Video Anomaly Detection with MTDA-Net. Electronics. 2023; 12(22):4623. https://doi.org/10.3390/electronics12224623

Chicago/Turabian Style

Wu, Huixin, Mengfan Yang, Fupeng Wei, Ge Shi, Wei Jiang, Yaqiong Qiao, and Hangcheng Dong. 2023. "Weakly-Supervised Video Anomaly Detection with MTDA-Net" Electronics 12, no. 22: 4623. https://doi.org/10.3390/electronics12224623


