Next Article in Journal
An Image Object Detection Model Based on Mixed Attention Mechanism Optimized YOLOv5
Next Article in Special Issue
A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning
Previous Article in Journal
Research on Intelligent Disinfection-Vehicle System Design and Its Global Path Planning
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Deep Crowd Anomaly Detection by Fusing Reconstruction and Prediction Networks

Department of Information and Communication Technology, University of Agder, 4879 Grimstad, Norway
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(7), 1517; https://doi.org/10.3390/electronics12071517
Submission received: 31 January 2023 / Revised: 13 March 2023 / Accepted: 14 March 2023 / Published: 23 March 2023

Abstract

:
Abnormal event detection is one of the most challenging tasks in computer vision. Many existing deep anomaly detection models are based on reconstruction errors, where the training phase is performed using only videos of normal events and the model is then capable to estimate frame-level scores for an unknown input. It is assumed that the reconstruction error gap between frames of normal and abnormal scores is high for abnormal events during the testing phase. Yet, this assumption may not always hold due to superior capacity and generalization of deep neural networks. In this paper, we design a generalized framework (rpNet) for proposing a series of deep models by fusing several options of a reconstruction network (rNet) and a prediction network (pNet) to detect anomaly in videos efficiently. In the rNet, either a convolutional autoencoder (ConvAE) or a skip connected ConvAE (AEc) can be used, whereas in the pNet, either a traditional U-Net, a non-local block U-Net, or an attention block U-Net (aUnet) can be applied. The fusion of both rNet and pNet increases the error gap. Our deep models have distinct degree of feature extraction capabilities. One of our models (AEcaUnet) consists of an AEc with our proposed aUnet has capability to confirm better error gap and to extract high quality of features needed for video anomaly detection. Experimental results on UCSD-Ped1, UCSD-Ped2, CUHK-Avenue, ShanghaiTech-Campus, and UMN datasets with rigorous statistical analysis show the effectiveness of our models.

1. Introduction

Detection of abnormal events in automated video surveillance systems is one of the most challenging, overriding, and time-sensitive tasks. Recently, deep-learning-based algorithms have been dominating the literature as the deep learning solutions for crowd events detection have outperformed the conventional machine learning solutions. Motion and appearance features are widely used in video anomaly detection algorithms. In deep-learning-based video anomaly detection algorithms, a common technique is to build reconstruction model considering motion and/or appearance features. A common assumption is that the reconstruction error of the frame of normal event is small but that of the frame of abnormal event is large [1,2,3]. To learn normal data patterns of videos, the deep model is trained solely on videos of normal events. Consequently, during testing with videos of normal events, the deep model demonstrates its ability to show normal events with low reconstruction error, but the deep model suffers from exhibiting high reconstruction error needed for abnormal events. As a result, the error gap between the low reconstruction error and the high reconstruction error differentiates the normal and abnormal events in videos. Normally, research in this direction is targeted to increase this error gap [2,3]. In brief, a larger error gap plays the vital role to detect anomaly in videos.
A burning question is: can the reconstruction-based model guarantee the expected large reconstruction error (i.e., high error gap) of the anomaly? Liu et al. [2] claimed that the deep model trained by minimizing the reconstruction error of normal data cannot guarantee a higher reconstruction error of an abnormal event at the testing phase. Further, Gong et al. [4] stated that abnormal events may not correspond to larger reconstruction errors due to the improved capacity and generalization of deep neural network. Thus, reconstruction errors of normal and abnormal events will be indistinguishable, resulting in a very small error gap [2]. Both Gong et al. [4] and Park et al. [5] suggested the addition of a memory module for solving this pitfall. Nonetheless, the restricted memory cannot fully reveal the distinctiveness of normal events and the effective size of memory is not facile to find out [3]. To keep away from this problem, Zhong et al. [3] adopted a cascade reconstruction model to increase the reconstruction error of anomaly in videos. Motivated by the performance of the video prediction model of Mathieu et al. [6], Liu et al. [2] presented an appearance-motion model for video frame prediction that applied a U-Net structure [7] to predict a frame from a number of recent ones and then estimated the corresponding optical flow. Their model was optimized according to the difference between the output and original versions of video frame as well as the optical flow together with an adversarial loss.
In this paper, we design a generalized architecture (rpNet) as shown in Figure 1, which includes a group of different deep models. Each model integrates an rNet (an image frame reconstruction network or an appearance-only stream) and a pNet (a video frame prediction network or an appearance-motion stream), in which every stream possesses its own contribution for the task of detecting abnormal frames. Both streams can promise substantial anomaly scores. The fusion of outputs from two streams guarantees a certain degree of augmentation of the error gap. Our approach is inspired by the Zhong et al. [3] model but with distinct modules and designs. Primarily, Zhong et al. [3] applied a traditional autoencoder (AE) as an rNet and the squeeze-and-excitation network of Hu et al. [8] as a pNet to handle motion. Differently, we apply a convolutional AE (ConvAE) or a skip connected ConvAE (AEc) as an rNet and we adopt Liu et al.’s [2] future frame prediction model as a pNet to handle appearance and motion. The performance of a ConvAE in rNet is better than a traditional AE. The ConvAE extends the basic structure of the simple AE by changing the fully connected layers to convolution layers. The ConvAE is more suitable for the images as it uses a convolution layer. The reason of choosing Liu et al. [2] prediction model is that a fixed and optimized procedure of optical flow estimation (e.g., FlowNet [9]) is embedded in it. Mainly, Liu et al. [2] applied a traditional U-Net [7] as the heart of their model. We also employ a traditional U-Net [7] as the first option of our pNet. Aside from a traditional U-Net [7], we propose to use two more of its derivatives, namely a non-local block U-Net and an attention block U-Net (aUnet), for performance improvements.
A U-Net [7] is an improved CNN (convolutional neural network) model that can train data with fewer samples and segment images more accurately, but its efficiency and effectiveness can be limited by using the local operators (e.g., convolutions and down-sampling operators) only [10]. However, non-local blocks can strengthen the temporal and spatial characteristics and establish the long-distance dependencies of video frames [11]. Buades et al. [12] explained non-local mean operation, and later Wang et al. [11] wrapped the non-local operation into a non-local block. A new non-local block can be inserted in a U-Net [7] without breaking its initial behavior [10]. Because of this, Zhang et al. [13] adopted three non-local blocks in their U-Net frame prediction model to detect surveillance video anomaly. However, Wang et al. [11] showed that more non-local blocks lead to better performance. To this end, we adopt four non-local blocks in the U-Net architecture as the second option of our pNet. In addition to non-local blocks, attention mechanism puts down less fitting features and highlights more salient features. Oktay et al. [14] introduced attention gates in the intermediate layers of a U-Net architecture for pancreas segmentation. Yet, due to better breast-tumor segmentation performance in ultrasound images, Vakanski et al. [15] applied attention blocks at beginning layers of a U-Net architecture. Following Vakanski et al. [15], we propose an aUnet as the third option of our pNet. There exist some internal architectural differences at our proposed aUnet from Vakanski et al. [15]. For example, Vakanski et al. [15] employed external auxiliary inputs in the form of visual saliency maps, whereas we employ an internal motion saliency map and original video frame as inputs of the aUnet.
We presume that if any frame f t contains an appearance anomaly then our rNet can improve its determinability, whereas if f t contains an appearance-motion anomaly, then our pNet can improve its determinability. The rNet and pNet enforce both the reconstructed frame and the predicted frame to be close to their ground truth frame, respectively. Therefore, we combine the error scores of both networks to calculate the final anomaly score of each frame for detecting its anomalousness by considering the anomaly scores of consecutive multi-frames (e.g., past, present, and future frames). This also helps to exploit the persistent flow of abnormal events. In essence, we propose six deep models by combining two-alternative of rNets and three-alternative of pNets from our generalized framework in Figure 1: (1) AE-Unet (convolutional autoencoder and U-Net), (2) AEcUnet (convolutional autoencoder with skip connection and U-Net), (3) AEnUnet (convolutional autoencoder and non-local block U-Net), (4) AEcnUnet (convolutional autoencoder with skip connection and non-local block U-Net), (5) AEaUnet (convolutional autoencoder and attention block U-Net), and (6) AEcaUnet (convolutional autoencoder with skip connection and attention block U-Net). Although these models can provide an improved error gap for abnormal events, they have different degrees of feature extraction capabilities required for crowd video anomaly detection. Consequently, in experimental setups, some of these models showed inferior results, while others presented superior results. For example, AEcaUnet demonstrated the best results and outperformed its alternatives by both confirming better error gap and extracting high quality features from the available videos.
Our key contributions are summarized as follows:
  • We propose six different deep models for crowd anomaly detection by designing a generalized framework (rpNet).
  • We propose an aUnet (see Figure 2) for an option of the pNet of our rpNet architecture.
  • Experiments on five benchmark datasets and a rigorous statistical analysis demonstrate the potential of our models with competitive performance compared with the state-of-the-art models.
The rest of this paper is organized as follows. Section 2 addresses the most relevant previous studies. Section 3 overviews our generalized architecture of rpNet. Section 4 discusses the rNet of our rpNet. Section 5 illustrates the pNet of our rpNet. Section 6 exemplifies mainly the non-local block U-Net and our proposed aUnet. Section 7 illustrates anomaly detection on testing datasets. Section 8 hints a simulation to show that a larger error gap is guaranteed by rpNet. Section 9 explains experimental setup and results on publicly datasets. Section 10 compares our experimental results with the state-of-the-art methods. Section 11 makes a rigorous statistical analysis to find superiority among models. Section 12 concludes the paper.

2. Related Work

The related work can be classified into three groups, presented below.

2.1. Frame Reconstruction-Based Models

The following articles are primarily based on frame reconstruction and calculation of related errors. Xu et al. [16] proposed a multi-layer autoencoder (AE) for feature learning, which demonstrated the potency of deep learning features. Hasan  et al. [1] designed a three-dimensional convolutional autoencoder (ConvAE) for modeling regular frames. Chong et al. [17] took the advantages of both convolutional neural network (CNN) and recurrent neural network (RNN) for simultaneously modeling of the normal appearance and motion patterns. Luo et al. [18] proposed a temporally coherent sparse coding-based method, which can map to a stacked RNN framework. Sabokrou et al. [19] trained a generative adversarial network (GAN) similar to an adversarial network, in which a reconstruction component learned to reconstruct the normal test frames. However, all these reconstruction-based models assume that abnormal events can correspond to higher reconstruction errors, but this assumption may not necessarily hold always [2].

2.2. Frame Prediction-Based Models

The following studies mainly focus on how to predict future frames indirectly or directly. Shi et al. [20] modified the original long short-term memory (LSTM) along with a convolutional LSTM for precipitation forecasting. Mathieu et al. [6] proposed a multi-scale network with adversarial training for creating more natural future frames in videos. Giorno et al. [21] designed a deep model for detecting changes on a sequence of data from videos to see which frames were distinguishable from all the previous frames. By processing the video online Ionescu et al. [22] performed a similar work to that of Giorno et al. [21]. Lotter et al. [23] designed a deep predictive neural network for video prediction and unsupervised learning. Some studies (e.g., [24,25]) move to predict transformations required for creating future frames, which boosted the performance of video prediction to a greater extent. For example, Liu et al. [2] facilitated spatial and motion constraints for predicting future frame with normal events considering U-Net structure [7]. Their model also facilitated to detect those anomalies that do not agree the assumption. Their model was optimized according to the difference between the output and the original versions of video frame as well as the optical flow together with an adversarial loss. Doshi et al. [26] predicted the future video frame using previous video frames for video anomaly detection. To detect surveillance video anomaly, Zhang et al. [13] included the non-local block [11] in the U-Net [7] as a generator to generate high-quality prediction frames.

2.3. Reconstruction and Prediction-Based Models

The deep model of Nguyen et al. [27] consisted of three streams namely common encoder, appearance decoder, and motion decoder. Each stream had its own benefaction to detect exceptional frames. Basically, they combined a ConvAE for appearance along with a U-Net [7] for motion prediction. Their encoder was constructed by a sequence of blocks including convolution, batch-normalization, and leaky ReLU (rectified linear unit) activation. The decoder of their U-Net [7] had the same structure as the ConvAE except for the skip connections. Zhong et al. [3] proposed a cascaded model composed of a frame reconstruction network and an optical flow prediction network. By predicting optical flow based on reconstruction frame, their model increased the gap of prediction error of optical flow containing abnormal events.
The deep model of Liu et al. [28] composed of a prediction network, a reconstruction network, and a generative adversarial network (GAN). The prediction network integrated hybrid dilated convolution (HDC) [29] and DB-ConvLSTM [30] strategies to widen the gap between normal and abnormal events, while reconstruction network used an AE structure.
Inspired by the success of the video prediction model of Liu et al. [2], we adopt a U-Net structure to predict a frame from a number of recent ones and then estimate the corresponding optical flow. Similar to Liu et al. [2], a fixed procedure of optical flow estimation (e.g., FlowNet [9]) is embedded inside our pNet. The purpose of our rNet (i.e., ConvAE) is to learn the regular appearance structures. Table 1 summaries a qualitative comparison of the most relevant works.

3. Overview of the Generalized Architecture (rpNet)

Fundamentally, we design a generalized architecture named rpNet as depicted in Figure 1. It consists of two neural networks connected in parallel namely reconstruction network (rNet) and prediction network (pNet). The rNet depends on appearance only as it works with images, but the pNet relies on both appearance and motion as it works with video frames. The key difference between images and video frames is that the video frames are sequential and correlated, whereas the images are static. Video frames need to be measured in both space and time dimensions, but images need to be measured in space dimension. The rpNet includes information of both images and video frames simultaneously.
Machine learning methods in computer vision and image processing problems [37] have been applied for a good deal of research applications (e.g., [38,39,40,41,42,43,44,45,46,47,48,49,50,51]). Deep learning is a subset of machine learning that utilizes huge volumes of data and sophisticated algorithms for training a model. Nowadays, deep learning models are used to detect anomalies in various kinds of applications (e.g., [52,53,54,55,56,57]). The extraction of appropriate features plays a decisive role for detecting anomalies in deep learning models. Recently, due to powerful capability of deep learning models in reconstruction, it has unquestionably made advancement in abnormal event detection tasks. The video anomaly detection models (e.g., [1,52]) indicated that convolution is predominantly applied for extracting features. Thereupon, such structure scarcely encodes temporal dependencies in a long video sequence. Basically, our rNet is a convolutional autoencoder, which is similar to those models [1,52]. Figure 3 details the two variants of the presented block diagram of the rNet in Figure 1. In general, the rNet comprises an encoding path and a decoding path. The block diagram of the pNet in Figure 1 is typically a prediction network of Liu et al. [2] to predict future frames. One of its most important components is its generator, which is a traditional U-Net [7]. However, we propose to use either non-local block U-Net or attention block U-Net as discussed in Section 6.
In a nutshell, the rpNet both reconstructs the current frame using its rNet for scoring reconstruction error and predicts the future frame using its pNet for scoring prediction error in a parallel manner for anomaly detection by providing better error gaps via information fusion (e.g., see Section 8). Both rNet and pNet can show some degree of performance, but the performance of rpNet is better than that of either rNet or pNet individually. The straightforward simulation in Section 8 and later the experimental results support this proposition. Essentially, the rpNet brings about six separate models namely AE-Unet, AEcUnet, AEnUnet, AEcnUnet, AEaUnet, and AEcaUnet by combining the two-variant of rNets and the three-variant of pNets.
Figure 3a demonstrates encoder and decoder networks of our ConvAE without skip connection. The encoder network consists in a stack of four hidden layers with convolutional filters of 64, 128, 256, and 512, kernel sizes of 5, 5, 3, and 3, and strides of (1,2), (2,2), (2,2), and (2,2), respectively. Regarding the decoder network, it has four transposed convolutional layers that mirror the encoder layers. Due to the loss of some features, the reconstructed image of ConvAE may not match exactly with the input image. The difference L ( f , f ^ ) between the original input f and the reconstructed f ^ is called the reconstruction error. The learning process of ConvAE is to minimize the reconstruction error. Loss functions play an important role in achieving the desired reconstructed image.

4. Appearance-Only Stream

In this section, we discuss in detail our adopted two alternative reconstruction networks. A ConvAE is used to extract the salient features by performing filter operations on the original input image, whereas AEc boosts the performance with a notable margin.

4.1. ConvAE

The CNN has strong capability to learn spatial features [2]. Usually, a CNN consists of convolutional layer, activation layer, pooling layer, and up-sampling layer. It uses these layers to extract features from the two-dimensional (2D) data structure of images and then followed by the sub-sampling or pooling layer. We can add a dense or feedforward layer to the CNN for classification tasks, or we can add an upsampling layer to increase the resolution of the feature maps for image generation tasks. Activation layers consist of activation functions (e.g., ReLU, Sigmoid, Softmax, Tanh, and Linear), which introduce non-linearity into the deep neural network. Without this non-linearity, the deep neural networks are only able to perform the linear mapping between inputs and the outputs. The pooling or sub-sampling layers can reduce the spatial dimensions of the feature maps (e.g., from 256 × 256 to 128 × 128). Theoretically, we can eliminate the down/up sampling layers altogether. The up-sampling layer performs the reverse of the pooling layer. It is used to increase the dimensions of the incoming feature maps (e.g., from 128 × 128 to 256 × 256). The up-sampling layer is generally used in generative tasks. Leaky ReLU decreases certain positive values to 0 if they are close enough to zero. In dropout technique, randomly selected neurons are ignored during training. A good value for dropout in a hidden layer is between 0.50 and 0.80.
The AE is primarily used for image reconstruction. The AE that employs CNN mimics its input to its output as close as possible. It aims to take an input, transform it into a reduced representation called code or embedding. Then, this code or embedding is transformed back into the original input. The code is also called the latent-space representation. An AE consists of two leading parts namely an encoder and a decoder. Stacking encoders and decoders with multiple hidden layers can form a deep autoencoder. The encoder extracts features by gradually reducing the spatial resolution, whereas the decoder gradually recovers the frame by increasing the spatial resolution. The encoder maps the input into the code, whereas the decoder maps the code to a reconstruction of the input. Fully connected AE ignores 2D image structure [58]. The ConvAE extends the basic structure of the simple AE by changing the fully connected layers to convolution layers. The ConvAE is very suited for the images as it uses a convolution layer. The convolutional layers are excellent for extracting features from the images or other 2D data without modifying (reshaping) their structure. An encoder can employ convolutional layer, batch normalization layer, an activation function, and a max-pooling function for reducing the dimensions of the feature maps. After a specific number of layers, when the encoder is complete, the feature maps are flattened and a dense layer is used for the latent-space representation. The deconvolution is used for the up-sampling of the incoming feature maps, which is usually followed by the batch normalization and the activation function. Kernel size is one of important parameters in CNN. The smaller the size of kernel, the more effective the preserving details of the original image and the lower the computational cost of network. However, the extreme kernel size of 1 × 1 extracts local information from an image without considering spatial relationship of pixel. Therefore, we can set the size of kernel to 5 and 3 for both considering the spatial relationship of pixel and reducing computational cost.

4.2. Loss Function

The performance of ConvAE depends on input data and the loss function. The goal of training is to minimize the loss. When the main goal of the ConvAE is to solely reconstruct the input as accurate as possible, the loss function of MSE or Kullback–Leibler (KL) divergence [59] can be used. The intensity loss L i n t r of the reconstruction network can be calculated by Equation (1) on minimizing the distance measured by l 2 -norm between f t ^ and f t as:
L i n t r ( f t ^ , f t ) = | | f t ^ f t | | 2 2 .
The gradient loss L g d r of reconstruction network can be calculated by Equation (2) as:
L g d r ( f t ^ , f t ) = i , j | f t ^ i , j f t ^ i 1 , j | | f t i , j f t i 1 , j | 1 + | f t ^ i , j f t ^ i , j 1 | | f t i , j f t i , j 1 | 1 .

4.3. Replacing ConvAE by AEc

If ConvAE goes deeper or applying operations including max pooling, it cannot work very well even with deconvolution layers. A performance degradation problem is encountered when deeper networks start converging [60]. This is possibly due to the fact that a big amount of image details could be lost or corrupted during the convolution and the pooling. This drawback saturates the performance of the network as the depth of network expands. Specially, if the ConvAE encounters this type of problem, it is arduous to learn the details from the data. To minimize this problem, inspired by He et al. [60], we add skip connections between two corresponding convolutional and deconvolutional layers as shown in Figure 3b. The response from a convolutional layer is directly propagated to the corresponding mirrored deconvolutional layer, both forwardly and backwardly. The skip connections between the corresponding encoder and decoder layers allows the network to converge to a better optimum in pixel-wise prediction problems [61]. Let the outputs from the encoder layer and the corresponding decoder layer be O u t e l i and O u t d l i , respectively. The input to the next decoder layer I n d l i + 1 is calculated by Equation (3) as:
I n d l i + 1 = O u t e l i O u t d l i .
Each skip connection complements the data loss due to the data compression in the encoder part by combining the encoder convolutional layer output and the up-sampling output. Through skip connections, each feature map of the corresponding encoder and decoder are summed element-wise, which helps the network to recover the image well.

5. Appearance-Motion Stream

In this section, we discuss details of our prediction network and summarize the loss functions for optimization.
Only appearance constraints cannot guarantee to characterize the motion information well. Further, both spatial and temporal information is an important feature of videos. Inspired by Liu et al. [2], we used an optical flow constraint into the objective function to guarantee the motion consistency for normal events in training set, which further boosts the performance for anomaly detection. The pipeline of our video frame prediction network is shown in Figure 1, where we adopt a traditional U-Net [7] as generator to predict next frame. The traditional U-Net [7] is a fully convolutional neural network, and it uses convolutional and pooling layers. To reduce the number of parameters, it does not have any fully connected layer. It contains a contraction path and an expansion path. Its contraction path is employed to extract the features through the convolutional layer and downsampling. Its expansion path accurately locates and restores the information as much as possible. There is also a shortcut operation before each upsampling convolutional layer to concatenate the information.
To generate high quality image, we adopt the constraints in terms of appearance (e.g., intensity and gradient) as well as motion (e.g., optical flow) losses. Optical flow is a widely used estimator of motion. The Flownet [9] is a pre-trained network used to calculate optical flow. We also clenched the adversarial training to discriminate whether the prediction is real or fake. The aim of our appearance-motion stream is not only to predict frames to be close to their ground truth in spatial space but also to match the optical flow between predicted frames and the ground truth. In common, this stream is expected to associate typical motions to common appearance objects while ignoring the static background patterns.
Given a video with consecutive t frames as { f 1 , f 2 , , f t } . We predict the future video frame f t ^ using previous video frames { f 1 , f 2 , , f t 1 } . Following the work by Mathieu et al. [6], to make the predicted f t ^ close to its ground truth f t , we minimize their distances with reference to intensity and gradient. Following the work of Liu et al. [2], to preserve the temporal coherence between neighboring frames, we enforce the optical flow between f t and f t 1 as well as the optical flow between f t ^ and f t 1 to be close. We assume that normal events can be predicted very well. Therefore, we can include the difference between the predicted frame f t ^ and its ground truth f t for anomaly detection score. Following the work by Liu et al. [2], we employ a traditional U-Net [7], which serves as the main prediction network for a shortcut between a high-level layer and a low-level layer with the output resolution unchanged for each two convolution layers to decrease gradient vanishing and to increase information symmetry. The kernel sizes are configured to all convolution and deconvolution as 3 × 3 , and the max pooling layers as 2 × 2 .

5.1. Appearance Loss

To make the prediction close to its ground truth, following the work of Mathieu et al. [6], intensity and gradient difference can be employed.

5.1.1. Intensity Loss

Intensity loss is the l 1 -norm or l 2 -norm between the predicted frame f t ^ and its ground true f t , which is used to maintain similarity between pixels in the RGB space. By definition, the sum of the absolute values is the l 1 -norm, and the sum of squared values is the l 2 -norm. While the l 1 -norm increases at a constant rate, the l 2 -norm increases exponentially. Minimization of the norm encourages the weights to be small. Specifically, we minimize the distance measured by l 2 -norm between f t ^ and f t as intensity loss L i n t p of the prediction network by Equation (4) [2]:
L i n t p ( f t ^ , f t ) = | | f t ^ f t | | 2 2 .

5.1.2. Gradient Loss

There exists a flaw in calculating pixel intensity loss by l 2 -norm, which produces blur in the output. Henceforth, it is vital to apply gradient difference loss for sharpening the predicted frame f t ^ by using the l 1 -norm. As compared to l 2 -norm, l 2 -norm is more likely to reduce some weights to 0. The gradient loss L g d p of the prediction network can be calculated by Equation (5) as:
L g d p ( f t ^ , f t ) = i , j | f t ^ i , j f t ^ i 1 , j | | f t i , j f t i 1 , j | 1 + | f t ^ i , j f t ^ i , j 1 | | f t i , j f t i , j 1 | 1 ,
where f t i , j denotes the pixel at the i-th row and j-th column in f t , and | . | returns the absolute value.

5.1.3. Motion Loss

To detect anomaly the coherence of motion is an important factor for the evaluation of normal events. Only difference between intensity and gradient for future frame generation cannot guarantee to predict a frame with the correct motion. Optical flow is a good estimator of motion [62]. We adopt a temporal loss defined as the difference between optical flow of predicted frames and ground truth to improve the coherence of motion in the predicted frame. We employ the Flownet [9] denoted as F, which is a CNN-based approach for optical flow estimation. We consider that F is pre-trained on a synthesized dataset [9] and all the parameters in F are fixed. The motion loss L m o t in terms of optical flow can be measured by l 1 -norm using Equation (6) as:
L m o t ( f ^ t , f t , f t 1 ) = | | F ( f ^ t , f t 1 ) F ( f t , f t 1 ) | | 1 .

5.1.4. Adversarial Generator Loss

Usually, a generative adversarial network (GAN) contains a generator G and a discriminator D. The G learns to generate frames that are hard to be classified by D. Similar to Liu et al. [2], we use a U-Net-based prediction network as G. As for D, we follow Isola et al. [63] and utilize a patch discriminator, which means each output scalar of D corresponds a patch of an input image. The goal of training D is to classify f t into class 1 (i.e., genuine label) and G ( f 1 , f 2 , , f t 1 ) = f ^ t into class 0 (i.e., fake label), respectively. The goal of training G is to generate frames, whereas D classify them into class 1. The adversarial generator loss L a d g is minimized to confuse D as much as possible such that it cannot discriminate the generated predictions, and is given by the MSE loss function as:
L a d g ( f t ^ ) = i , j 1 2 L M S E ( D ( f t ^ i , j ) , 1 ) ,
where D ( f i , j ) = 1 denotes real decision by D for patch ( i , j ) , D ( f t ^ i , j ) = 0 indicates fake decision, and L M S E is the mean squared error function.

5.2. Minimization Objective Function

We combine the losses on appearance, motion, and adversarial training to obtain the following minimization objective function:
L ( f t , f t 1 , f ^ t , f ^ t ) = λ i n t p L i n t p ( f t ^ , f t ) + λ g d p L g d p ( f t ^ , f t ) + λ i n t r L i n t r ( f t ^ , f t ) + λ g d r L g d r ( f t ^ , f t ) + λ m o t L m o t ( f ^ t , f t , f t 1 ) + λ a d g L a d g ( f t ^ ) ,
where λ i n t p , λ g d p , λ i n t r , λ g d r , λ m o t , and λ a d g are the corresponding training time weights for the losses. To train the model, the intensity of pixels in all frames can be normalized (e.g., [−1, 1]). An Adam [64]-based stochastic gradient descent method can be used for parameter optimization.

6. Replacement of Traditional U-Net

In this section, we discuss two alternative replacements of basic U-Net.

6.1. Replacing Basic U-Net by Non-Local Block U-Net

The non-local mean value at a given pixel is the weighted average of all pixels in an image, but the kind of weights depend on the likeness between pixels, i.e., similar pixel neighborhoods have bigger weights. For example, considering Figure 4, to calculate the non-local mean at a pixel in Region 1, due to the similarity, the pixels in Region 4 and Region 9 obtain larger weights compared with the rest of seven regions. Similarly, for Region 3, the pixels in Region 5 and Region 10 obtain larger weights than those in the rest of the seven regions, and so on. Thus non-local mean preserves long distance dependence as indicated by arrows.
Buades et al. [12] explained non-local mean operation. Wang et al. [11] proposed a generic non-local operation as:
y i = 1 C ( x ) j f ( x i , x j ) g ( x j ) ,
where x and y denote the input and output signals, respectively. Here, i and j indicate the index of an output position in space-time and the index of enumerating all possible positions, respectively. The pairwise function f determines the scalar between i and all j, while the unary function g computes a representation of the input signal at j. In the end, y is obtained following a normalization by the factor of C ( x ) .
Wang et al. [11] also wrapped the non-local operation shown in Equation (9) into a non-local block that can be embedded in many existing pre-trained networks including U-Net [7] without affecting its standard behavior. Unlike fully connected layers that are frequently used at the end, a non-local block can be added into the earlier part of deep neural networks—resulting a combination of both local and non-local information in an ample hierarchy. A non-local block can be defined by Equation (10) as [11]:
z i = W z y j + x i ,
where W z belongs to a weight matrix and “+” denotes a residual connection.
Figure 5 depicts a space-time non-local block [11] with the embedded Gaussian. The input feature maps are presented as their tensors with the shape of T × H × W × C , i.e., the input dimension of X is T × H × W × C . The green colored boxes indicate 1 × 1 × 1 convolution. This space-time non-local block is similar to the block in the architecture of ResNet [60]. So, the non-local operation can be easily inserted into the existing network structure. However, the convolution is performed using a convolution kernel with a size of 1 × 1 × 1 to obtain the outputs of three branches ( θ , φ , and g ) with the dimension of T × H × W × ( C / 2 ) . Afterwards, three outputs of these branches with dimension of T H W × ( C / 2 ) are obtained through tensor to matrix conversion process. The output of the φ branch is transposed, and then this output and the output of the θ branch are multiplied using matrix multiplication rule to obtain the output dimension of T H W × T H W . Subsequently, the SoftMax operation is performed on each row. Later, the matrix multiplication with the output of the g branch is performed to obtain the output dimension of T H W × ( C / 2 ) . A reshaping of T H W × ( C / 2 ) is carried out through matrix to tensor conversion process for getting the output dimension of T × H × W × ( C / 2 ) . The output dimension of T × H × W × C from the 1 × 1 × 1 convolution layer and the original input dimension of T × H × W × C perform element-wise summation to achieve the final output Z. This element-wise summation is similar to the residual connection in the ResNet [60]. Two optimization techniques are applied to improve the computational efficiency of the non-local block: (1) The number of convolution kernels for θ , φ , and g operations is set to the half of the number of input feature map channels (i.e., C / 2 ); (2) The pooling method is applied to sample the output of θ , φ , and g , so that the size of the feature map output is reduced to half of the original.
Similar to the traditional U-Net [7], the non-local block U-Net contains both contracting and expanding paths. There are some advantages to use non-local block in the U-Net [7] including the non-local operations that can directly capture remote dependencies and can also improve the correlation of distant pixels for gaining a richer feature map. However, the usage of non-local block in U-Net [7] for detecting video anomaly is not new. For example, Zhang et al. [13] used three non-local blocks in their U-Net frame prediction model for detecting surveillance video anomaly. Nevertheless, Wang et al. [11] suggested that more non-local blocks lead to better results. To this end, we propose to employ four non-local blocks in the U-Net architecture for our prediction network. Figure 6 shows our adopted non-local block U-Net. Basically, it consists of a traditional U-Net [7] and four non-local blocks. Those non-local blocks are added in downsampling. On the whole, the contracting path of our non-local U-Net extracts features through the convolutional layer and downsampling; while the expanding path precisely pinpoints and restores the information to the greatest extent. There are also skip layers to fuse the information. In the contracting path, 3 × 3 convolution followed by ReLU activation and 2 × 2 maximum pooling layers are applied. A  2 × 2 maximum pooling layer is added after every two convolutional layers (e.g., the 2 × 2 maximum pooling layer between Layer 1 and Layer 2). Every step of upsampling in the expanding path bridges to the contracting path for fabricating high-quality images.

6.2. Replacing Basic U-Net by aUnet

Attention mechanism contributes to suppress less relevant features and emphasizing more important features in image classification. Commonly, attention in deep neural networks is mainly implemented in two forms, namely, hard (or stochastic) attention and soft (or deterministic) attention. The implementation of hard attention is non-differentiable [65], whereas soft attention models are differentiable [66]. Thus, the soft attention is a preferable form of implementation. Roughly, there exist two types of soft-attention-based models: (i) Usage of intermediate layers of the architecture, and (ii) Usage of beginning layers of the architecture. For example, for image classification, Jetley et al. [67] introduced attention gates at three intermediate layers in a VGG network and a weighted combination of the attention maps was employed in the last layer. Oktay et al. [14] introduced attention gates in a U-Net architecture for segmentation of the pancreas. In both models, the attention blocks employ activation maps from the intermediate layers in the model as saliency maps for enhancing the discriminative characteristics of extracted intermediary features. However, Vakanski et al. [15] claimed that the segmentation performance would not be improved using the self-attention blocks described in Jetley et al. [67] and Oktay et al. [14]. Thus, they applied the attention blocks at beginning layers of their architecture for breast tumor segmentation in ultrasound images. Basically, their proposed attention block utilized pre-computed saliency maps that specified to target spatial regions.
Our design of the attention blocks in U-Net was inspired by the attention blocks of Vakanski et al. [15]. Differently from their network, our proposed attention blocks in this work utilizes motion saliency maps that point out to target salient regions of motion. Further, there are some internal architectural differences. For example, the pre-computed input spatial salient map of Vakanski et al. [15] is down-sampled through a max-pooling layer following the standard Equation (11) as:
β = γ κ + 2 ρ s + 1 ,
where γ , β , κ , ρ , and s indicate number of input features, the number of output features, convolution kernel size, convolution padding size, and convolution stride size, respectively. We also follow Equation (11), but our instantaneously computed motion saliency map is down-sampled through a 1 × 1 convolution followed by ReLU activation and 2 × 2 stride operation. A graphical representation of our proposed aU-Net is presented in Figure 2.

6.2.1. Operation of the aUnet

Essentially, our proposed aU-Net in Figure 2 consists of a standard U-Net, a motion saliency map, and τ number of attention blocks with τ { 1 , 2 , 3 , 4 } . The discussion of motion saliency map covers in the next subsection. However, the input feature map is down-sampled through a 2 × 2 max-pooling layer and then fed to the attention block. The motion saliency map is fed to the τ th attention block with horizontal and vertical spatial dimensions of 256 / 2 τ 1 × 256 / 2 τ 1 pixels. This feeding is performed directly for the first attention block, but for other attention blocks indirectly via their preceding attention blocks. At the τ th attention block, the motion saliency map with horizontal and vertical spatial dimensions of 256 / 2 τ 1 × 256 / 2 τ 1 pixels is passed through 1 × 1 convolution layer followed by ReLU activation, 2 × 2 stride layer, and 128 number of filters. After 2 × 2 max-pooling, the input feature map at the τ th attention block with horizontal and vertical spatial dimensions of 256 / 2 τ × 256 / 2 τ pixels is also passed through 1 × 1 convolution layer followed by ReLU activation, 2 × 2 stride layer, and 128 number of filters. The spatial dimensions of both input feature map with size of 256 / 2 τ × 256 / 2 τ × 128 and the motion saliency map with size of 256 / 2 τ × 256 / 2 τ × 128 match, and then they perform a summation at an element-wise sum block. Its output is an intermediate feature map with size of 256 / 2 τ × 256 / 2 τ × 128 . This map is further refined through a series of linear 3 × 3 × 128 and 1 × 1 × convolution layers followed by ReLU activation. A sigmoid activation function normalizes the values into the range of [ 0 , 1 ] and outputs a semi-attention map with a spatial size of 256 / 2 τ × 256 / 2 τ × 1 . This semi-attention map with size of 256 / 2 τ × 256 / 2 τ × 1 and the max-pooled feature map with size of 256 / 2 τ × 256 / 2 τ × ς perform a multiplication at an element-wise product block, where ς = 32 and ς = 64 for the first-second and third-fourth attention blocks, respectively. Its output is an attention map with size of 256 / 2 τ × 256 / 2 τ × ς , which is propagated to the next layer of the standard U-Net for further processing.

6.2.2. Motion Saliency Map

Normally, the human vision system pays more attention to the moving objects than the static regions. For this reason, motion becomes one of the key features of the visual attention model. Due to the elapse of time, an attention region on a frame becomes inattention region. We can define such phenomenon using a decay attention factor d e A t t with 1 d e A t t 255  as:
ϖ = 255 d e A t t ,
where ϖ indicates the number of attention frames for a region (i.e., ϖ frames later an attention region becomes a background region). For example, all current motion regions are paying maximum attention, but with d e A t t = 60 all such regions become zero attention regions after ϖ = 255 / 60 = 4 frames. Normally, it is not important to process all the regions in a frame. To speedup computation, we can obtain a region of interest (RoI) obtained by a motion heat map [68,69,70,71,72] to apply on the calculation of motion saliency map. Figure 7 indicates a straightforward RoI.

6.2.3. Algorithm

Algorithm 1 gives details of our motion saliency map creation algorithm. It assumes the foreground information of the current frame as the most salient feature.
Algorithm 1: Creation of Motion Saliency Map
Electronics 12 01517 i001
Based on d e A t t values, the motion saliency map typically comprises of multiple attention maps with different resolutions, thereby capturing salient features across multiple levels of feature abstraction. Figure 8 shows some sample camera view frames from UCSD-Ped2 [31] dataset. Figure 9 shows two samples output of Algorithm 1 considering frames in Figure 8 and d e A t t = 60 .

7. Anomaly Detection on Testing Data

If we assume that normal events can be well predicted, then we can easily apply the difference between the predicted frame f t ^ and its ground truth f t for anomaly prediction. In anomaly detection methods, two common metrics, namely MSE and PSNR, are widely employed to calculate the anomaly scores. The MSE is used to measure the quality of predicted images by computing a Euclidean distance between the prediction and its ground truth of all pixels, whereas the PSNR represents a measure of the peak error. The MSE is easy to compute, but sensitive to outliers. On the other hand, in the absence of error, if two images f t and f t ^ (or f t ^ ) are identical, then the MSE is zero but the PSNR becomes infinite (or division by zero) [73]. In spite of that, Mathieu et al. [6] showed that PSNR is a better way for image quality assessment.
We assume that if any frame f t holds an appearance anomaly (e.g., someone carrying a gun) then the rNet can improve its determinability, whereas if f t contains a motion anomaly (e.g., people fighting on the street) the pNet can improve its determinability. Therefore, we bring the error scores of appearance and prediction into a cascaded score to compute the final error score of each frame for detecting its anomalousness. We evaluate the anomaly of appearance based on reconstruction error of the entire frame. This technique preserves the complete appearance of target objects in frame. We define pixel-wise partial anomaly score individually estimated on the prediction error of P S N R p and the reconstruction error of P S N R r from prediction and reconstruction networks, respectively, sharing for the same frame as:
P S N R r = 20 l o g 10 255 1 W H x = 1 W y = 1 H f t x , y f t ^ x , y 2
P S N R p = 20 l o g 10 255 1 W H x = 1 W y = 1 H f t x , y f t ^ x , y 2 ,
where W, H, ( x , y ) are the width, height, and spatial index of the frame, respectively. The maximum pixel value of an image is 255. Large P S N R r or P S N R p of a frame hints that it is more likely to be normal. Roughly, it is possible to use P S N R r or P S N R p for determining whether an abnormal event has occurred. For example, if P S N R r or P S N R p is greater than any defined threshold, the frame is normal, otherwise abnormal. Nevertheless, it expects more refinement for better performance.
The partial frame-level score of the t-th frame S p a r t ( t ) is computed as a weighted combination of the two incomplete scores as follows:
S p a r t ( t ) = ( σ 1 ) ( ω 1 ) ( P S N R r ) + ( σ 2 ) ( ω 2 ) ( P S N R p ) ,
where ω 1 and ω 2 are the weights, which normalize the two scores to the same scale. They can be calculated on the training data of n images using Equations (16) and (17) as:
ω 1 = 1 10 l o g 10 1 n i = 1 n P S N R r i
ω 2 = 1 10 l o g 10 1 n i = 1 n P S N R p i .
The hyper parameters of σ 1 > 0 and σ 2 > 0 are used to control the contribution of corresponding score to the summation, which can be adjusted appropriately for the importance of the appearance and motion. We perform a normalization of S p a r t ( t ) using Equation (18) as:
S n o r m ( t ) = e S p a r t ( t ) λ ν ,
where ν > 0 and λ > 0 belong to shape and scale parameters, respectively [74]. The occurrence of abnormal events in video has continuity, i.e., abnormal events cannot appear in a single frame, but appear in multiple consecutive frames. Consequently, we utilize not only the current frame but also the past and future frames to compute the final anomaly score using Equation (19) as:
S f r a m e ( t ) = 1 η 2 i = 0 η ( η i ) ( S n o r m ( t ± i ) ) ,
where the anomaly score of the t-th frame S f r a m e ( t ) consists of the S n o r m ( t ) as current frame and the S n o r m ( t ± i ) with i = 1 , 2 , , η of η past and future frames. The score of S f r a m e ( t ) estimated from a frame of abnormal event is expected to be higher compared with the ones of normal event. Therefore, we can predict whether a frame is normal or abnormal based on S f r a m e ( t ) . One can set a threshold to distinguish normal or abnormal frames.

8. Larger Error Gap Guaranteed by rpNet

Ideally, both pNet and rNet can produce their own outputs. We assume that the output of either pNet or rNet can individually provide necessary anomaly scores, but may not provide sufficient anomaly scores used for anomaly detection. The gain of the rpNet individually relies on pNet and rNet. The overall gain of the rpNet equals to the product of the individual gain of pNet and rNet. Mathematically, if G 1 and G 2 indicate the gains of pNet and rNet, respectively, then the overall gain G o v e r a l l can be formulated by Equation (20) as:
G o v e r a l l = ( G 1 ) ( G 2 ) .
When the gain of pNet and rNet applies the decibel (dB) expression, the Equation (20) yields:
(21) l o g ( G o v e r a l l ) = l o g ( ( G 1 ) ( G 2 ) ) (22)        = l o g ( G 1 ) + l o g ( G 2 )
G o v e r a l l dB = G 1 dB + G 2 dB .
For example, Figure 10 conveys a simplified schematic diagram of the rpNet along with any instance of video frames if pNet and rNet achieve 41.44 dB and 40.80 dB, respectively, then the overall process has a gain of 82.24 dB.
Using a simple simulation, we wish to explain that the rpNet can provide better anomalous detection results by providing higher anomaly scores for abnormal cases in videos than that of either pNet or rNet individually. Explicitly, the rpNet can provide an improved reconstruction error gap by increasing the output signal strength of pNet and rNet.
Assume that a hypothetical video surveillance system has captured the following four scenarios of people: (i) Normal walk and gather but sudden evacuation after an unwanted event, (ii) normal walk and sudden split after an incident, (iii) someone intentionally passing opposite of the main stream, and (iv) sudden run after an explosion. In addition, assume that both pNet and rNet are trained with a normal video cases and can detect those abnormal video events by providing the anomaly scores as depicted in Figure 11. The ground truths for four scenarios are given, but the anomaly scores of the rpNet are calculated.
Table 2 shows the analyzing report of Figure 11 in qualitatively and quantitatively. The mean ACC scores of pNet and rNet are 0.7740 and 0.8762, respectively. The mean ACC of the rpNet is 0.9595, which is definitely higher than those scores. To gain such ACC score, the rpNet has to come up against a mean false alarm rate of 0.0313. Nevertheless, on the average, the rpNet achieves 16.74% better ACC score than the mean ACC score of the pNet and rNet. At the rising edge, the values of root MSE (RMSE) are 15.0416, 6.4226, and 3.2404 for the rNet, pNet, and rpNet, respectively. The RMSE is 10.7321 / 3.2404 = 3.312 times less in the rpNet compared with the mean RMSE of rNet and pNet. Similarly, at the falling edge, the RMSE is 13.9240 / 2.6926 = 5.1712 times less in the rpNet. The coefficient of variation of the RMSE, denoted as CV(RMSE), is 0.0038 / 0.0012 = 3.1667 and 0.0047 / 0.0009 = 5.2222 times less in the rpNet at rising and falling edges, respectively, compared with the mean CV(RMSE) of rNet and pNet.
Taking into account the data in Table 2, upon ROC curve analysis the scores of 0.674, 0.841, and 0.973 can be obtained from rNet, pNet, and rpNet, respectively. From Figure 12, it is noticeable that the rpNet became the highest performative model considering data in Table 2. The rpNet achieves 1.3002 times or 30.02% better AUC scores than the mean AUC score of rNet and pNet. Explicitly, the simulated events in Figure 11 show evidence that the rpNet can guarantee larger error gap on the identical ground of both rNet and pNet. This proposition is also supported by the practical results from the experimental setup.
In essence, the aforementioned straightforward simulation shows that the rpNet is capable of achieving certain incremental factor of the reconstruction error gap by increasing the signal strength of the anomaly scores.

9. Experimental Setup and Results

Our implementation was performed by Python based on the TensorFlow framework [75]. Both training and evaluation of the model were performed on an Intel ® Core T M i7-7800X CPU @3.50 GHz along with NVIDIA’s graphics card GeForce GTX 1080. We used the Adam optimizer [64] for training and set the learning rate to 0.0001 and 0.00001 for the generator and discriminator, respectively. The input images are resized to 256 × 256 pixels and converted to gray-scale. We trained our model using five publicly available datasets, as illustrated in Table 3, namely UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], ShanghaiTech-Campus [18], and UMN [36] datasets with normal events. For evaluation, we used both normal and abnormal frames of those datasets. The training procedure was iterated up to a maximum of 100 epochs. The batch size was set to 4. AUC metric was used to evaluate the overall model performance.
As our methodology possesses six combinational models, namely AE-Unet (Ours), AEcUnet (Ours), AEnUnet (Ours), AEcnUnet (Ours), AEaUnet (Ours), and AEcaUnet (Ours), we conduct experiment each of them individually. Table 4 lists miscellaneous parameter values used during experiments. Figure 13, Figure 14, Figure 15, Figure 16 and Figure 17 demonstrate sample results of AEcaUnet (Ours) using parameters in Table 4. For a better visualization, the rectangles on camera view images were highlighted manually. The pink region indicates the ground truth of abnormal events. It is observable that the partial results of prediction network are superior to that of reconstruction network. This is due to the fact that the prediction network is capable of being extracted for better quality of features from the available videos. However, the partial results of both networks contribute as a complement towards the final performance of each model by confirming certain degree of augmentation of the reconstruction error gap.

10. Experimental Result Comparison

In the literature, there are widely used common datasets that are used to test the performance of different deep models, while other datasets were mainly used to test the generalization ability of those models for detecting crowd anomaly in video streams. Table 5 compares frame-level AUC scores among miscellaneous methods and the most frequently used crowd datasets.
From Table 5, it is notable that our method could not demonstrate an outright accuracy score. However, from Table 5, it is hard to notice the best performative method as an individual method could not achieve an absolute better performance. For example, Mu et al. [109], Cho et al. [131], Xia et al. [104], Zahid et al. [87], and Roy et al. [91] achieved the best AUC scores of 0.952, 0.992, 0.922, 0.940, and 0.997 from UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], ShanghaiTech-Campus [18], and UMN [36], respectively. Unambiguously, considering experimental results in Table 5, it is very hard to find that one algorithm is better than its alternatives. Usually, the nonparametric statistical analysis can be used for superiority measure [134], but all models were not tested against always the same five datasets in Table 5. Henceforth, based on the chosen datasets by the authors of various models in Table 5, mainly for statistical analysis, we can divide the tabular data in Table 5 into six following groups:
G 1: Methods of this group were tested against the datasets of UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], ShanghaiTech-Campus [18], and UMN [36], or were published before 2020 (i.e., Table 6).
G 2: Methods of this group were tested against the datasets of UCSD-Ped2 [31], CUHK-Avenue [32], and ShanghaiTech-Campus [18] (i.e., Table 7).
G 3: Methods of this group were tested against the datasets of UCSD-Ped1 [31], UCSD-Ped2 [31], and CUHK-Avenue [32] (i.e., Table 8).
G 4: Methods of this group were tested against the datasets of UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], and ShanghaiTech-Campus [18] (i.e., Table 9).
G 5: Methods of this group were tested against the datasets of UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], and UMN [36] (i.e., Table 10).
G 6: Methods of this group were tested against the datasets of UCSD-Ped1 [31], UCSD-Ped2 [31], and UMN [36] (i.e., Table 11).
The frame-level failure score of AUC (fAUC) is defined by Equation (24) as:
fAUC = 1 − AUC.
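As an illustration of how the entries of Tables 6–11 are obtained, the following sketch computes fAUC values and their arithmetic, geometric, and harmonic means; the sample input is the AEcaUnet (Ours) row of Table 6 expressed as AUC scores.

```python
from statistics import fmean, geometric_mean, harmonic_mean

def fauc(auc):
    """Frame-level failure score: fAUC = 1 - AUC (Equation (24))."""
    return 1.0 - auc

# Frame-level AUC scores of AEcaUnet (Ours) on the five datasets
# (UCSD-Ped1, UCSD-Ped2, CUHK-Avenue, ShanghaiTech-Campus, UMN), cf. Table 6.
auc_scores = [0.918, 0.989, 0.916, 0.798, 0.987]
fauc_scores = [fauc(a) for a in auc_scores]  # approx. [0.082, 0.011, 0.084, 0.202, 0.013]

print(fmean(fauc_scores))           # arithmetic mean, approx. 0.0784
print(geometric_mean(fauc_scores))  # geometric mean,  approx. 0.0457
print(harmonic_mean(fauc_scores))   # harmonic mean,   approx. 0.0254
```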
Table 6. The fAUC scores of G 1 . Column-wise the best numerical result is shown in bold.
Models | Ped1 [31] | Ped2 [31] | Avenue [32] | Campus [18] | UMN [36] | Arithmetic mean | Geometric mean | Harmonic mean
Zhang et al. [127] | 0.0580 | 0.0710 | 0.1950 | 0.1970 | 0.0120 | 0.1066 | 0.0717 | 0.0400
Roy et al. [91] | 0.1500 | 0.0250 | 0.1300 | 0.1900 | 0.0030 | 0.0996 | 0.0488 | 0.0127
Liu et al. [2] | 0.1690 | 0.0460 | 0.1510 | 0.2720 | - | 0.1595 | 0.1337 | 0.1054
Hasan et al. [1] | 0.2500 | 0.1500 | 0.2000 | 0.3910 | - | 0.2478 | 0.2327 | 0.2195
LuoLG [78] | 0.2450 | 0.1190 | 0.2300 | - | - | 0.1980 | 0.1886 | 0.1782
Luo et al. [18] | - | 0.0780 | 0.1830 | 0.3200 | - | 0.1937 | 0.1659 | 0.1401
Nguyen et al. [27] | - | 0.0380 | 0.1310 | - | - | 0.0845 | 0.0706 | 0.0589
Ionescu et al. [22] | 0.3160 | 0.1780 | 0.1940 | - | - | 0.2293 | 0.2218 | 0.2153
AE-Unet (Ours) | 0.1520 | 0.0980 | 0.1750 | 0.2660 | 0.0700 | 0.1522 | 0.1372 | 0.1233
AEcUnet (Ours) | 0.1380 | 0.0660 | 0.1370 | 0.2390 | 0.0350 | 0.1230 | 0.1009 | 0.0801
AEnUnet (Ours) | 0.1280 | 0.0430 | 0.1290 | 0.2260 | 0.0230 | 0.1098 | 0.0819 | 0.0577
AEcnUnet (Ours) | 0.1120 | 0.0290 | 0.1260 | 0.2180 | 0.0240 | 0.1018 | 0.0735 | 0.0512
AEaUnet (Ours) | 0.1250 | 0.0310 | 0.1130 | 0.2200 | 0.0200 | 0.1018 | 0.0719 | 0.0482
AEcaUnet (Ours) | 0.0820 | 0.0110 | 0.0840 | 0.2020 | 0.0130 | 0.0784 | 0.0457 | 0.0254
Table 6 presents the fAUC scores of the G 1 group with the related evaluation. Although many methods belong to this group, rigorous statistical analysis is very difficult to perform; for example, the method of Nguyen et al. [27] was tested on only two datasets, whereas the method of Zhang et al. [127] was tested on five datasets. Thus, instead of rigorous statistical analysis, we use arithmetic, geometric, and harmonic means only for evaluation. The method of Zhang et al. [127] presented the best performance on UCSD-Ped1 [31], whereas AEcaUnet (Ours) demonstrated the best performance on UCSD-Ped2 [31] and CUHK-Avenue [32]. The method of Roy et al. [91] showed slightly better performance on UMN [36]. However, the methods of Zhang et al. [127], Roy et al. [91], and AEcaUnet (Ours) showed approximately the same performance on ShanghaiTech-Campus [18]. Nevertheless, the overall performance of AEcaUnet (Ours) is better than that of either Roy et al. [91] or Zhang et al. [127]. Explicitly, by referring to Table 6, AEcaUnet (Ours) seemingly showed the best performance in G 1.
For G 2, G 3, G 4, G 5, and G 6, on the other hand, the simple means do not give any direct indication of superiority. As these groups contain the necessary and sufficient data, we perform nonparametric statistical analysis to measure the superiority among the models.

11. Nonparametric Statistical Analysis

The Friedman test [135] and its derivatives (e.g., the Iman-Davenport test [136]) are commonly regarded as among the most popular nonparametric tests for multiple comparisons [137]. The mathematical formulations of the Friedman [135], aligned Friedman [138], and Quade [139] tests can be found in Quade [139] and Westfall et al. [140]. While the Friedman test [135] ranks a set of algorithms by performance in descending order, both the aligned Friedman [138] and Quade [139] tests can provide additional information. On the other hand, the Nemenyi [141] test has the unique advantage of an associated plot that presents the results of a fair comparison. If the distance between two algorithms is less than the Nemenyi [141] post hoc critical distance, then there is no statistically significant difference between them. Usually, confidence limits of 90% or 95% are used to support claims on the superiority of models. Accordingly, we perform the Friedman [135], aligned Friedman [138], and Quade [139] tests for average rankings as well as the Nemenyi [141] post hoc critical distance diagram (CDD) for validating fair comparisons.
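For readers who wish to reproduce this kind of ranking, the sketch below computes per-dataset ranks, average Friedman ranks, and the Friedman chi-square statistic with SciPy; the Nemenyi critical distance then follows from the standard formula CD = q_α · sqrt(k(k + 1)/(6N)). The fAUC values are toy numbers for three hypothetical models, and the critical value q_α must be looked up in a studentized-range table; the constant shown for k = 3 at α = 0.05 is taken from such a table and should be treated as illustrative.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Rows: datasets (blocks), columns: models (treatments); entries are fAUC scores.
# Toy fAUC values for three hypothetical models on four datasets.
fauc = np.array([
    [0.011, 0.038, 0.026],
    [0.084, 0.131, 0.074],
    [0.202, 0.284, 0.251],
    [0.013, 0.030, 0.021],
])

# Rank models per dataset (rank 1 = lowest fAUC = best), then average the ranks.
ranks = np.apply_along_axis(rankdata, 1, fauc)
avg_ranks = ranks.mean(axis=0)
print("average Friedman ranks:", avg_ranks)

# Friedman chi-square statistic and p-value (each argument is one model's scores).
stat, p_value = friedmanchisquare(*fauc.T)
print(f"Friedman statistic = {stat:.4f}, p-value = {p_value:.6f}")

# Nemenyi post hoc critical distance: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
k, n = fauc.shape[1], fauc.shape[0]
q_alpha = 2.343  # illustrative studentized-range value for k = 3, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
print(f"Nemenyi critical distance = {cd:.4f}")
```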

11.1. Average Ranking of G 2

Looking at the fAUC values in Table 7, it is clear that Cho et al. [131], Zhou et al. [125], and Zahid et al. [87] showed the best performance on the UCSD-Ped2 [31], CUHK-Avenue [32], and ShanghaiTech-Campus [18] datasets, respectively, in their associated experimental setups. In addition, Vu et al. [108], AEcaUnet (Ours), and Cho et al. [131] obtained the best arithmetic, geometric, and harmonic means of fAUC, respectively. The Friedman [135], aligned Friedman [138], and Quade [139] tests were applied to the fAUC scores in Table 7 to obtain the average ranking of each model; the resulting rankings are recorded in the right part of Table 7. From the average ranks, the Friedman [135] statistic (distributed according to chi-square with 40 degrees of freedom) was 140.663182, with a computed p-value of 0.0000000001. The aligned Friedman [138] statistic (distributed according to chi-square with 40 degrees of freedom) was 126.3223, with a computed p-value of 0.000000000137. The Quade [139] statistic (distributed according to the F-distribution with 40 and 200 degrees of freedom) was 3.123448, with a computed p-value of 0.000000075521.
From the Friedman [135] test, AEcaUnet (Ours) obtained the best rank with a score of 3.3333, whereas Vu et al. [108], Cho et al. [131], and Zhou et al. [125] obtained the second, third, and fourth best ranks with scores of 6.5000, 7.3333, and 7.5833, respectively. Similarly, from the aligned Friedman [138] test, AEcaUnet (Ours) achieved the best rank with a score of 17.3333, whereas Vu et al. [108] obtained the second best rank with a score of 30. From the Quade [139] test, AEcaUnet (Ours) secured the best rank with a score of 4.0476, whereas Vu et al. [108] obtained the second best rank with a score of 8. On average ranking, in group G 2, our proposed method AEcaUnet (Ours) outperformed its alternatives, e.g., Vu et al. [108], Cho et al. [131], Zhou et al. [125], Roy et al. [91], Mu et al. [109], Wu et al. [107], and Zahid et al. [87].
Table 7. Multiple comparison test for G 2 using fAUC. Column-wise the best numerical result is shown in bold.
Models | Ped2 [31] | Avenue [32] | Campus [18] | Arithmetic mean | Geometric mean | Harmonic mean | Friedman [135] rank | Aligned Friedman [138] rank | Quade [139] rank
WangCYJT [79] | 0.0370 | 0.1170 | 0.2340 | 0.1293 | 0.1004 | 0.0753 | 17.6667 | 105.0000 | 17.4286
Dong et al. [81] | 0.0440 | 0.1510 | 0.2630 | 0.1527 | 0.1204 | 0.0905 | 30.9167 | 176.5833 | 29.9048
Zahid et al. [87] | 0.2110 | 0.2500 | 0.0600 | 0.1737 | 0.1468 | 0.1181 | 33.3333 | 192.1667 | 28.8571
Doshi et al. [89] | 0.0220 | 0.1360 | 0.2840 | 0.1473 | 0.0947 | 0.0533 | 19.7500 | 117.9167 | 21.2857
Roy et al. [91] | 0.0250 | 0.1300 | 0.1900 | 0.1150 | 0.0852 | 0.0567 | 9.3333 | 54.0000 | 9.5238
Lu et al. [94] | 0.0380 | 0.1420 | 0.2210 | 0.1337 | 0.1060 | 0.0792 | 21.5000 | 122.5000 | 20.5238
Tang et al. [96] | 0.0400 | 0.1600 | 0.2800 | 0.1600 | 0.1215 | 0.0862 | 32.6667 | 180.8333 | 32.1190
Lee et al. [99] | 0.0340 | 0.1000 | 0.2380 | 0.1240 | 0.0932 | 0.0688 | 13.5000 | 85.8333 | 14.4286
Song et al. [101] | 0.0930 | 0.1080 | 0.3000 | 0.1670 | 0.1444 | 0.1285 | 33.8333 | 195.1667 | 33.2381
Sun et al. [103] | 0.0900 | 0.1110 | 0.0780 | 0.0930 | 0.0920 | 0.0911 | 16.3333 | 90.8333 | 17.2619
Feng et al. [105] | 0.0300 | 0.1400 | 0.2230 | 0.1310 | 0.0978 | 0.0667 | 17.5833 | 101.5833 | 17.3095
Zhang et al. [106] | 0.0460 | 0.1320 | 0.2640 | 0.1473 | 0.1170 | 0.0906 | 28.5000 | 164.1667 | 28.2857
Wu et al. [107] | 0.0120 | 0.1530 | 0.2720 | 0.1457 | 0.0793 | 0.0321 | 17.6667 | 108.0000 | 19.7143
Vu et al. [108] | 0.0400 | 0.0800 | 0.0630 | 0.0610 | 0.0586 | 0.0562 | 6.5000 | 30.0000 | 8.0000
Mu et al. [109] | 0.0530 | 0.1030 | 0.0790 | 0.0783 | 0.0756 | 0.0728 | 12.0000 | 56.0000 | 13.9048
LiLS [110] | 0.0450 | 0.1090 | 0.2600 | 0.1380 | 0.1084 | 0.0851 | 21.8333 | 134.0000 | 21.5952
Cai et al. [112] | 0.0320 | 0.1270 | 0.2580 | 0.1390 | 0.1016 | 0.0698 | 19.1667 | 123.0000 | 18.9048
Doshi et al. [26] | 0.0280 | 0.1360 | 0.2910 | 0.1517 | 0.1035 | 0.0645 | 24.0000 | 136.1667 | 24.7857
Zhong et al. [3] | 0.0230 | 0.1110 | 0.2930 | 0.1423 | 0.0908 | 0.0537 | 16.0000 | 97.1667 | 18.5238
Chang et al. [116] | 0.0330 | 0.1290 | 0.2630 | 0.1417 | 0.1038 | 0.0717 | 21.8333 | 131.1667 | 21.7619
Esquivel et al. [117] | 0.1300 | 0.1700 | 0.1300 | 0.1433 | 0.1422 | 0.1411 | 31.1667 | 180.0000 | 28.1905
Park et al. [118] | 0.0400 | 0.1500 | 0.2800 | 0.1567 | 0.1189 | 0.0851 | 31.1667 | 175.8333 | 31.0238
Doshi et al. [119] | 0.0300 | 0.1130 | 0.2640 | 0.1357 | 0.0964 | 0.0653 | 17.1667 | 104.3333 | 18.3571
Li et al. [120] | 0.0290 | 0.1340 | 0.2180 | 0.1270 | 0.0946 | 0.0645 | 14.1667 | 85.5000 | 14.0000
Hao et al. [121] | 0.0310 | 0.1340 | 0.2620 | 0.1423 | 0.1029 | 0.0689 | 21.5000 | 129.8333 | 21.3095
Zhang et al. [13] | 0.0410 | 0.1480 | 0.2730 | 0.1540 | 0.1183 | 0.0862 | 30.5000 | 173.6667 | 30.4048
Shao et al. [123] | 0.0510 | 0.1470 | 0.2830 | 0.1603 | 0.1285 | 0.1002 | 35.0000 | 192.1667 | 34.8095
Zou et al. [124] | 0.0270 | 0.1280 | 0.2730 | 0.1427 | 0.0981 | 0.0610 | 19.2500 | 117.5833 | 19.9524
Zhou et al. [125] | 0.0260 | 0.0740 | 0.2510 | 0.1170 | 0.0785 | 0.0536 | 7.5833 | 60.4167 | 9.3095
Zhang et al. [127] | 0.0710 | 0.1950 | 0.1970 | 0.1543 | 0.1397 | 0.1235 | 32.0000 | 182.0000 | 28.6190
Liu et al. [129] | 0.0190 | 0.1020 | 0.2620 | 0.1277 | 0.0798 | 0.0453 | 9.2500 | 68.5833 | 10.6667
Cho et al. [131] | 0.0080 | 0.1200 | 0.2370 | 0.1217 | 0.0610 | 0.0218 | 7.3333 | 56.3333 | 8.7143
ParkLCL [132] | 0.0420 | 0.1460 | 0.2760 | 0.1547 | 0.1192 | 0.0875 | 31.5000 | 175.3333 | 31.1905
Le et al. [133] | 0.0260 | 0.1330 | 0.2640 | 0.1410 | 0.0970 | 0.0603 | 17.9167 | 115.4167 | 18.6429
AE-Unet (Ours) | 0.0980 | 0.1750 | 0.2660 | 0.1797 | 0.1658 | 0.1525 | 38.5000 | 230.3333 | 37.0000
AEcUnet (Ours) | 0.0660 | 0.1370 | 0.2390 | 0.1473 | 0.1293 | 0.1126 | 30.3333 | 180.1667 | 28.4286
AEnUnet (Ours) | 0.0430 | 0.1290 | 0.2260 | 0.1327 | 0.1078 | 0.0847 | 21.2500 | 125.7500 | 20.6190
AEcnUnet (Ours) | 0.0290 | 0.1260 | 0.2180 | 0.1243 | 0.0927 | 0.0638 | 12.0000 | 77.5000 | 11.9762
AEaUnet (Ours) | 0.0310 | 0.1130 | 0.2200 | 0.1213 | 0.0917 | 0.0657 | 12.1667 | 69.5000 | 12.8810
AEcaUnet (Ours) | 0.0110 | 0.0840 | 0.2020 | 0.0990 | 0.0571 | 0.0278 | 3.3333 | 17.3333 | 4.0476

11.2. Validation of Fair Comparisons for G 2

Figure 18 depicts the Nemenyi [141] post hoc critical distance diagram at the significance level α = 0.05 using the fAUC scores in Table 7. Hereby, we define the hypothesis as "the difference is significant". From Figure 18, it is noticeable that the rank distance for the hypothesis of AEcaUnet (Ours) vs. Zhang et al. [13] is | 31.1667 − 3.3333 | = 27.8334 (heavy pink line), which is greater than the Nemenyi [141] post hoc critical distance of 26.242 (heavy red line) at α = 0.05 (i.e., a 95% confidence limit). Consequently, they are statistically significantly different, as their rank distance exceeds the critical distance by | 27.8334 − 26.242 | = 1.5914. Similarly, another 19 hypotheses on the differences within group G 2 are statistically significant, as their rank distances are greater than 26.242 at the 95% confidence limit. The remaining hypotheses on the differences within group G 2 are not statistically significant, as their rank distances are less than 26.242. For example, the hypothesis on the difference of Mu et al. [109] vs. AE-Unet (Ours) is not statistically significant, as their rank distance falls short of the critical distance by | 26.242 + 11.8333 − 37.5 | = 0.5753. However, the performance of AEcaUnet (Ours) is remarkably different from that of AEcUnet (Ours), Zhang et al. [13], Dong et al. [81], Esquivel et al. [117], Park et al. [118], ParkLCL [132], Zhang et al. [106], Tang et al. [96], Zahid et al. [87], Song et al. [101], Shao et al. [123], and AE-Unet (Ours). On the same ground, the models of Vu et al. [108], Cho et al. [131], and Zhou et al. [125] are remarkably different from Shao et al. [123] and AE-Unet (Ours) only. Hence, in group G 2 at a confidence limit of 95%, AEcaUnet (Ours) outperforms Vu et al. [108], Cho et al. [131], Zhou et al. [125], Liu et al. [129], Roy et al. [91], etc., which also agrees with the average rankings of the aligned Friedman [138] and Quade [139] tests shown in Table 7.
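The pairwise decisions illustrated above reduce to comparing the absolute difference of average ranks against the critical distance. A small sketch of that check is given below, using illustrative average ranks and an illustrative critical distance; the actual ranks and critical distances are those reported in Table 7 and Figure 18.

```python
from itertools import combinations

# Illustrative average ranks (model name -> average rank on the CDD).
avg_ranks = {"Model-A": 3.33, "Model-B": 31.17, "Model-C": 11.83, "Model-D": 37.50}
critical_distance = 26.242  # illustrative Nemenyi CD at alpha = 0.05

for (m1, r1), (m2, r2) in combinations(avg_ranks.items(), 2):
    distance = abs(r1 - r2)
    verdict = "significant" if distance > critical_distance else "not significant"
    print(f"{m1} vs. {m2}: |{r1} - {r2}| = {distance:.2f} -> {verdict}")
```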

11.3. Average Ranking of G 3

Observing the fAUC values in Table 8, it is clear that Mu et al. [109], Wang et al. [84], and Xia et al. [104] showed the best performance on the UCSD-Ped1 [31], UCSD-Ped2 [31], and CUHK-Avenue [32] datasets, respectively, in their associated experimental setups. Moreover, AEcaUnet (Ours) obtained the best arithmetic and geometric means of fAUC in our experimental setup, whereas Wang et al. [84] obtained the best harmonic mean. The Friedman [135], aligned Friedman [138], and Quade [139] tests were applied to the fAUC scores in Table 8 to obtain the average ranking of each model; the results are recorded in the right part of Table 8. From the average ranks, the Friedman [135] statistic (distributed according to chi-square with 45 degrees of freedom) was 185.858927, with a computed p-value of 0.000000000001. The aligned Friedman [138] statistic (distributed according to chi-square with 45 degrees of freedom) was 180.364336, with a computed p-value of 0.000000000091. The Quade [139] statistic (distributed according to the F-distribution with 45 and 225 degrees of freedom) was 6.597079, with a computed p-value of 0.0000000001.
From the Friedman [135] test, AEcaUnet (Ours) achieved the best rank with a score of 2.5, whereas Wang et al. [84] and Xia et al. [104] obtained the second and third best ranks with scores of 4.8333 and 6.8333, respectively. Similarly, from the aligned Friedman [138] test, AEcaUnet (Ours) again gained the best rank with a score of 12.3333, whereas Wang et al. [84] and Xia et al. [104] obtained the second and third best ranks with scores of 31.3333 and 36.6667, respectively. From the Quade [139] test, AEcaUnet (Ours) secured the best rank with a score of 2.8571, whereas Mu et al. [109], Wang et al. [84], and AEcnUnet (Ours) obtained the next best ranks. While the simple averages failed to show the superiority, AEcaUnet (Ours) obtained the best result in the Friedman [135], aligned Friedman [138], and Quade [139] tests. Statistically, among all samples of experimental results in group G 3 (Table 8), the method of AEcaUnet (Ours) outperformed its alternatives (e.g., Wang et al. [84], Mu et al. [109], and Xia et al. [104]).
Table 8. Multiple comparison test for G 3 using fAUC. Column-wise the best numerical result is shown in bold.
Models | Ped1 [31] | Ped2 [31] | Avenue [32] | Arithmetic mean | Geometric mean | Harmonic mean | Friedman [135] rank | Aligned Friedman [138] rank | Quade [139] rank
WangCYJT [79] | 0.1660 | 0.0370 | 0.1170 | 0.1067 | 0.0896 | 0.0721 | 18.8333 | 112.3333 | 21.3810
Chen et al. [80] | 0.1280 | 0.0350 | 0.1270 | 0.0967 | 0.0829 | 0.0678 | 12.7500 | 77.5833 | 12.8571
Fan et al. [82] | 0.0510 | 0.0780 | 0.1660 | 0.0983 | 0.0871 | 0.0780 | 19.0000 | 122.8333 | 14.0000
Nawar. et al. [83] | 0.2480 | 0.0890 | 0.2320 | 0.1897 | 0.1724 | 0.1532 | 41.8333 | 253.1667 | 41.9524
Wang et al. [84] | 0.1330 | 0.0090 | 0.1010 | 0.0810 | 0.0494 | 0.0233 | 4.8333 | 31.3333 | 6.3333
WuLLSS [85] | 0.1760 | 0.0720 | 0.1450 | 0.1310 | 0.1225 | 0.1133 | 35.1667 | 207.1667 | 36.5714
Yang et al. [86] | 0.0650 | 0.0630 | 0.1680 | 0.0987 | 0.0883 | 0.0806 | 20.4167 | 122.7500 | 16.4524
Zahid et al. [87] | 0.4150 | 0.2110 | 0.2500 | 0.2920 | 0.2797 | 0.2691 | 46.0000 | 273.1667 | 46.0000
Zhou et al. [88] | 0.1610 | 0.0400 | 0.1400 | 0.1137 | 0.0966 | 0.0782 | 25.5000 | 146.6667 | 26.1429
Roy et al. [91] | 0.1500 | 0.0250 | 0.1300 | 0.1017 | 0.0787 | 0.0552 | 13.6667 | 87.5000 | 14.2381
Ji et al. [93] | 0.1600 | 0.0200 | 0.2200 | 0.1333 | 0.0890 | 0.0493 | 22.0833 | 136.0833 | 19.7143
Lu et al. [94] | 0.1370 | 0.0380 | 0.1420 | 0.1057 | 0.0904 | 0.0738 | 19.6667 | 114.5000 | 18.6667
Rama. et al. [95] | 0.1400 | 0.0600 | 0.1280 | 0.1093 | 0.1024 | 0.0949 | 24.8333 | 150.8333 | 25.6190
Tang et al. [96] | 0.1700 | 0.0400 | 0.1600 | 0.1233 | 0.1029 | 0.0808 | 30.0833 | 167.0833 | 30.7143
Almaz. et al. [97] | 0.0630 | 0.1670 | 0.1250 | 0.1183 | 0.1096 | 0.1005 | 27.5000 | 154.8333 | 25.7619
Wu0S [98] | 0.1700 | 0.0400 | 0.1300 | 0.1133 | 0.0960 | 0.0778 | 23.7500 | 139.4167 | 25.6190
Prawiro et al. [100] | 0.1600 | 0.0400 | 0.1400 | 0.1133 | 0.0964 | 0.0781 | 24.5000 | 144.8333 | 24.9286
Song et al. [101] | 0.0950 | 0.0930 | 0.1080 | 0.0987 | 0.0984 | 0.0982 | 20.8333 | 119.5000 | 20.9762
Yan et al. [102] | 0.2500 | 0.0900 | 0.2040 | 0.1813 | 0.1662 | 0.1499 | 41.2500 | 247.7500 | 41.5238
Sun et al. [103] | 0.0980 | 0.0900 | 0.1110 | 0.0997 | 0.0993 | 0.0989 | 21.8333 | 122.3333 | 21.9762
Xia et al. [104] | 0.1200 | 0.0340 | 0.0780 | 0.0773 | 0.0683 | 0.0593 | 6.8333 | 36.6667 | 8.0000
Wu et al. [107] | 0.1150 | 0.0120 | 0.1530 | 0.0933 | 0.0595 | 0.0304 | 10.0833 | 59.7500 | 7.4048
Vu et al. [108] | 0.1500 | 0.0400 | 0.0800 | 0.0900 | 0.0783 | 0.0679 | 12.5000 | 76.0000 | 15.0000
Mu et al. [109] | 0.0480 | 0.0530 | 0.1030 | 0.0680 | 0.0640 | 0.0607 | 8.0000 | 46.3333 | 6.2857
LiLS [110] | 0.1470 | 0.0450 | 0.1090 | 0.1003 | 0.0897 | 0.0785 | 18.0833 | 104.7500 | 19.8095
LiCL [111] | 0.0950 | 0.0710 | 0.1650 | 0.1103 | 0.1036 | 0.0978 | 27.3333 | 156.6667 | 24.2857
Sayp. et al. [113] | 0.1470 | 0.0430 | 0.1320 | 0.1073 | 0.0941 | 0.0797 | 22.5000 | 131.5000 | 22.7143
Gutoski et al. [115] | 0.2810 | 0.1070 | 0.1530 | 0.1803 | 0.1663 | 0.1543 | 40.7500 | 243.5833 | 42.2619
Zhong et al. [3] | 0.1740 | 0.0230 | 0.1110 | 0.1027 | 0.0763 | 0.0510 | 13.4167 | 81.0833 | 16.3571
Esq. et al. [117] | 0.2900 | 0.1300 | 0.1700 | 0.1967 | 0.1858 | 0.1762 | 43.5000 | 255.6667 | 44.4762
Li et al. [120] | 0.1880 | 0.0290 | 0.1340 | 0.1170 | 0.0901 | 0.0635 | 21.8333 | 128.1667 | 23.7857
Hao et al. [121] | 0.1750 | 0.0310 | 0.1340 | 0.1133 | 0.0899 | 0.0660 | 21.1667 | 124.6667 | 22.8810
Zhang et al. [13] | 0.1640 | 0.0410 | 0.1480 | 0.1177 | 0.0998 | 0.0805 | 28.4167 | 158.7500 | 28.7143
Shao et al. [123] | 0.2240 | 0.0510 | 0.1470 | 0.1407 | 0.1189 | 0.0972 | 33.6667 | 198.6667 | 35.2857
Hu et al. [126] | 0.1930 | 0.1470 | 0.1900 | 0.1767 | 0.1753 | 0.1739 | 42.3333 | 251.0000 | 42.2857
Zhang et al. [127] | 0.0580 | 0.0710 | 0.1950 | 0.1080 | 0.0929 | 0.0823 | 24.9167 | 142.5833 | 20.0000
Wang et al. [128] | 0.1200 | 0.1100 | 0.1300 | 0.1200 | 0.1197 | 0.1194 | 30.4167 | 170.4167 | 29.4286
Feng et al. [130] | 0.1640 | 0.0920 | 0.1870 | 0.1477 | 0.1413 | 0.1344 | 38.2500 | 229.5833 | 37.3810
Liu et al. [28] | 0.1490 | 0.0340 | 0.1350 | 0.1060 | 0.0881 | 0.0689 | 17.7500 | 112.2500 | 17.7143
AE-Unet (Ours) | 0.1520 | 0.0980 | 0.1750 | 0.1417 | 0.1376 | 0.1333 | 37.1667 | 222.6667 | 35.6667
AEcUnet (Ours) | 0.1380 | 0.0660 | 0.1370 | 0.1137 | 0.1077 | 0.1010 | 28.9167 | 166.2500 | 28.7381
AEnUnet (Ours) | 0.1280 | 0.0430 | 0.1290 | 0.1000 | 0.0892 | 0.0773 | 17.1667 | 99.1667 | 16.7143
AEcnUnet (Ours) | 0.1120 | 0.0290 | 0.1260 | 0.0890 | 0.0742 | 0.0584 | 8.2500 | 53.0833 | 7.7619
AEaUnet (Ours) | 0.1250 | 0.0310 | 0.1130 | 0.0897 | 0.0759 | 0.0611 | 9.5833 | 55.5833 | 10.0000
AEcaUnet (Ours) | 0.0820 | 0.0110 | 0.0840 | 0.0590 | 0.0423 | 0.0261 | 2.5000 | 12.3333 | 2.8571

11.4. Validation of Fair Comparisons for G 3

Figure 19 depicts the Nemenyi [141] post hoc critical distance diagram at the significance level α = 0.10 using both the experimental and the mean fAUC values in Table 8. From Figure 19, it is noticeable that the hypothesis on the difference of AEcaUnet (Ours) vs. Shao et al. [123] is statistically significant. Similarly, another 61 hypotheses on the differences within group G 3 are statistically significant, as their rank distances are greater than 28.3372 at the 90% confidence limit.
While the performances of many methods in group G 3 are remarkably different from one another, AEcaUnet (Ours) is at the top of the list. Clearly, the performance of AEcaUnet (Ours) is remarkably different from that of Shao et al. [123]. On the other hand, the performances of Wang et al. [84], Xia et al. [104], Mu et al. [109], and others are not remarkably different from that of Shao et al. [123] at a confidence limit of 90%. Explicitly, in group G 3 at a confidence limit of 90%, the method of AEcaUnet (Ours) outperformed its alternatives (e.g., Wang et al. [84], Xia et al. [104], and Mu et al. [109]). This also agrees with the average rankings of the aligned Friedman [138] and Quade [139] tests in Table 8.

11.5. Average Ranking of G 4

Observing the fAUC values in Table 9, it is clear that Zhang et al. [127], AEcaUnet (Ours), Vu et al. [108], and Zahid et al. [87] showed the best performance on the UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], and ShanghaiTech-Campus [18] datasets, respectively, in their associated experimental setups. Moreover, Mu et al. [109] obtained the best arithmetic mean of fAUC in their experimental setup, whereas AEcaUnet (Ours) obtained the best geometric and harmonic means. The Friedman [135], aligned Friedman [138], and Quade [139] tests were applied to the fAUC values in Table 9 to obtain the average ranking of each model; the results are recorded in the right part of Table 9. From the average ranks, the Friedman [135] statistic (distributed according to chi-square with 25 degrees of freedom) was 95.531136, with a computed p-value of 0.00000000001. The aligned Friedman [138] statistic (distributed according to chi-square with 25 degrees of freedom) was 86.675237, with a computed p-value of 0.000000009964. The Quade [139] statistic (distributed according to the F-distribution with 25 and 150 degrees of freedom) was 3.380389, with a computed p-value of 0.000001997843. From the Friedman [135] test, AEcaUnet (Ours) gained the best rank with a score of 2.8571, whereas Mu et al. [109], Vu et al. [108], AEcnUnet (Ours), and AEaUnet (Ours) obtained the second, third, fourth, and fifth best ranks with scores of 4.7143, 5.6429, 7.1429, and 8.2143, respectively.
Table 9. Multiple comparison test for G 4 using fAUC. Column-wise the best numerical result is shown in bold.
Models | Ped1 [31] | Ped2 [31] | Avenue [32] | Campus [18] | Arithmetic mean | Geometric mean | Harmonic mean | Friedman [135] rank | Aligned Friedman [138] rank | Quade [139] rank
WangCYJT [79] | 0.1660 | 0.0370 | 0.1170 | 0.2340 | 0.1385 | 0.1139 | 0.0872 | 12.7143 | 91.8571 | 13.3571
Zahid et al. [87] | 0.4150 | 0.2110 | 0.2500 | 0.0600 | 0.2340 | 0.1904 | 0.1438 | 22.1429 | 151.1429 | 20.5000
Roy et al. [91] | 0.1500 | 0.0250 | 0.1300 | 0.1900 | 0.1237 | 0.0981 | 0.0671 | 8.2857 | 55.1429 | 9.0357
Lu et al. [94] | 0.1370 | 0.0380 | 0.1420 | 0.2210 | 0.1345 | 0.1131 | 0.0885 | 12.1429 | 86.8571 | 11.8214
Tang et al. [96] | 0.1700 | 0.0400 | 0.1600 | 0.2800 | 0.1625 | 0.1321 | 0.0983 | 19.9286 | 129.6429 | 19.6250
Song et al. [101] | 0.0950 | 0.0930 | 0.1080 | 0.3000 | 0.1490 | 0.1301 | 0.1181 | 16.7857 | 110.7857 | 15.5000
Sun et al. [103] | 0.0980 | 0.0900 | 0.1110 | 0.0780 | 0.0943 | 0.0935 | 0.0927 | 8.5000 | 53.9286 | 8.3929
Wu et al. [107] | 0.1150 | 0.0120 | 0.1530 | 0.2720 | 0.1380 | 0.0871 | 0.0391 | 9.7143 | 70.1429 | 11.0714
Vu et al. [108] | 0.1500 | 0.0400 | 0.0800 | 0.0630 | 0.0833 | 0.0742 | 0.0666 | 5.6429 | 34.0714 | 7.0536
Mu et al. [109] | 0.0480 | 0.0530 | 0.1030 | 0.0790 | 0.0708 | 0.0675 | 0.0644 | 4.7143 | 26.7143 | 5.3214
LiLS [110] | 0.1470 | 0.0450 | 0.1090 | 0.2600 | 0.1402 | 0.1170 | 0.0951 | 13.2857 | 93.0000 | 13.2321
Sayp. et al. [113] | 0.1470 | 0.0430 | 0.1320 | 0.2700 | 0.1480 | 0.1225 | 0.0968 | 16.1429 | 106.8571 | 15.6786
Zhong et al. [3] | 0.1740 | 0.0230 | 0.1110 | 0.2930 | 0.1503 | 0.1068 | 0.0649 | 12.6429 | 87.7857 | 14.8571
Esq. et al. [117] | 0.2900 | 0.1300 | 0.1700 | 0.1300 | 0.1800 | 0.1699 | 0.1618 | 22.0000 | 143.7143 | 20.5000
Li et al. [120] | 0.1880 | 0.0290 | 0.1340 | 0.2180 | 0.1422 | 0.1123 | 0.0771 | 12.5000 | 91.5000 | 13.5179
Hao et al. [121] | 0.1750 | 0.0310 | 0.1340 | 0.2620 | 0.1505 | 0.1175 | 0.0812 | 15.5714 | 105.5714 | 16.3393
Zha. et al. [13] | 0.1640 | 0.0410 | 0.1480 | 0.2730 | 0.1565 | 0.1284 | 0.0978 | 18.7143 | 124.0000 | 18.6071
Shao et al. [123] | 0.2240 | 0.0510 | 0.1470 | 0.2830 | 0.1763 | 0.1476 | 0.1163 | 21.8571 | 142.8571 | 21.9643
Zha. et al. [127] | 0.0580 | 0.0710 | 0.1950 | 0.1970 | 0.1302 | 0.1121 | 0.0960 | 12.7143 | 89.0000 | 11.6786
AE-Unet (Ours) | 0.1520 | 0.0980 | 0.1750 | 0.2660 | 0.1728 | 0.1623 | 0.1523 | 22.2857 | 149.4286 | 21.1429
AEcUnet (Ours) | 0.1380 | 0.0660 | 0.1370 | 0.2390 | 0.1450 | 0.1314 | 0.1181 | 17.5000 | 118.3571 | 16.0357
AEnUnet (Ours) | 0.1280 | 0.0430 | 0.1290 | 0.2260 | 0.1315 | 0.1125 | 0.0925 | 11.7857 | 81.3571 | 11.6607
AEcnUnet (Ours) | 0.1120 | 0.0290 | 0.1260 | 0.2180 | 0.1212 | 0.0972 | 0.0715 | 7.1429 | 55.1429 | 7.3393
AEaUnet (Ours) | 0.1250 | 0.0310 | 0.1130 | 0.2200 | 0.1222 | 0.0991 | 0.0746 | 8.2143 | 57.2143 | 8.4821
AEcaUnet (Ours) | 0.0820 | 0.0110 | 0.0840 | 0.2020 | 0.0947 | 0.0625 | 0.0333 | 2.8571 | 22.4286 | 3.4643
The method of AEcaUnet (Ours) ranked first in both the aligned Friedman [138] and Quade [139] tests. On average, in group G 4, AEcaUnet (Ours) outperformed its alternative methods, e.g., Mu et al. [109], Vu et al. [108], AEcnUnet (Ours), AEaUnet (Ours), Sun et al. [103], Roy et al. [91], and Wu et al. [107].

11.6. Validation of Fair Comparisons for G 4

Figure 20 depicts the Nemenyi [141] post hoc critical distance diagram at the significance level α = 0.10 using the fAUC values in Table 9.
From Figure 20, it is noticeable that the hypothesis on the difference of AEcaUnet (Ours) vs. Zhang et al. [13] is statistically significant. Likewise, another 14 hypotheses on the differences within group G 4 are statistically significant, as their rank distances are greater than 14.3905 at the 95% confidence limit. In group G 4, the performances of AEcaUnet (Ours), Mu et al. [109], and Vu et al. [108] are statistically significantly different from their alternatives at α = 0.05. Clearly, the performance of AEcaUnet (Ours) is remarkably different from that of Zhang et al. [13]. Yet, the performances of Mu et al. [109] and Vu et al. [108] are not remarkably different from that of Zhang et al. [13] at a confidence limit of 95%. Explicitly, in group G 4 at a confidence limit of 95%, the method of AEcaUnet (Ours) outperformed its alternatives (e.g., Mu et al. [109] and Vu et al. [108]). This also agrees with the average rankings of the aligned Friedman [138] and Quade [139] tests in Table 9.

11.7. Average Ranking of G 5

Judging by the fAUC scores in Table 10, it is clear that Zhang et al. [127], AEcaUnet (Ours), Xia et al. [104], and Roy et al. [91] showed the best performance on the UCSD-Ped1 [31], UCSD-Ped2 [31], CUHK-Avenue [32], and UMN [36] datasets, respectively, in their associated experimental setups. Moreover, AEcaUnet (Ours) obtained the best arithmetic and geometric means of fAUC in our experimental setup, whereas Roy et al. [91] achieved the best harmonic mean. The Friedman [135], aligned Friedman [138], and Quade [139] tests were applied to the fAUC values in Table 10 to obtain the average ranking of each model; the results are recorded in the right part of Table 10. From the average ranks, the Friedman [135] statistic (distributed according to chi-square with 11 degrees of freedom) was 44.428571, with a computed p-value of 0.000006. The aligned Friedman [138] statistic (distributed according to chi-square with 11 degrees of freedom) was 44.071296, with a computed p-value of 0.000007061196. The Quade [139] statistic (distributed according to the F-distribution with 11 and 66 degrees of freedom) was 4.098057, with a computed p-value of 0.000136314342. From the rigorous statistical point of view, AEcaUnet (Ours) gained the best rank with a score of 1.8571 in the Friedman [135] test, whereas Roy et al. [91], AEaUnet (Ours), and Xia et al. [104] obtained the second, third, and fourth best ranks with scores of 3.7857, 4.2143, and 4.7143, respectively. Using the aligned Friedman [138] test, AEcaUnet (Ours) attained the best rank with a score of 8.7143. Considering the Quade [139] test, AEcaUnet (Ours) also obtained the best rank with a score of 2.1429.
Table 10. Multiple comparison test for G 5 using fAUC. Column-wise the best numerical result is shown in bold.
Models | Ped1 [31] | Ped2 [31] | Avenue [32] | UMN [36] | Arithmetic mean | Geometric mean | Harmonic mean | Friedman [135] rank | Aligned Friedman [138] rank | Quade [139] rank
Roy et al. [91] | 0.1500 | 0.0250 | 0.1300 | 0.0030 | 0.0770 | 0.0348 | 0.0103 | 3.7857 | 26.0714 | 4.6071
Wu0S [98] | 0.1700 | 0.0400 | 0.1300 | 0.1100 | 0.1125 | 0.0993 | 0.0839 | 9.6429 | 63.2143 | 9.6071
Xia et al. [104] | 0.1200 | 0.0340 | 0.0780 | 0.0300 | 0.0655 | 0.0556 | 0.0477 | 4.7143 | 28.7143 | 4.8929
LiCL [111] | 0.0950 | 0.0710 | 0.1650 | 0.0200 | 0.0877 | 0.0687 | 0.0496 | 7.5714 | 48.1429 | 7.0357
Gut. et al. [115] | 0.2810 | 0.1070 | 0.1530 | 0.0080 | 0.1373 | 0.0779 | 0.0277 | 8.4286 | 57.4286 | 8.1786
Zha. et al. [127] | 0.0580 | 0.0710 | 0.1950 | 0.0120 | 0.0840 | 0.0557 | 0.0334 | 5.9286 | 39.6429 | 5.6250
AE-Unet (Ours) | 0.1520 | 0.0980 | 0.1750 | 0.0700 | 0.1237 | 0.1162 | 0.1087 | 11.1429 | 75.1429 | 10.9286
AEcUnet (Ours) | 0.1380 | 0.0660 | 0.1370 | 0.0350 | 0.0940 | 0.0813 | 0.0686 | 9.0000 | 60.0000 | 8.7857
AEnUnet (Ours) | 0.1280 | 0.0430 | 0.1290 | 0.0230 | 0.0808 | 0.0636 | 0.0486 | 6.7143 | 43.8571 | 6.6429
AEcnUnet (Ours) | 0.1120 | 0.0290 | 0.1260 | 0.0240 | 0.0728 | 0.0560 | 0.0430 | 5.0000 | 31.0000 | 4.9643
AEaUnet (Ours) | 0.1250 | 0.0310 | 0.1130 | 0.0200 | 0.0723 | 0.0544 | 0.0404 | 4.2143 | 28.0714 | 4.5893
AEcaUnet (Ours) | 0.0820 | 0.0110 | 0.0840 | 0.0130 | 0.0475 | 0.0315 | 0.0208 | 1.8571 | 8.7143 | 2.1429

11.8. Validation of Fair Comparisons for G 5

The methods in G 5 show statistically significant performance differences at both α = 0.05 and α = 0.10. Figure 21 depicts the Nemenyi [141] post hoc critical distance diagram at the significance level α = 0.10 (i.e., a 90% confidence limit) using the fAUC scores in Table 10.
From Figure 21, it is noticeable that the hypothesis on the difference of AEcaUnet (Ours) vs. Gutoski et al. [115] is statistically significant. Similarly, another eight hypotheses on the differences within group G 5 are statistically significant, as their rank distances are greater than 5.8396 at the 90% confidence limit. The performance of AEcaUnet (Ours) is remarkably different from that of Gutoski et al. [115], AEcUnet (Ours), Wu0S [98], and AE-Unet (Ours). Nevertheless, the performance of Roy et al. [91] is not remarkably different from that of Gutoski et al. [115] and AEcUnet (Ours) at a confidence limit of 90%. Consequently, at a confidence limit of 90%, AEcaUnet (Ours) is a better-performing method than Roy et al. [91]. Explicitly, in group G 5 at a confidence limit of 90%, AEcaUnet (Ours) outperformed Roy et al. [91], AEaUnet (Ours), Xia et al. [104], AEcnUnet (Ours), Zhang et al. [127], etc. This also agrees with the average rankings of the aligned Friedman [138] and Quade [139] tests in Table 10.

11.9. Average Ranking of G 6

Observing the fAUC values in Table 11, it is clear that Zhang et al. [127], AEcaUnet (Ours), and Roy et al. [91] showed the best performance on the UCSD-Ped1 [31], UCSD-Ped2 [31], and UMN [36] datasets, respectively, in their associated experimental setups. Moreover, AEcaUnet (Ours) obtained the best arithmetic mean of fAUC (0.0353) in our experimental setup, whereas Roy et al. [91] obtained the best geometric and harmonic means. The Friedman [135], aligned Friedman [138], and Quade [139] tests were applied to the fAUC scores in Table 11 to obtain the average ranking of each model; the results are recorded in the right part of Table 11. From the average ranks, the Friedman [135] statistic (distributed according to chi-square with 13 degrees of freedom) was 44.871429, with a computed p-value of 0.000022. The aligned Friedman [138] statistic (distributed according to chi-square with 13 degrees of freedom) was 43.121293, with a computed p-value of 0.000042872968. The Quade [139] statistic (distributed according to the F-distribution with 13 and 65 degrees of freedom) was 1.536464, with a computed p-value of 0.000155833749. From the rigorous statistical point of view, AEcaUnet (Ours) obtained the best rank with a score of 2.1667 in the Friedman [135] test, whereas Roy et al. [91] obtained the second best rank with a score of 3.1667. Using the aligned Friedman [138] test, AEcaUnet (Ours) obtained the best rank with a score of 9.6667. Considering the Quade [139] test, AEcaUnet (Ours) also secured the best rank with a score of 2.3810.
Table 11. Multiple comparison test for G 6 using fAUC. Column-wise the best numerical result is shown in bold.
Models | Ped1 [31] | Ped2 [31] | UMN [36] | Arithmetic mean | Geometric mean | Harmonic mean | Friedman [135] rank | Aligned Friedman [138] rank | Quade [139] rank
Roy et al. [91] | 0.1500 | 0.0250 | 0.0030 | 0.0593 | 0.0224 | 0.0079 | 3.1667 | 20.5000 | 4.1905
Wu et al. [92] | 0.1600 | 0.0760 | 0.0070 | 0.0810 | 0.0440 | 0.0185 | 7.5000 | 45.3333 | 8.0000
Wu0S [98] | 0.1700 | 0.0400 | 0.1100 | 0.1067 | 0.0908 | 0.0751 | 11.7500 | 67.9167 | 11.8571
Xia et al. [104] | 0.1200 | 0.0340 | 0.0300 | 0.0613 | 0.0497 | 0.0422 | 7.5000 | 38.0000 | 7.2857
LiCL [111] | 0.0950 | 0.0710 | 0.0200 | 0.0620 | 0.0513 | 0.0402 | 7.8333 | 41.1667 | 6.8571
Gutoski et al. [115] | 0.2810 | 0.1070 | 0.0080 | 0.1320 | 0.0622 | 0.0218 | 10.0000 | 57.8333 | 10.2857
Alafif et al. [122] | 0.1720 | 0.0430 | 0.0190 | 0.0780 | 0.0520 | 0.0367 | 8.9167 | 50.2500 | 9.1667
Zhang et al. [127] | 0.0580 | 0.0710 | 0.0120 | 0.0470 | 0.0367 | 0.0262 | 4.2500 | 25.9167 | 3.7381
AE-Unet (Ours) | 0.1520 | 0.0980 | 0.0700 | 0.1067 | 0.1014 | 0.0966 | 12.7500 | 75.4167 | 12.1905
AEcUnet (Ours) | 0.1380 | 0.0660 | 0.0350 | 0.0797 | 0.0683 | 0.0589 | 10.5000 | 60.6667 | 10.0476
AEnUnet (Ours) | 0.1280 | 0.0430 | 0.0230 | 0.0647 | 0.0502 | 0.0400 | 8.0833 | 42.0833 | 7.9762
AEcnUnet (Ours) | 0.1120 | 0.0290 | 0.0240 | 0.0550 | 0.0427 | 0.0353 | 5.3333 | 28.6667 | 5.4286
AEaUnet (Ours) | 0.1250 | 0.0310 | 0.0200 | 0.0587 | 0.0426 | 0.0332 | 5.2500 | 31.5833 | 5.5952
AEcaUnet (Ours) | 0.0820 | 0.0110 | 0.0130 | 0.0353 | 0.0227 | 0.0167 | 2.1667 | 9.6667 | 2.3810

11.10. Validation of Fair Comparisons for G 6

Figure 22 depicts the Nemenyi [141] post hoc critical distance diagram at the significance level α = 0.10 using the fAUC scores in Table 11.
From Figure 22, it is noticeable that the hypothesis on the difference of AEcaUnet (Ours) vs. Gutoski et al. [115] is statistically significant. Likewise, another six hypotheses on the differences within group G 6 are statistically significant, as their rank distances are greater than 7.5355 at the 90% confidence limit. The performance of AEcaUnet (Ours) is significantly better than that of Gutoski et al. [115], AEcUnet (Ours), Wu0S [98], and AE-Unet (Ours). Nevertheless, the difference between Roy et al. [91] and both Gutoski et al. [115] and AEcUnet (Ours) is not statistically significant at a confidence limit of 90%. Consequently, at a confidence limit of 90%, AEcaUnet (Ours) performs better than Roy et al. [91]. Explicitly, in group G 6 at a confidence limit of 90%, AEcaUnet (Ours) outperformed Roy et al. [91], Zhang et al. [127], etc. This also agrees with the average rankings of the aligned Friedman [138] and Quade [139] tests in Table 11.
In summary, the aforementioned rigorous statistical analysis on groups G 1, G 2, G 3, G 4, G 5, and G 6 provides ranking measures for the methods of Table 5, and the method of AEcaUnet (Ours) places at the top of the ranking of each group. This indicates that AEcaUnet (Ours) (i.e., a skip-connected autoencoder fused with an attention block U-Net) possesses the ability to extract high-quality features from the available videos and also guarantees a certain degree of augmentation of the reconstruction error gap.

11.11. Limitation of Our Framework

Although some of our proposed models statistically demonstrated their superiority over many methods on various popular datasets, they did not achieve the single best experimental score on any dataset in Table 5. The types of anomalies in dissimilar scenarios are not identical. Our entire-frame-based evaluation preserves the complete appearance of the target objects in a video frame. Our models judge, using an entire-frame-based anomaly score, whether a video frame belongs to a normal or an abnormal event, but they do not localize the abnormal events within the frame.

11.12. Future Work

Fundamentally, our rpNet is a natural extension of CNN-based video classification. Recently, it has been demonstrated that a pure transformer-based architecture can outperform its convolutional counterparts in image classification [142]. The transformer does not process the input sequentially, but in parallel; for each element, it integrates information from the other elements via self-attention, and it can therefore better capture long-range contextual relationships in video. The vision transformer (ViT) is a successful application of the transformer in computer vision. In the future, we wish to augment our generalized architecture by incorporating ViT technologies, extracting spatiotemporal tokens from the input video, which would then be encoded by a series of ViT layers. Moreover, we used the Sigmoid activation function, whose output is guaranteed to lie between 0 and 1, but the Sigmoid activation is prone to saturation; the ViT, in contrast, does not need any Sigmoid or Tanh activation. The ViT performs favorably over CNNs only if the dataset for pretraining is sufficiently large [143]. For example, in the experimental setup of Dosovitskiy et al. [143], below 100 million images the accuracy of the ResNet [60] variants (e.g., ResNet50x1 (BiT) and ResNet152x2 (BiT)) was better than that of the ViT, yet the accuracy of those variants did not improve as the number of samples grew from 100 million to 300 million images. Conversely, the ViT kept improving and hence outperformed all of its convolutional counterparts at 300 million images [143]. In short, the bigger the dataset, the greater the advantage of the ViT over CNNs. However, obtaining a huge crowd dataset (e.g., 300 million frames or more) is still a challenging task in computer vision.
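As a rough illustration of the kind of spatiotemporal tokenization we have in mind, the sketch below reshapes a short gray-scale clip into non-overlapping tubelet tokens followed by a linear projection. The patch sizes, embedding dimension, and helper name are assumptions for illustration only, not a finalized design of the envisioned ViT-based extension.

```python
import tensorflow as tf

def tubelet_tokens(clip, t_patch=2, s_patch=16, embed_dim=128):
    """Split a clip into non-overlapping spatiotemporal (tubelet) tokens.

    clip: tensor of shape (batch, frames, height, width, channels),
          e.g., (1, 8, 256, 256, 1) for eight gray-scale frames.
    Returns a tensor of shape (batch, num_tokens, embed_dim).
    """
    b, f, h, w, c = clip.shape
    # Group pixels into tubelets of size t_patch x s_patch x s_patch.
    x = tf.reshape(clip, (b, f // t_patch, t_patch, h // s_patch, s_patch,
                          w // s_patch, s_patch, c))
    x = tf.transpose(x, (0, 1, 3, 5, 2, 4, 6, 7))
    x = tf.reshape(x, (b, -1, t_patch * s_patch * s_patch * c))
    # Linear projection of the flattened tubelets to the token embedding space.
    return tf.keras.layers.Dense(embed_dim)(x)

tokens = tubelet_tokens(tf.random.normal((1, 8, 256, 256, 1)))
print(tokens.shape)  # (1, 1024, 128): 4 temporal x 16 x 16 spatial tubelets
```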

12. Conclusions

We proposed six deep models from a generalized architecture by fusing several alternatives of prediction and reconstruction networks to detect anomalies in videos efficiently. The fusion of the networks guaranteed a certain degree of augmentation of the reconstruction error gap. Experiments on five benchmark datasets demonstrated the potential of our models, and the detailed discussion verified their effectiveness in detecting abnormal video events. Some of our models showed promising results owing to their ability to extract good-quality features. By confirming an improved error gap and extracting better-quality features from the available videos, our proposed AEcaUnet demonstrated its superiority statistically, where the statistical results were based on the experimental results of miscellaneous methods and several of the most popular crowd datasets. We noticed that a skip-connected autoencoder with an attention block U-Net can extract the high-quality features needed for video anomaly detection. A statistical analysis of the results needs a high confidence limit to support its claims; we applied confidence limits of 90% and 95% to support the claims on the superiority of our models. In general, most of our proposed models are more performative and sophisticated than the existing ones (e.g., Liu et al. [2], Nguyen et al. [27], Zhong et al. [3], Zhang et al. [13], and Liu et al. [28]), and hence they can be applied in complex and realistic situations.

Author Contributions

Conceptualization, M.H.S.; methodology, M.H.S., L.J. and C.W.O.; software, M.H.S.; validation, M.H.S., L.J. and C.W.O.; formal analysis, M.H.S., L.J. and C.W.O.; investigation, M.H.S., L.J. and C.W.O.; resources, M.H.S., L.J. and C.W.O.; data curation, M.H.S., L.J. and C.W.O.; writing—original draft preparation, M.H.S., L.J. and C.W.O.; writing—review and editing, M.H.S., L.J. and C.W.O.; visualization, M.H.S.; supervision, L.J. and C.W.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work is a part of the AI4CITIZENS research project (number 320783) supported by the Research Council of Norway.

Data Availability Statement

The datasets analyzed during the current study are publicly available and their web links are given in Table 3.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hasan, M.; Choi, J.; Neumann, J.; Chowdhury, A.K.R.; Davis, L.S. Learning Temporal Regularity in Video Sequences. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742. [Google Scholar]
  2. Liu, W.; Luo, W.; Lian, D.; Gao, S. Future Frame Prediction for Anomaly Detection—A New Baseline. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545. [Google Scholar]
  3. Zhong, Y.; Chen, X.; Jiang, J.; Ren, F. A cascade reconstruction model with generalization ability evaluation for anomaly detection in videos. Pattern Recognit. 2022, 122, 108336. [Google Scholar] [CrossRef]
  4. Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; van den Hengel, A. Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]
  5. Park, H.; Noh, J.; Ham, B. Learning Memory-Guided Normality for Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14360–14369. [Google Scholar]
  6. Mathieu, M.; Couprie, C.; LeCun, Y. Deep multi-scale video prediction beyond mean square error. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  7. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 18 May 2015; Volume 9351, pp. 234–241. [Google Scholar]
  8. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  9. Dosovitskiy, A.; Fischer, P.; Ilg, E.; Häusser, P.; Hazirbas, C.; Golkov, V.; van der Smagt, P.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
  10. Wang, Z.; Zou, N.; Shen, D.; Ji, S. Non-local U-Net for Biomedical Image Segmentation. arXiv 2018, arXiv:1812.04103. [Google Scholar] [CrossRef]
  11. Wang, X.; Girshick, R.B.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  12. Buades, A.; Coll, B.; Morel, J.M. A Non-Local Algorithm for Image Denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; pp. 60–65. [Google Scholar]
  13. Zhang, Q.; Feng, G.; Wu, H. Surveillance video anomaly detection via non-local U-Net frame prediction. Multim. Tools Appl. 2022, 81, 27073–27088. [Google Scholar] [CrossRef]
  14. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.C.H.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  15. Vakanski, A.; Xian, M.; Freer, P. Attention Enriched Deep Learning Model for Breast Tumor Segmentation in Ultrasound Images. arXiv 2019, arXiv:1910.08978. [Google Scholar] [CrossRef]
  16. Xu, D.; Ricci, E.; Yan, Y.; Song, J.; Sebe, N. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection. In Proceedings of the British Machine Vision Conference (BMVC), Swansea, UK, 7–10 September 2015; pp. 8.1–8.12. [Google Scholar]
  17. Chong, Y.S.; Tay, Y.H. Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder. In Proceedings of the 14th International Symposium on Advances in Neural Networks (ISNN), Hokkaido, Japan, 21–26 June 2017; Volume 10262, pp. 189–196. [Google Scholar]
  18. Luo, W.; Liu, W.; Gao, S. A Revisit of Sparse Coding Based Anomaly Detection in Stacked RNN Framework. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 341–349. [Google Scholar]
  19. Sabokrou, M.; Khalooei, M.; Fathy, M.; Adeli, E. Adversarially Learned One-Class Classifier for Novelty Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3379–3388. [Google Scholar]
  20. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, Montreal, QC, Canada, 7–12 December 2015; pp. 802–810. [Google Scholar]
  21. Giorno, A.D.; Bagnell, J.A.; Hebert, M. A Discriminative Framework for Anomaly Detection in Large Videos. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Volume 9909, pp. 334–349. [Google Scholar]
  22. Ionescu, R.T.; Smeureanu, S.; Alexe, B.; Popescu, M. Unmasking the Abnormal Events in Video. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2914–2922. [Google Scholar]
  23. Lotter, W.; Kreiman, G.; Cox, D.D. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  24. Van Amersfoort, J.R.; Kannan, A.; Ranzato, M.A.; Szlam, A.; Tran, D.; Chintala, S. Transformation-Based Models of Video Sequences. arXiv 2017, arXiv:1701.08435. [Google Scholar]
  25. Chen, B.; Wang, W.; Wang, J. Video Imagination from a Single Image with Transformation Generation. In Proceedings of the Thematic Workshops of ACM Multimedia 2017, Mountain View, CA, USA, 23–27 October 2017; pp. 358–366. [Google Scholar]
  26. Doshi, K.; Yilmaz, Y. Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate. Pattern Recognit. 2021, 114, 107865. [Google Scholar] [CrossRef]
  27. Nguyen, T.N.; Meunier, J. Anomaly Detection in Video Sequence With Appearance-Motion Correspondence. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1273–1283. [Google Scholar]
  28. Liu, T.; Zhang, C.; Niu, X.; Wang, L. Spatio-temporal prediction and reconstruction network for video anomaly detection. PLoS ONE 2022, 17, e0265564. [Google Scholar] [CrossRef] [PubMed]
  29. Ku, T.; Yang, Q.; Zhang, H. Multilevel feature fusion dilated convolutional network for semantic segmentation. Int. J. Adv. Robot. Syst. 2021, 18, 17298814211007665. [Google Scholar] [CrossRef]
  30. Song, H.; Wang, W.; Zhao, S.; Shen, J.; Lam, K.M. Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11215, pp. 744–760. [Google Scholar]
  31. Chan, A.B.; Liang, Z.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, USA, 24–26 June 2008. [Google Scholar]
  32. Lu, C.; Shi, J.; Jia, J. Abnormal Event Detection at 150 FPS in MATLAB. In Proceedings of the International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 2720–2727. [Google Scholar]
  33. Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1647–1655. [Google Scholar]
  34. Lee, Y.; Hwang, H.; Shin, J.; Oh, B.T. Pedestrian detection using multi-scale squeeze-and-excitation module. Mach. Vis. Appl. 2020, 31, 55. [Google Scholar] [CrossRef]
  35. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks for Action Recognition in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2740–2755. [Google Scholar] [CrossRef] [Green Version]
  36. UMN. Detection of Unusual Crowd Activities in Both Indoor and Outdoor Scenes. 2021. Available online: http://mha.cs.umn.edu/proj_events.shtml#crowd (accessed on 20 January 2023).
  37. Shehu, H.A.; Ramadan, A.R.; Sharif, M.H. Artificial intelligence tools and their capabilities. Ploms AI 2021, 1, 1–7. [Google Scholar]
  38. Mahmoudi, S.A.; Sharif, M.H.; Ihaddadene, N.; Djeraba, C. Abnormal event detection in real time video. In Proceedings of the First International Workshop on Multimodal Interactions Analysis of Users in a Controlled Environment (MIAUCE), Chania, Greece, 24 October 2008. [Google Scholar]
  39. Sharif, M.H. An Eigenvalue Approach to Detect Flows and Events in Crowd Videos. J. Circuits Syst. Comput. 2017, 26, 1750110:1–1750110:50. [Google Scholar] [CrossRef] [Green Version]
  40. Ahmed, M.S.; Sharif, M.H.; Ihaddadene, N.; Djeraba, C. Detection of Abnormal Motions in Video. In Proceedings of the First International Workshop on Multimodal Interactions Analysis of Users in a Controlled Environment (MIAUCE), Chania, Greece, 24 October 2008; pp. 1–4. [Google Scholar]
  41. Kwon, K.; Lee, S.; Kim, S. AI-Based Home Energy Management System Considering Energy Efficiency and Resident Satisfaction. IEEE Internet Things J. 2022, 9, 1608–1621. [Google Scholar] [CrossRef]
  42. Sharif, M.H. A numerical approach for tracking unknown number of individual targets in videos. Digit. Signal Process. 2016, 57, 106–127. [Google Scholar] [CrossRef]
  43. Yavariabdi, A.; Kusetogullari, H. Change Detection in Multispectral Landsat Images Using Multiobjective Evolutionary Algorithm. IEEE Geosci. Remote. Sens. Lett. 2017, 14, 414–418. [Google Scholar] [CrossRef] [Green Version]
  44. Kusetogullari, H.; Yavariabdi, A.; Celik, T. Unsupervised Change Detection in Multitemporal Multispectral Satellite Images Using Parallel Particle Swarm Optimization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2151–2164. [Google Scholar] [CrossRef]
  45. Wakili, M.A.; Shehu, H.A.; Sharif, M.H.; Sharif, M.H.U.; Umar, A.; Kusetogullari, H.; Ince, I.F.; Uyaver, S. Classification of Breast Cancer Histopathological Images Using DenseNet and Transfer Learning. Comput. Intell. Neurosci. 2022, 2022, 1–31. [Google Scholar] [CrossRef]
  46. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  47. Kusetogullari, H.; Yavariabdi, A.; Hall, J.; Lavesson, N. DIGITNET: A Deep Handwritten Digit Detection and Recognition Method Using a New Historical Handwritten Digit Dataset. Big Data Res. 2021, 23, 100182. [Google Scholar] [CrossRef]
  48. Kusetogullari, H.; Yavariabdi, A.; Cheddad, A.; Grahn, H.; Hall, J. ARDIS: A Swedish historical handwritten digit dataset. Neural Comput. Appl. 2020, 32, 16505–16518. [Google Scholar] [CrossRef] [Green Version]
  49. Shehu, H.A.; Sharif, M.H.; Ramadan, R.A. Distributed Mutual Exclusion Algorithms for Intersection Traffic Problems. IEEE Access 2020, 8, 138277–138296. [Google Scholar]
  50. Ubaid, M.T.; Saba, T.; Draz, H.U.; Rehman, A.; Khan, M.U.G.; Kolivand, H. Intelligent Traffic Signal Automation Based on Computer Vision Techniques Using Deep Learning. IT Prof. 2022, 24, 27–33. [Google Scholar] [CrossRef]
  51. Englund, C.; Aksoy, E.E.; Alonso-Fernandez, F.; Cooney, M.D.; Pashami, S.; Åstrand, B. AI in Smart Cities: Challenges and approaches to enable road vehicle automation and smart traffic control. arXiv 2021, arXiv:2104.03150. [Google Scholar]
  52. Zhai, S.; Cheng, Y.; Lu, W.; Zhang, Z. Deep Structured Energy Based Models for Anomaly Detection. In Proceedings of the International Conference on Machine Learning (ICML), New York City, NY, USA, 19–24 June 2016; Volume 48, pp. 1100–1109. [Google Scholar]
  53. Roopak, M.; Tian, G.Y.; Chambers, J.A. Multi-objective-based feature selection for DDoS attack detection in IoT networks. IET Netw. 2020, 9, 120–127. [Google Scholar] [CrossRef]
  54. Shehu, H.A.; Sharif, M.H.; Sharif, M.H.U.; Datta, R.; Tokat, S.; Uyaver, S.; Kusetogullari, H.; Ramadan, R.A. Deep Sentiment Analysis: A Case Study on Stemmed Turkish Twitter Data. IEEE Access 2021, 9, 56836–56854. [Google Scholar] [CrossRef]
  55. Yu, X.; Liang, Y.; Lin, X.; Wan, J.; Wang, T.; Dai, H.N. Frequency Feature Pyramid Network With Global-Local Consistency Loss for Crowd-and-Vehicle Counting in Congested Scenes. IEEE Trans. Intell. Transp. Syst. 2022, 23, 9654–9664. [Google Scholar] [CrossRef]
  56. Asres, M.; Cummings, G.; Khukhunaishvili, A.; Parygin, P.; Cooper, S.; Yu, D.; Dittmann, J.; Omlin, C. Long Horizon Anomaly Prediction in Multivariate Time Series with Causal Autoencoders. Eur. Conf. Phm Soc. (Phme) 2022, 7, 21–31. [Google Scholar] [CrossRef]
  57. Sharif, M.H.; Jiao, L.; Omlin, C.W. Deep Crowd Anomaly Detection: State-of-the-Art, Challenges, and Future Research Directions. arXiv 2022, arXiv:2210.13927. [Google Scholar]
  58. Masci, J.; Meier, U.; Ciresan, D.C.; Schmidhuber, J. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Proceedings of the Artificial Neural Networks and Machine Learning—21st International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Volume 6791, pp. 52–59. [Google Scholar]
  59. Kim, T.; Oh, J.; Kim, N.; Cho, S.; Yun, S.Y. Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Virtual Event, 19–26 August 2021; pp. 2628–2635. [Google Scholar]
  60. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  61. Mao, X.J.; Shen, C.; Yang, Y.B. Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2802–2810. [Google Scholar]
  62. Sharif, M.H.; Djeraba, C. An entropy approach for abnormal activities detection in video streams. Pattern Recognit. 2012, 45, 2543–2561. [Google Scholar] [CrossRef]
  63. Isola, P.; Zhu, J.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar]
  64. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  65. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent Models of Visual Attention. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2204–2212. [Google Scholar]
  66. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025. [Google Scholar]
  67. Jetley, S.; Lord, N.A.; Lee, N.; Torr, P.H.S. Learn to Pay Attention. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  68. Sharif, M.H.; Ihaddadene, N.; Djeraba, C. Crowd behaviour monitoring on the escalator exits. In Proceedings of the 11th International Conference on Computer and Information Technology (ICCIT), Khulna, Bangladesh, 24–27 December 2008; pp. 194–200. [Google Scholar]
  69. Ihaddadene, N.; Sharif, M.H.; Djeraba, C. Crowd behaviour monitoring. In Proceedings of the International Conference on Multimedia, Vancouver, BC, Canada, 27–31 October 2008; pp. 1013–1014. [Google Scholar]
  70. Sharif, M.H.; Djeraba, C. A Simple Method for Eccentric Event Espial Using Mahalanobis Metric. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, 14th Iberoamerican Conference on Pattern Recognition, CIARP, Guadalajara, Mexico, 15–18 November 2009; Volume 5856, pp. 417–424. [Google Scholar]
  71. Sharif, M.H.; Djeraba, C. Exceptional motion frames detection by means of spatiotemporal region of interest features. In Proceedings of the International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 981–984. [Google Scholar]
  72. Sharif, M.H.; Ihaddadene, N.; Djeraba, C. Finding and Indexing of Eccentric Events in Video Emanates. J. Multim. 2010, 5, 22–35. [Google Scholar] [CrossRef] [Green Version]
  73. Salomon, D. Data Compression: The Complete Reference; Springer: London, UK, 2007. [Google Scholar]
  74. Sharif, M.H.; Uyaver, S.; Djeraba, C. Crowd Behavior Surveillance Using Bhattacharyya Distance Metric. In Proceedings of the Second International Symposium on Computational Modeling of Objects Represented in Images (CompIMAGE), Buffalo, NY, USA, 5–7 May 2010; pp. 311–323. [Google Scholar]
  75. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
  76. Lloyd, K.; Rosin, P.L.; Marshall, A.D.; Moore, S.C. Detecting violent and abnormal crowd activity using temporal analysis of grey level co-occurrence matrix (GLCM)-based texture measures. Mach. Vis. Appl. 2017, 28, 361–371. [Google Scholar] [CrossRef] [Green Version]
  77. Sanchez, F.L.; Hupont, I.; Tabik, S.; Herrera, F. Revisiting crowd behaviour analysis through deep learning: Taxonomy, anomaly detection, crowd emotions, datasets, opportunities and prospects. Inf. Fusion 2020, 64, 318–335. [Google Scholar] [CrossRef]
  78. Luo, W.; Liu, W.; Gao, S. Remembering history with convolutional LSTM for anomaly detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 439–444. [Google Scholar]
  79. Wang, X.; Che, Z.; Yang, K.; Jiang, B.; Tang, J.; Ye, J.; Wang, J.; Qi, Q. Robust Unsupervised Video Anomaly Detection by Multi-Path Frame Prediction. arXiv 2020, arXiv:2011.02763. [Google Scholar]
  80. Chen, D.; Wang, P.; Yue, L.; Zhang, Y.; Jia, T. Anomaly detection in surveillance video based on bidirectional prediction. Image Vis. Comput. 2020, 98, 103915. [Google Scholar] [CrossRef]
  81. Dong, F.; Zhang, Y.; Nie, X. Dual Discriminator Generative Adversarial Network for Video Anomaly Detection. IEEE Access 2020, 8, 88170–88176. [Google Scholar] [CrossRef]
  82. Fan, Y.; Wen, G.; Li, D.; Qiu, S.; Levine, M.D.; Xiao, F. Video anomaly detection and localization via Gaussian Mixture Fully Convolutional Variational Autoencoder. Comput. Vis. Image Underst. 2020, 195, 102920. [Google Scholar] [CrossRef] [Green Version]
  83. Nawaratne, R.; Alahakoon, D.; Silva, D.D.; Yu, X. Spatiotemporal Anomaly Detection Using Deep Learning for Real-Time Video Surveillance. IEEE Trans. Ind. Inform. 2020, 16, 393–402. [Google Scholar] [CrossRef]
  84. Wang, Z.; Yang, Z.; Zhang, Y. A promotion method for generation error-based video anomaly detection. Pattern Recognit. Lett. 2020, 140, 88–94. [Google Scholar] [CrossRef]
  85. Wu, P.; Liu, J.; Li, M.; Sun, Y.; Shen, F. Fast sparse coding networks for anomaly detection in videos. Pattern Recognit. 2020, 107, 107515. [Google Scholar] [CrossRef]
  86. Yang, F.; Yu, Z.; Chen, L.; Gu, J.; Li, Q.; Guo, B. Human-Machine Cooperative Video Anomaly Detection. Proc. ACM Hum. Comput. Interact. 2020, 4, 1–18. [Google Scholar] [CrossRef]
  87. Zahid, Y.; Tahir, M.A.; Durrani, N.M.; Bouridane, A. IBaggedFCNet: An Ensemble Framework for Anomaly Detection in Surveillance Videos. IEEE Access 2020, 8, 220620–220630. [Google Scholar] [CrossRef]
  88. Zhou, J.T.; Zhang, L.; Fang, Z.; Du, J.; Peng, X.; Xiao, Y. Attention-Driven Loss for Anomaly Detection in Video Surveillance. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4639–4647. [Google Scholar] [CrossRef]
  89. Doshi, K.; Yilmaz, Y. Continual Learning for Anomaly Detection in Surveillance Videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1025–1034. [Google Scholar]
  90. Pang, G.; Yan, C.; Shen, C.; van den Hengel, A.; Bai, X. Self-Trained Deep Ordinal Regression for End-to-End Video Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 14–19 June 2020; pp. 12170–12179. [Google Scholar]
  91. Roy, P.R.; Bilodeau, G.; Seoud, L. Local Anomaly Detection in Videos using Object-Centric Adversarial Learning. arXiv 2020, arXiv:2011.06722. [Google Scholar]
  92. Wu, C.; Shao, S.; Tunc, C.; Hariri, S. Video Anomaly Detection using Pre-Trained Deep Convolutional Neural Nets and Context Mining. In Proceedings of the International Conference on Computer Systems and Applications, AICCSA, Antalya, Turkey, 2–5 November 2020; pp. 1–8. [Google Scholar]
  93. Ji, X.; Li, B.; Zhu, Y. TAM-Net: Temporal Enhanced Appearance-to-Motion Generative Network for Video Anomaly Detection. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  94. Lu, Y.; Yu, F.; Reddy, M.K.K.; Wang, Y. Few-Shot Scene-Adaptive Anomaly Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Volume 12350, pp. 125–141. [Google Scholar]
  95. Ramachandra, B.; Jones, M.J.; Vatsavai, R.R. Learning a distance function with a Siamese network to localize anomalies in videos. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 2587–2596. [Google Scholar]
  96. Tang, Y.; Zhao, L.; Zhang, S.; Gong, C.; Li, G.; Yang, J. Integrating prediction and reconstruction for anomaly detection. Pattern Recognit. Lett. 2020, 129, 123–130. [Google Scholar] [CrossRef]
  97. Almazroey, A.A.; Jarraya, S.K. Abnormal Events and Behavior Detection in Crowd Scenes Based on Deep Learning and Neighborhood Component Analysis Feature Selection. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV), Cairo, Egypt, 8–10 April 2020; Volume 1153, pp. 258–267. [Google Scholar]
  98. Wu, P.; Liu, J.; Shen, F. A Deep One-Class Neural Network for Anomalous Event Detection in Complex Scenes. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 2609–2622. [Google Scholar] [CrossRef] [PubMed]
  99. Lee, S.; Kim, H.G.; Ro, Y.M. BMAN: Bidirectional Multi-Scale Aggregation Networks for Abnormal Event Detection. IEEE Trans. Image Process. 2020, 29, 2395–2408. [Google Scholar] [CrossRef]
  100. Prawiro, H.; Peng, J.; Pan, T.; Hu, M. Abnormal Event Detection in Surveillance Videos Using Two-Stream Decoder. In Proceedings of the International Conference on Multimedia & Expo Workshops, ICME Workshops, London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
  101. Song, H.; Sun, C.; Wu, X.; Chen, M.; Jia, Y. Learning Normal Patterns via Adversarial Attention-Based Autoencoder for Abnormal Event Detection in Videos. IEEE Trans. Multim. 2020, 22, 2138–2148. [Google Scholar] [CrossRef]
  102. Yan, S.; Smith, J.S.; Lu, W.; Zhang, B. Abnormal Event Detection From Videos Using a Two-Stream Recurrent Variational Autoencoder. IEEE Trans. Cogn. Dev. Syst. 2020, 12, 30–42. [Google Scholar] [CrossRef]
  103. Sun, C.; Jia, Y.; Song, H.; Wu, Y. Adversarial 3D Convolutional Auto-Encoder for Abnormal Event Detection in Videos. IEEE Trans. Multim. 2021, 23, 3292–3305. [Google Scholar] [CrossRef]
  104. Xia, L.; Li, Z. An abnormal event detection method based on the Riemannian manifold and LSTM network. Neurocomputing 2021, 463, 144–154. [Google Scholar] [CrossRef]
  105. Feng, X.; Song, D.; Chen, Y.; Chen, Z.; Ni, J.; Chen, H. Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection. In Proceedings of the MM ’21: ACM Multimedia Conference, Virtual Event, 20–24 October 2021; pp. 5546–5554. [Google Scholar]
  106. Zhang, Y.; Nie, X.; He, R.; Chen, M.; Yin, Y. Normality Learning in Multispace for Video Anomaly Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3694–3706. [Google Scholar] [CrossRef]
  107. Wu, R.; Li, S.; Chen, C.; Hao, A. Improving video anomaly detection performance by mining useful data from unseen video frames. Neurocomputing 2021, 462, 523–533. [Google Scholar] [CrossRef]
  108. Vu, T.; Boonaert, J.; Ambellouis, S.; Taleb-Ahmed, A. Multi-Channel Generative Framework and Supervised Learning for Anomaly Detection in Surveillance Videos. Sensors 2021, 21, 3179. [Google Scholar] [CrossRef] [PubMed]
  109. Mu, H.; Sun, R.; Yuan, G.; Shi, G. Positive unlabeled learning-based anomaly detection in videos. Int. J. Intell. Syst. 2021, 36, 3767–3788. [Google Scholar] [CrossRef]
  110. Li, B.; Leroux, S.; Simoens, P. Decoupled appearance and motion learning for efficient anomaly detection in surveillance video. Comput. Vis. Image Underst. 2021, 210, 103249. [Google Scholar] [CrossRef]
  111. Li, N.; Chang, F.; Liu, C. Spatial-Temporal Cascade Autoencoder for Video Anomaly Detection in Crowded Scenes. IEEE Trans. Multim. 2021, 23, 203–215. [Google Scholar] [CrossRef]
  112. Cai, Y.; Liu, J.; Guo, Y.; Hu, S.; Lang, S. Video anomaly detection with multi-scale feature and temporal information fusion. Neurocomputing 2021, 423, 264–273. [Google Scholar] [CrossRef]
  113. Saypadith, S.; Onoye, T. Video Anomaly Detection Based on Deep Generative Network. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 23–26 May 2021; pp. 1–5. [Google Scholar]
  114. Luo, W.; Liu, W.; Lian, D.; Tang, J.; Duan, L.; Peng, X.; Gao, S. Video Anomaly Detection with Sparse Coding Inspired Deep Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1070–1084. [Google Scholar] [CrossRef]
  115. Gutoski, M.; Ribeiro, M.; Hattori, L.T.; Aquino, N.M.R.; Lazzaretti, A.E.; Lopes, H.S. A Comparative Study of Transfer Learning Approaches for Video Anomaly Detection. Int. J. Pattern Recognit. Artif. Intell. 2021, 35, 2152003:1–2152003:27. [Google Scholar] [CrossRef]
  116. Chang, Y.; Tu, Z.; Xie, W.; Luo, B.; Zhang, S.; Sui, H.; Yuan, J. Video anomaly detection with spatio-temporal dissociation. Pattern Recognit. 2022, 122, 108213. [Google Scholar] [CrossRef]
  117. Esquivel, E.C.; Zavaleta, Z.J.G. An Examination on Autoencoder Designs for Anomaly Detection in Video Surveillance. IEEE Access 2022, 10, 6208–6217. [Google Scholar] [CrossRef]
118. Park, C.; Cho, M.; Lee, M.; Cho, S.; Lee, S. FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 2249–2259. [Google Scholar]
  119. Doshi, K.; Yilmaz, Y. A Modular and Unified Framework for Detecting and Localizing Video Anomalies. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2022; pp. 3982–3991. [Google Scholar]
  120. Li, J.; Huang, Q.; Du, Y.; Zhen, X.; Chen, S.; Shao, L. Variational Abnormal Behavior Detection With Motion Consistency. IEEE Trans. Image Process. 2022, 31, 275–286. [Google Scholar] [CrossRef] [PubMed]
  121. Hao, Y.; Li, J.; Wang, N.; Wang, X.; Gao, X. Spatiotemporal consistency-enhanced network for video anomaly detection. Pattern Recognit. 2022, 121, 108232. [Google Scholar] [CrossRef]
  122. Alafif, T.K.; Alzahrani, B.A.; Cao, Y.; Alotaibi, R.; Barnawi, A.; Chen, M. Generative adversarial network based abnormal behavior detection in massive crowd videos: A Hajj case study. J. Ambient Intell. Humaniz. Comput. 2022, 13, 4077–4088. [Google Scholar] [CrossRef]
  123. Shao, W.; Kawakami, R.; Naemura, T. Anomaly Detection Using Spatio-Temporal Context Learned by Video Clip Sorting. IEICE Trans. Inf. Syst. 2022, 105-D, 1094–1102. [Google Scholar] [CrossRef]
  124. Zou, B.; Wang, M.; Jiang, L.; Zhang, Y.; Liu, S. Surveillance Video Anomaly Detection with Feature Enhancement and Consistency Frame Prediction. In Proceedings of the International Conference on Multimedia and Expo Workshops, Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  125. Zhou, W.; Li, Y.; Zhao, C. Object-Guided and Motion-Refined Attention Network for Video Anomaly Detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  126. Hu, X.; Lian, J.; Zhang, D.; Gao, X.; Jiang, L.; Chen, W. Video anomaly detection based on 3D convolutional auto-encoder. Signal Image Video Process. 2022, 16, 1885–1893. [Google Scholar] [CrossRef]
  127. Zhang, S.; Gong, M.; Xie, Y.; Qin, A.K.; Li, H.; Gao, Y.; Ong, Y.S. Influence-Aware Attention Networks for Anomaly Detection in Surveillance Videos. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5427–5437. [Google Scholar] [CrossRef]
  128. Wang, L.; Tan, H.; Zhou, F.; Zuo, W.; Sun, P. Unsupervised Anomaly Video Detection via a Double-Flow ConvLSTM Variational Autoencoder. IEEE Access 2022, 10, 44278–44289. [Google Scholar] [CrossRef]
  129. Liu, Y.; Liu, J.; Lin, J.; Zhao, M.; Song, L. Appearance-Motion United Auto-Encoder Framework for Video Anomaly Detection. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2498–2502. [Google Scholar] [CrossRef]
  130. Feng, J.; Wang, D.; Zhang, L. Crowd Anomaly Detection via Spatial Constraints and Meaningful Perturbation. ISPRS Int. J. Geo Inf. 2022, 11, 205. [Google Scholar] [CrossRef]
  131. Cho, M.; Kim, T.; Kim, W.J.; Cho, S.; Lee, S. Unsupervised video anomaly detection via normalizing flows with implicit latent features. Pattern Recognit. 2022, 129, 108703. [Google Scholar] [CrossRef]
  132. Park, C.; Lee, M.; Cho, M.; Lee, S. RandomSEMO: Normality Learning Of Moving Objects For Video Anomaly Detection. arXiv 2022, arXiv:2202.06256. [Google Scholar]
  133. Le, V.T.; Kim, Y.G. Attention-based residual autoencoder for video anomaly detection. Appl. Intell. 2023, 53, 3240–3254. [Google Scholar] [CrossRef]
  134. Sharif, M.H. Laser-Based Algorithms Meeting Privacy in Surveillance: A Survey. IEEE Access 2021, 9, 92394–92419. [Google Scholar] [CrossRef]
  135. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  136. Iman, R.; Davenport, J. Approximations of the critical region of the Friedman statistic. Commun. Stat. Theor. M. 1980, 18, 571–595. [Google Scholar] [CrossRef]
  137. Kusetogullari, H.; Sharif, M.H.; Leeson, M.S.; Celik, T. A Reduced Uncertainty-Based Hybrid Evolutionary Algorithm for Solving Dynamic Shortest-Path Routing Problem. J. Circuits, Syst. Comput. 2015, 24, 1550067. [Google Scholar] [CrossRef]
138. Hodges, J.; Lehmann, E. Rank methods for combination of independent experiments in analysis of variance. Ann. Math. Stat. 1962, 33, 482–497. [Google Scholar] [CrossRef]
  139. Quade, D. Using weighted rankings in the analysis of complete blocks with additive block effects. J. Am. Stat. Assoc. 1979, 74, 680–683. [Google Scholar] [CrossRef]
  140. Westfall, P.; Young, S. Resampling-based Multiple Testing: Examples and Methods for p-Value Adjustment; John Wiley and Sons: Hoboken, NJ, USA, 2004. [Google Scholar]
  141. Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963. [Google Scholar]
  142. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lucic, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6816–6826. [Google Scholar]
  143. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Figure 1. Generalized architecture (rpNet) of our proposed anomaly detection framework.
Figure 2. Our proposed aU-Net.
Figure 3. Two reconstruction networks: (a) convolutional autoencoder (ConvAE) and (b) ConvAE with skip connection (AEc).
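For readers who want a concrete picture of the two reconstruction-network options in Figure 3, the following Keras sketch builds a plain ConvAE and its skip-connected variant AEc. The layer counts, filter widths, and the 256 × 256 input resolution are illustrative assumptions, not the exact configuration reported in this paper.

```python
# Illustrative Keras sketch of the two reconstruction networks in Figure 3:
# (a) a plain convolutional autoencoder (ConvAE) and (b) the same encoder-decoder
# with symmetric skip connections (AEc).  Layer counts, filter widths, and the
# 256x256 input size are assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_reconstruction_net(input_shape=(256, 256, 1), skip=False):
    x_in = layers.Input(shape=input_shape)
    # Encoder: strided convolutions halve the spatial resolution at every stage.
    e1 = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x_in)
    e2 = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(e1)
    e3 = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(e2)
    # Decoder: transposed convolutions restore the original resolution.
    d1 = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(e3)
    if skip:                                   # AEc: concatenate encoder features
        d1 = layers.Concatenate()([d1, e2])
    d2 = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(d1)
    if skip:
        d2 = layers.Concatenate()([d2, e1])
    x_out = layers.Conv2DTranspose(input_shape[-1], 3, strides=2, padding="same",
                                   activation="sigmoid")(d2)
    return Model(x_in, x_out, name="AEc" if skip else "ConvAE")

conv_ae = build_reconstruction_net(skip=False)   # Figure 3a
ae_c = build_reconstruction_net(skip=True)       # Figure 3b
conv_ae.compile(optimizer="adam", loss="mse")    # trained on normal frames only
```

The only structural difference between the two variants is the pair of concatenations, which feed encoder features directly to the decoder as in [61].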
Figure 4. In non-local means filtering, similar pixel neighborhoods receive larger weights.
Figure 5. A space-time non-local block.
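The block in Figure 5 follows the non-local operation of [11]: every position attends to every other position through an embedded-Gaussian similarity. The sketch below implements a per-frame (2D) variant in TensorFlow for brevity; a space-time block flattens T · H · W positions instead of H · W, and the channel-reduction factor is an assumption.

```python
# Illustrative TensorFlow sketch of an embedded-Gaussian non-local block [11].
# For brevity this is a per-frame (2D) variant; a space-time block flattens
# T*H*W positions instead of H*W.  The channel-reduction factor is an assumption.
import tensorflow as tf
from tensorflow.keras import layers

class NonLocalBlock2D(layers.Layer):
    def __init__(self, channels, reduction=2):
        super().__init__()
        self.inter = max(channels // reduction, 1)
        self.theta = layers.Conv2D(self.inter, 1)   # query embedding
        self.phi = layers.Conv2D(self.inter, 1)     # key embedding
        self.g = layers.Conv2D(self.inter, 1)       # value embedding
        self.out = layers.Conv2D(channels, 1)       # restore channel count

    def call(self, x):
        b, h, w = tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2]
        theta = tf.reshape(self.theta(x), (b, -1, self.inter))
        phi = tf.reshape(self.phi(x), (b, -1, self.inter))
        g = tf.reshape(self.g(x), (b, -1, self.inter))
        # Similarity between every pair of positions (softmax over the last axis).
        attn = tf.nn.softmax(tf.matmul(theta, phi, transpose_b=True), axis=-1)
        y = tf.matmul(attn, g)                      # aggregate values
        y = tf.reshape(y, (b, h, w, self.inter))
        return x + self.out(y)                      # residual connection

features = tf.random.normal((2, 32, 32, 64))        # toy feature map
refined = NonLocalBlock2D(channels=64)(features)
```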
Figure 6. Our adopted non-local block U-Net.
Figure 7. (a,b) show a camera-view frame and its RoI marked in red, respectively.
Figure 8. Sample camera view frames of UCSD-Ped2 [31].
Figure 9. Sample output of Algorithm 1 using the frames in Figure 8, $d_{eAtt} = 60$, and $\Gamma = 0.195$.
Figure 10. Simplified structure of rpNet.
Figure 11. Simple simulation showing that the rpNet can guarantee a larger error gap.
Figure 12. Performance comparison of rNet, pNet, and rpNet considering the data in Table 2.
Figure 13. A sample output using UCSD-Ped1 [31], where a car anomaly occurred. (a,b) exhibit PSNR scores, whereas (c,d) show frame-level scores.
Figure 14. A sample output using UCSD-Ped2 [31], where a bicycle anomaly occurred. (a,b) exhibit PSNR scores, whereas (c,d) show frame-level scores.
Figure 15. A sample output using CUHK-Avenue [32], where a running-person anomaly occurred. (a,b) exhibit PSNR scores, whereas (c,d) show frame-level scores.
Figure 16. A sample output using S.T.-Campus [18], where a bicycle anomaly occurred. (a,b) exhibit PSNR scores, whereas (c,d) show frame-level scores.
Figure 17. A sample output using UMN [36], where a sudden crowd panic and run occurred. (a,b) exhibit PSNR scores, whereas (c,d) show frame-level scores.
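Panels (a,b) of Figures 13–17 plot PSNR between each test frame and its network output, and panels (c,d) plot the derived frame-level scores. The sketch below shows the usual PSNR-to-score mapping adopted by PSNR-based detectors such as [2]; the per-video min–max normalization is a common convention and may differ in detail from the exact mapping used here.

```python
# PSNR between a frame and its network output, and the usual mapping of a PSNR
# curve to normalized frame-level scores; a common convention (cf. [2]), not
# necessarily the exact mapping used in this paper.
import numpy as np

def psnr(frame_true, frame_out, peak=1.0):
    mse = np.mean((frame_true.astype(np.float64) - frame_out.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-12))

def frame_level_scores(psnr_per_frame):
    """Min-max normalize one test video; frames near 0 are the most anomalous."""
    p = np.asarray(psnr_per_frame, dtype=np.float64)
    return (p - p.min()) / (p.max() - p.min() + 1e-12)

scores = frame_level_scores([34.1, 33.8, 25.2, 24.9, 33.5])
abnormal_frames = np.where(scores < 0.5)[0]   # simple threshold, for illustration
```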
Figure 18. Nemenyi [141] post hoc critical distance diagram for $\alpha = 0.05$ using fAUC scores in Table 7 for $G_2$.
Figure 19. Nemenyi [141] post hoc critical distance diagram for $\alpha = 0.10$ using fAUC scores in Table 8 for $G_3$.
Figure 20. Nemenyi [141] post hoc critical distance diagram for $\alpha = 0.05$ using fAUC scores in Table 9 for $G_4$.
Figure 21. Nemenyi [141] post hoc critical distance diagram for $\alpha = 0.10$ using fAUC scores in Table 10 for $G_5$.
Figure 22. Nemenyi [141] post hoc critical distance diagram for $\alpha = 0.10$ using fAUC scores in Table 11 for $G_6$.
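Figures 18–22 summarize Friedman tests [135] followed by Nemenyi [141] post hoc analysis: the competing models are ranked per dataset, and two models differ significantly when their average ranks are farther apart than the critical distance $CD = q_\alpha \sqrt{k(k+1)/(6N)}$. A minimal sketch is given below; the fAUC matrix is an illustrative placeholder (not the actual groups of Tables 7–11), and $q_\alpha$ must be read from a Studentized-range table (for example, approximately 2.728 for k = 5 methods at α = 0.05).

```python
# Sketch of the statistics behind Figures 18-22: rank the models per dataset,
# run the Friedman test [135], and compute the Nemenyi [141] critical distance
# CD = q_alpha * sqrt(k*(k+1) / (6*N)).  The fAUC matrix is a placeholder, and
# q_alpha comes from a Studentized-range table.
import numpy as np
from scipy import stats

fauc = np.array([          # rows: datasets (N), columns: competing models (k)
    [0.848, 0.862, 0.872, 0.888, 0.918],
    [0.902, 0.934, 0.957, 0.971, 0.989],
    [0.825, 0.863, 0.871, 0.874, 0.916],
])
n_datasets, k_models = fauc.shape

# Friedman test on the fAUC scores (higher is better, so rank descending).
chi2, p_value = stats.friedmanchisquare(*[fauc[:, j] for j in range(k_models)])
avg_ranks = stats.rankdata(-fauc, axis=1).mean(axis=0)

def nemenyi_cd(q_alpha, k, n):
    """Critical distance: average-rank gaps larger than CD are significant."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

print(p_value, avg_ranks, nemenyi_cd(2.728, k_models, n_datasets))
```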
Table 1. A qualitative comparison of the most related works. MSE: Mean Square Error, PSNR: Peak Signal to Noise Ratio.
Reference | Reconstruction Network | Prediction Network Generator | Employed Optical Flow | Score | Used Crowd Dataset
Liu et al. [2] | Not applicable | U-Net [7] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32]
Nguyen et al. [27] | ConvAE | ConvAE + U-Net [7] with skip connections | FlowNet2 [33] | MSE | Ped2 [31,32], etc.
Zhong et al. [3] | Traditional AE | SE module [34] | Output of SE module [34] | MSE | Ped1 [31], Ped2 [18,31,32]
Liu et al. [28] | AE | U-Net [7] + HDC [29] + DB-ConvLSTM [30] | Difference of RGB [35] | PSNR | Ped1 [31], Ped2 [31,32]
Zhang et al. [13] | Not applicable | U-Net [7] + Non-local block [11] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32]
AE-Unet (Ours) | ConvAE | U-Net [7] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
AEcUnet (Ours) | ConvAE with skip connection | U-Net [7] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
AEnUnet (Ours) | ConvAE | U-Net [7] + Non-local block [11] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
AEcnUnet (Ours) | ConvAE with skip connection | U-Net [7] + Non-local block [11] | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
AEaUnet (Ours) | ConvAE | U-Net [7] + Proposed attention block + Proposed Motion Saliency Map | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
AEcaUnet (Ours) | ConvAE with skip connection | U-Net [7] + Proposed attention block + Proposed Motion Saliency Map | Flownet [9] | PSNR | Ped1 [31], Ped2 [18,31,32,36]
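Table 1 shows that each of our variants couples a reconstruction branch with a prediction branch. The snippet below is only a schematic of how two such branches could be fused into a single frame-level error at test time, assuming a simple weighted sum with hypothetical weights w_r and w_p; the actual fusion inside rpNet is defined in the paper.

```python
# Schematic fusion of a reconstruction branch (rNet) and a prediction branch
# (pNet) into one frame-level score.  The weighted-sum rule and the weights
# w_r, w_p are illustrative assumptions, not the paper's exact fusion.
import numpy as np

def fused_anomaly_score(frame_t, past_frames, r_net, p_net, w_r=0.5, w_p=0.5):
    """Higher output => more anomalous. r_net reconstructs frame_t, p_net predicts it."""
    recon = r_net(frame_t)                   # reconstruction of the current frame
    pred = p_net(past_frames)                # prediction from preceding frames
    e_r = np.mean((frame_t - recon) ** 2)    # reconstruction error (MSE)
    e_p = np.mean((frame_t - pred) ** 2)     # prediction error (MSE)
    return w_r * e_r + w_p * e_p             # fused error; larger gap for anomalies

# Dummy usage with identity "networks" just to show the calling convention.
frame = np.random.rand(256, 256)
history = np.random.rand(4, 256, 256)
score = fused_anomaly_score(frame, history, lambda x: x, lambda h: h[-1])
```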
Table 2. Qualitative and quantitative analysis of the simulated normal and abnormal video events in Figure 11.
Measures | Walk, Gather, Evacuate (rNet/pNet/rpNet) | Walk, Sudden Split (rNet/pNet/rpNet) | Opposite Flow (rNet/pNet/rpNet) | Sudden Run (rNet/pNet/rpNet)
Ground truth frame start ($g_s$) | 305/305/305 | 629/629/629 | 907/907/907 | 974/974/974
Ground truth frame end ($g_e$) | 328/328/328 | 685/685/685 | 948/948/948 | 1000/1000/1000
First detected abnormal frame ($f_d$) | 317/311/302 | 645/636/625 | 928/915/903 | 982/978/975
Last detected abnormal frame ($l_d$) | 325/326/328 | 657/669/690 | 928/943/950 | 988/991/1000
Number of false positive frames ($f_p$) | 0/0/3 | 0/0/4 | 0/0/4 | 0/0/1
Number of true positive frames ($t_p$) | 8/15/26 | 12/33/65 | 1/28/47 | 6/13/25
Number of false negative frames ($f_n$) | 15/8/3 | 44/23/9 | 41/13/6 | 20/13/1
Number of true negative frames ($t_n$) | 477/477/468 | 204/204/182 | 158/159/143 | 14/14/13
Sum ($T_f = t_p + t_n + f_p + f_n$) | 500/500/500 | 260/260/260 | 200/200/200 | 40/40/40
Recall rate ($t_p/(t_p + f_n)$) | 0.348/0.652/0.896 | 0.214/0.589/0.878 | 0.024/0.683/0.887 | 0.231/0.500/0.961
Specificity ($t_n/(t_n + f_p)$) | 1/1/0.994 | 1/1/0.978 | 1/1/0.973 | 1/1/0.929
False positive rate (1 − Specificity) | 0/0/0.006 | 0/0/0.021 | 0/0/0.027 | 0/0/0.071
Precision rate ($t_p/(t_p + f_p)$) | 1/1/0.897 | 1/1/0.942 | 1/1/0.922 | 1/1/0.961
Accuracy ($ACC = (t_p + t_n)/T_f$) | 0.970/0.984/0.988 | 0.831/0.911/0.950 | 0.795/0.935/0.950 | 0.500/0.675/0.950
RMSE at rising edge ($\Gamma_r$) for rNet: $\Gamma_r = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\big(g_s(i) - f_d(i)\big)^2} = \sqrt{\frac{905}{4}} \approx 15.0416$
CV($\Gamma_r$) at rising edge for rNet: $CV(\Gamma_r) = \Gamma_r \big/ \sum_{i=1}^{4} g_s(i) = \frac{15.0416}{2815} \approx 0.0053$
RMSE at falling edge ($\Gamma_f$) for rNet: $\Gamma_f = \sqrt{\frac{1}{4}\sum_{i=1}^{4}\big(g_e(i) - l_d(i)\big)^2} = \sqrt{\frac{1337}{4}} \approx 18.2825$
CV($\Gamma_f$) at falling edge for rNet: $CV(\Gamma_f) = \Gamma_f \big/ \sum_{i=1}^{4} g_e(i) = \frac{18.2825}{2961} \approx 0.0062$
$\Gamma_r$ and CV($\Gamma_r$) for pNet: $\sqrt{\frac{165}{4}} \approx 6.4226$ and $\frac{6.4226}{2815} \approx 0.0023$
$\Gamma_f$ and CV($\Gamma_f$) for pNet: $\sqrt{\frac{366}{4}} \approx 9.5656$ and $\frac{9.5656}{2961} \approx 0.0032$
$\Gamma_r$, CV($\Gamma_r$), $\Gamma_f$, CV($\Gamma_f$) for rpNet: $\sqrt{\frac{42}{4}} \approx 3.2404$, $\frac{3.2404}{2815} \approx 0.0012$, $\sqrt{\frac{29}{4}} \approx 2.6926$, $\frac{2.6926}{2961} \approx 0.0009$
ROC curve analysis: AUC ≈ 0.6739 for rNet, AUC ≈ 0.8415 for pNet, AUC ≈ 0.9731 for rpNet
Mean ACC gain obtained by rpNet: rpNet was $\frac{0.9595}{0.7740} - 1 = 23.97\%$, $\frac{0.9595}{0.8762} - 1 = 9.51\%$, and $16.74\%$ more accurate than rNet, pNet, and their mean ACC, respectively.
AUC gain obtained by rpNet: rpNet performed $\frac{0.9731}{0.6739} - 1 = 44.40\%$, $\frac{0.9731}{0.8415} - 1 = 15.64\%$, and $30.02\%$ better than rNet, pNet, and their mean AUC, respectively.
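All derived rows of Table 2 follow directly from the listed frame counts, so they are easy to re-check. The snippet below reproduces the rNet recall and accuracy for the first scenario and the rising-edge RMSE and CV across the four scenarios, restating the same definitions used in the table.

```python
# Reproducing a few Table 2 quantities for rNet from the raw counts in the table.
import numpy as np

# "Walk, Gather, Evacuate" scenario, rNet column.
tp, tn, fp, fn = 8, 477, 0, 15
recall = tp / (tp + fn)                          # 0.348
specificity = tn / (tn + fp)                     # 1.0
precision = tp / (tp + fp)                       # 1.0
accuracy = (tp + tn) / (tp + tn + fp + fn)       # 0.970

# RMSE and coefficient of variation at the rising edge for rNet (all four scenarios).
g_s = np.array([305, 629, 907, 974])             # ground-truth start frames
f_d = np.array([317, 645, 928, 982])             # first detected abnormal frames (rNet)
gamma_r = np.sqrt(np.mean((g_s - f_d) ** 2))     # sqrt(905/4) ~= 15.0416
cv_r = gamma_r / g_s.sum()                       # 15.0416/2815 ~= 0.0053
print(round(recall, 3), round(accuracy, 3), round(gamma_r, 4), round(cv_r, 4))
```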
Table 3. Comparison of various specifications of crowd datasets and their available web links. H: height, W: width.
Dataset | Source | Scenes | Duration | Anomalies | Training videos | Testing videos | Total videos | H × W of frame | Annotation using | Annotation level | Annotation count | Training frames | Testing frames | Total frames | Anomaly events | Dataset link
UCSD-Ped1 [31] (2008) | Using 1st outdoor surveillance camera. | 5 | 5 min | 40 | 34 | 36 | 70 | 158 × 238 | Human | Pixel | NA | 6800 | 7200 | 14,000 | Circulation of non-pedestrian entities in the walkways and anomalous pedestrian motion patterns. | http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm (accessed on 2 January 2023)
UCSD-Ped2 [31] (2008) | Using 2nd outdoor surveillance camera. | 5 | 5 min | 12 | 16 | 12 | 28 | 240 × 360 | Human | Pixel | NA | 2550 | 2010 | 4560 | Circulation of non-pedestrian entities in the walkways and anomalous pedestrian motion patterns. | http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm (accessed on 2 January 2023)
UMN [36] (2009) | Synthetic dataset [76,77]. | 3 | 4.3 min | 11 | Unavailable | Unavailable | 11 | 480 × 640, 240 × 320 | Software | Temporal | Unavailable | Unavailable | Unavailable | 7725 | In each video crowd walks randomly. Everyone starts running suddenly, which is marked as an anomaly. | http://mha.cs.umn.edu/proj_events.shtml#crowd (accessed on 2 January 2023)
CUHK-Avenue [32] (2013) | Captured in CUHK campus avenue. | 5 | 30 min | 47 | 16 | 21 | 37 | 360 × 640 | Human | Pixel, Frame | NA | 15,328 | 15,324 | 30,652 | Strange actions, wrong directions, unexpected objects. The ground truth of anomalous objects is marked by a rectangle. | www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html (accessed on 2 January 2023)
Shanghai Tech Campus [18] (2017) | University campus, surveillance cameras. | 13 | NA | 130 | 238 | 199 | 437 | 2048 × 2048 | Human | Pixel | NA | 274,515 | 42,883 | 317,398 | Anomalies caused by sudden motion, including chasing and brawling. | https://svip-lab.github.io/dataset/campus_dataset.html (accessed on 2 January 2023)
Table 4. List of parameters and their used values.
Dataset | $\lambda_{int}^{p}$ | $\lambda_{gd}^{p}$ | $\lambda_{int}^{r}$ | $\lambda_{gd}^{r}$ | $\lambda_{mot}$ | $\lambda_{adg}$ | Mean $\Gamma$ | $d_{eAtt}$ | $\omega_1$ | $\omega_2$ | $\sigma_1$ | $\sigma_2$ | $\nu$ | $\lambda$ | $\eta$ | Training ratio | Testing ratio
Ped1 [31] | 1.05 | 1.03 | 1.05 | 1.05 | 1.90 | 0.05 | 0.357 | 50 | 0.931 | 0.869 | 1 | 1 | 0.515 | 1.615 | 2 | 49% | 51%
Ped2 [31] | 1.05 | 1.10 | 1.06 | 1.05 | 1.85 | 0.05 | 0.195 | 60 | 0.926 | 0.851 | 1 | 1 | 0.715 | 1.515 | 2 | 56% | 44%
Avenue [32] | 1.09 | 1.19 | 1.02 | 1.12 | 2.13 | 0.05 | 0.114 | 45 | 0.902 | 0.813 | 1 | 1 | 0.505 | 1.365 | 2 | 50% | 50%
Campus [18] | 1.07 | 1.04 | 1.05 | 1.05 | 2.19 | 0.05 | 0.428 | 55 | 0.942 | 0.877 | 1 | 1 | 0.355 | 1.125 | 2 | 85% | 15%
UMN [36] | 1.02 | 1.05 | 1.08 | 1.10 | 2.07 | 0.05 | 0.126 | 75 | 0.945 | 0.863 | 1 | 1 | 0.605 | 1.450 | 3 | 60% | 40%
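Table 4 lists per-dataset weighting factors. As an illustration only, the sketch below shows how such λ weights typically enter a combined training objective; the association of λ_int, λ_gd, λ_mot, and λ_adg with intensity, gradient, motion, and adversarial terms is an assumption borrowed from common frame-prediction objectives (e.g., [2]), and the paper defines its own loss terms earlier in the text.

```python
# Illustration only: how per-dataset weights like those in Table 4 could scale
# individual loss terms in a combined objective.  The mapping of the lambdas to
# intensity, gradient, motion, and adversarial losses is an assumption, not the
# paper's definition.
import tensorflow as tf

WEIGHTS = {  # prediction-branch weights copied from Table 4 (subset)
    "Ped2":   dict(l_int=1.05, l_gd=1.10, l_mot=1.85, l_adg=0.05),
    "Avenue": dict(l_int=1.09, l_gd=1.19, l_mot=2.13, l_adg=0.05),
}

def gradient_l1(a, b):
    """L1 distance between the image gradients of two 4-D frame tensors."""
    dy_a, dx_a = tf.image.image_gradients(a)
    dy_b, dx_b = tf.image.image_gradients(b)
    return tf.reduce_mean(tf.abs(dy_a - dy_b)) + tf.reduce_mean(tf.abs(dx_a - dx_b))

def combined_loss(pred, target, flow_pred, flow_true, d_fake, w):
    """pred/target: [B, H, W, C] frames; d_fake: discriminator output on pred."""
    l_int = tf.reduce_mean(tf.square(pred - target))        # intensity (MSE)
    l_gd = gradient_l1(pred, target)                        # gradient difference
    l_mot = tf.reduce_mean(tf.abs(flow_pred - flow_true))   # optical-flow (motion)
    l_adg = tf.reduce_mean(tf.square(d_fake - 1.0))         # adversarial (LSGAN-style)
    return (w["l_int"] * l_int + w["l_gd"] * l_gd
            + w["l_mot"] * l_mot + w["l_adg"] * l_adg)
```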
Table 5. Frame-level AUC score comparison of miscellaneous methods and datasets. Column-wise the best numerical result is shown in bold.
Year | Models | Ped1 [31] | Ped2 [31] | Avenue [32] | Campus [18] | UMN [36]
Before 2020 | Liu et al. [2] | 0.831 | 0.954 | 0.849 | 0.728 | -
Before 2020 | Hasan et al. [1] | 0.750 | 0.850 | 0.800 | 0.609 | -
Before 2020 | LuoLG [78] | 0.755 | 0.881 | 0.770 | - | -
Before 2020 | Luo et al. [18] | - | 0.922 | 0.817 | 0.680 | -
Before 2020 | Nguyen et al. [27] | - | 0.962 | 0.869 | - | -
Before 2020 | Ionescu et al. [22] | 0.684 | 0.822 | 0.806 | - | -
2020 | WangCYJT [79] | 0.834 | 0.963 | 0.883 | 0.766 | -
2020 | Chen et al. [80] | 0.872 | 0.965 | 0.873 | - | -
2020 | Dong et al. [81] | - | 0.956 | 0.849 | 0.737 | -
2020 | Fan et al. [82] | 0.949 | 0.922 | 0.834 | - | -
2020 | Nawaratne et al. [83] | 0.752 | 0.911 | 0.768 | - | -
2020 | Wang et al. [84] | 0.867 | 0.991 | 0.899 | - | -
2020 | WuLLSS [85] | 0.824 | 0.928 | 0.855 | - | -
2020 | Yang et al. [86] | 0.935 | 0.937 | 0.832 | - | -
2020 | Zahid et al. [87] | 0.585 | 0.789 | 0.750 | 0.940 | -
2020 | Zhou et al. [88] | 0.839 | 0.960 | 0.860 | - | -
2020 | Doshi et al. [89] | - | 0.978 | 0.864 | 0.716 | -
2020 | Pang et al. [90] | 0.720 | 0.830 | - | - | 0.993
2020 | Roy et al. [91] | 0.850 | 0.975 | 0.870 | 0.810 | 0.997
2020 | Wu et al. [92] | 0.840 | 0.924 | - | - | 0.993
2020 | Ji et al. [93] | 0.840 | 0.980 | 0.780 | - | -
2020 | Lu et al. [94] | 0.863 | 0.962 | 0.858 | 0.779 | -
2020 | Ramachandra et al. [95] | 0.860 | 0.940 | 0.872 | - | -
2020 | Tang et al. [96] | 0.830 | 0.960 | 0.840 | 0.72 | -
2020 | Almazroey et al. [97] | 0.937 | 0.833 | 0.875 | - | -
2020 | Wu0S [98] | 0.830 | 0.960 | 0.870 | - | 0.890
2020 | Lee et al. [99] | - | 0.966 | 0.900 | 0.762 | 0.996
2020 | Prawiro et al. [100] | 0.840 | 0.960 | 0.860 | - | -
2020 | Song et al. [101] | 0.905 | 0.907 | 0.892 | 0.700 | -
2020 | Yan et al. [102] | 0.750 | 0.910 | 0.796 | - | -
2021 | Sun et al. [103] | 0.902 | 0.910 | 0.889 | 0.922 | -
2021 | Xia et al. [104] | 0.880 | 0.966 | 0.922 | - | 0.970
2021 | Feng et al. [105] | - | 0.970 | 0.860 | 0.777 | -
2021 | Zhang et al. [106] | - | 0.954 | 0.868 | 0.736 | -
2021 | Wu et al. [107] | 0.885 | 0.988 | 0.847 | 0.728 | -
2021 | Vu et al. [108] | 0.850 | 0.960 | 0.920 | 0.937 | -
2021 | Mu et al. [109] | 0.952 | 0.947 | 0.897 | 0.921 | -
2021 | LiLS [110] | 0.853 | 0.955 | 0.891 | 0.740 | -
2021 | LiCL [111] | 0.905 | 0.929 | 0.835 | - | 0.980
2021 | Cai et al. [112] | - | 0.968 | 0.873 | 0.742 | -
2021 | Saypadith et al. [113] | 0.853 | 0.957 | 0.868 | 0.730 | -
2021 | Doshi et al. [26] | - | 0.972 | 0.864 | 0.709 | -
2021 | Luo et al. [114] | - | 0.922 | 0.835 | 0.696 | -
2021 | Gutoski et al. [115] | 0.719 | 0.893 | 0.847 | - | 0.992
2022 | Zhong et al. [3] | 0.826 | 0.977 | 0.889 | 0.707 | -
2022 | Chang et al. [116] | - | 0.967 | 0.871 | 0.737 | -
2022 | Esquivel et al. [117] | 0.710 | 0.870 | 0.830 | 0.870 | -
2022 | Park et al. [118] | - | 0.960 | 0.850 | 0.720 | -
2022 | Doshi et al. [119] | - | 0.970 | 0.887 | 0.736 | -
2022 | Li et al. [120] | 0.812 | 0.971 | 0.866 | 0.782 | -
2022 | Hao et al. [121] | 0.825 | 0.969 | 0.866 | 0.738 | -
2022 | Zhang et al. [13] | 0.836 | 0.959 | 0.852 | 0.727 | -
2022 | Alafif et al. [122] | 0.828 | 0.957 | - | - | 0.981
2022 | Shao et al. [123] | 0.776 | 0.949 | 0.853 | 0.717 | -
2022 | Zou et al. [124] | - | 0.973 | 0.872 | 0.727 | -
2022 | Zhou et al. [125] | - | 0.974 | 0.926 | 0.749 | -
2022 | Hu et al. [126] | 0.807 | 0.853 | 0.810 | - | -
2022 | Zhang et al. [127] | 0.942 | 0.929 | 0.805 | 0.803 | 0.988
2022 | Wang et al. [128] | 0.880 | 0.890 | 0.870 | - | -
2022 | Liu et al. [129] | - | 0.981 | 0.898 | 0.738 | -
2022 | Feng et al. [130] | 0.836 | 0.908 | 0.813 | - | -
2022 | Cho et al. [131] | - | 0.992 | 0.880 | 0.763 | -
2022 | ParkLCL [132] | - | 0.958 | 0.854 | 0.724 | -
2022 | Le et al. [133] | - | 0.974 | 0.867 | 0.736 | -
2022 | Liu et al. [28] | 0.851 | 0.966 | 0.865 | - | -
2023 | AE-Unet (Ours) | 0.848 | 0.902 | 0.825 | 0.734 | 0.930
2023 | AEcUnet (Ours) | 0.862 | 0.934 | 0.863 | 0.761 | 0.965
2023 | AEnUnet (Ours) | 0.872 | 0.957 | 0.871 | 0.774 | 0.977
2023 | AEcnUnet (Ours) | 0.888 | 0.971 | 0.874 | 0.782 | 0.976
2023 | AEaUnet (Ours) | 0.875 | 0.969 | 0.887 | 0.780 | 0.980
2023 | AEcaUnet (Ours) | 0.918 | 0.989 | 0.916 | 0.798 | 0.987
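The numbers in Table 5 are frame-level AUC (fAUC) scores. A minimal way to compute fAUC from per-frame anomaly scores and ground-truth labels is sketched below; concatenating all test frames before computing the ROC area is one common convention, and some works instead average per-video AUCs.

```python
# Frame-level AUC (fAUC), the metric compared in Table 5: pool the frame-level
# scores and ground-truth labels of all test videos and compute the ROC area.
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores_per_video, labels_per_video):
    """scores: higher = more normal, so the sign is flipped; labels: 1 = abnormal frame."""
    scores = np.concatenate(scores_per_video)
    labels = np.concatenate(labels_per_video)
    return roc_auc_score(labels, -scores)   # anomalous frames should score lower

# Toy example with two short test videos.
auc = frame_level_auc(
    [np.array([0.9, 0.8, 0.2, 0.1]), np.array([0.7, 0.3])],
    [np.array([0, 0, 1, 1]), np.array([0, 1])],
)
```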
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
