Article

Application of Improved Asynchronous Advantage Actor Critic Reinforcement Learning Model on Anomaly Detection

1
School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2
Institute for Computer Application, China Academy of Engineering Physics, Mianyang 621900, China
*
Authors to whom correspondence should be addressed.
Entropy 2021, 23(3), 274; https://doi.org/10.3390/e23030274
Submission received: 10 February 2021 / Revised: 21 February 2021 / Accepted: 22 February 2021 / Published: 25 February 2021
(This article belongs to the Section Multidisciplinary Applications)

Abstract

Anomaly detection has traditionally been studied with mathematical and statistical methods and has been widely applied in many fields. Recently, reinforcement learning has achieved exceptional successes in areas such as the game of Go and video games. However, little research has applied reinforcement learning to anomaly detection. This paper therefore proposes an adaptable asynchronous advantage actor-critic (A3C) reinforcement learning model for this field. Its performance was evaluated and compared against classical machine learning models and the generative adversarial model with variants. The basic principles of the related models are introduced first; then the problem definitions, modelling process and testing are detailed. The proposed model handles sequential and image anomalies differently, adopting an attention-based neural network for the former and a convolutional network for the latter. Finally, performance was evaluated and compared with classical models on public benchmark datasets (NSL-KDD, AWID, CICIDS-2017 and DoHBrw-2020). Experiments confirmed the effectiveness of the proposed model, with higher rewards and lower loss rates on the datasets during training and testing. The metrics of precision, recall and F1 score were higher than, or at least comparable to, those of state-of-the-art models. We conclude that the proposed model can outperform, or at least match, existing anomaly detection models.

1. Introduction

Anomalies, also called outliers, exceptions or peculiarities, are patterns in the data that do not conform to expected behavior. Anomalies are classified as point, contextual or collective, based on a single data point, its context, or relationships among a collection of data, respectively. Anomaly detection [1] is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. It has proven critical to many applications, such as network intrusion detection for recognizing potential cyber-attacks, credit card fraud detection, medical monitoring where electrocardiography or other biological data are tracked to assess a patient's condition, and video surveillance for identifying suspicious movements.

1.1. Traditional Methods for Anomaly Detection

Anomaly detection research was once conducted by statisticians using mathematical methods. Numerous algorithms, such as extreme-value, probabilistic and statistical, linear, spectral and proximity-based methods, were proposed in the past and achieved good performance. What these methods have in common is that they quantify a data point's deviation from normal patterns with a numerical score. In probabilistic modelling the outlier score is the likelihood fit of a data point to the model, and in proximity-based modelling it is a density value. In linear modelling the outlier score is the residual distance of a data point to a lower-dimensional representation of the data. In temporal modelling, a function of the distance from previous data points (or the deviation from a forecasted value) is used as the outlier score. Typical models, as listed in Table 1, include the one-class Support Vector Machine (SVM), Isolation Forest (IF) and Local Outlier Factor (LOF); a minimal usage sketch of these classical detectors follows the table.
Table 1. Anomaly detection classification.
| Classification | Representative Model | Type |
| --- | --- | --- |
| Classifier based | One-Class SVM [2], unsupervised NN | Unsupervised Machine Learning |
| Nearest neighbors | KNN [3], LOF [4], COF [5], … | Unsupervised Machine Learning |
| Clustering based | CBLOF [5], LDCOF [5], … | Unsupervised Machine Learning |
| Statistical based | HBOS [5], … | Unsupervised Machine Learning |
| Subspace based | rPCA [5], … | Unsupervised Machine Learning |
| Ensembles and combination | Isolation Forest [6], FeatureBagging [7] | Ensembles |
| Neural networks | Auto-encoder with FNN, LSTM [8], … | Deep Neural Network |
| Generative model | GAN [9], … | Deep Neural Network |
| Others | Information theory based, spectral decomposition based, visualization based, Reinforcement Learning | Others |
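To make the classical baselines in Table 1 concrete, the sketch below fits three of them with scikit-learn on a synthetic feature matrix; the data, hyperparameters and contamination rate are illustrative assumptions, not settings used in this paper.

```python
# Minimal sketch (not from the paper): fitting three classical detectors from
# Table 1 with scikit-learn on a synthetic feature matrix X.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 5)),      # normal points
               rng.normal(6, 1, (10, 5))])      # injected outliers

detectors = {
    "One-Class SVM": OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
    "Isolation Forest": IsolationForest(contamination=0.05, random_state=0),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
}

for name, det in detectors.items():
    # fit_predict returns +1 for inliers and -1 for outliers in all three models
    labels = det.fit_predict(X)
    print(f"{name}: {np.sum(labels == -1)} points flagged as anomalies")
```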
Although these methods were effective in detecting anomalies, they suffered from oversimplified assumptions about data representations and poor algorithm scalability. They make different assumptions about "normal" behavior, which depend heavily on the data patterns of particular domains. Anomaly detection becomes more challenging when knowledge about the anomalies of a specific system is insufficient or unrepresentative. These methods face the challenge that labelling data or establishing a "normal" baseline requires domain knowledge, and the time-evolving nature of data makes detection even harder. Furthermore, different types of data exhibit different abnormal behavior; for example, sequential anomalies are objects or instances evolving over time steps, which are obviously different from anomalies in images or video surveillance. The large amounts of data produced by pervasive data-collecting devices are often unlabeled, which poses computational challenges and thus requires novel and efficient approaches.
In supervised learning there is theoretical and experimental evidence that models trained with stochastic gradient descent can escape poor local minima. Although unsupervised, semi-supervised and supervised learning have achieved certain successes in anomaly detection, the weaknesses of all three are also apparent. Unsupervised learning, with no pre-classified labels, assumes that most of the data are normal and that outliers or small clusters are anomalies; however, high-dimensional data are typically sparse, so there may be many small clusters that are not necessarily outliers. Although semi-supervised learning with labels for normal data allows accurate modelling of normal behavior, the false alarm rate caused by unseen but legitimate instances remains high. Supervised learning with labels for normal data and known anomalies cannot detect unknown and emerging anomalies.

1.2. Deep Reinforcement Learning for Anomaly Detection

Recent years have seen great progress in the development of AI, especially Deep Learning (DL) and Reinforcement Learning (RL), notably when DeepMind's AlphaGo used RL to beat the human champion. Applications of RL in diverse domains have emerged since then. Introducing DL and RL into anomaly detection and combining the strengths of the two to achieve even better performance has become a trend. Inspired by the successful application of RL to many tasks, we propose an anomaly detection model based on asynchronous advantage actor-critic (A3C) reinforcement learning with an adaptable deep neural network. A3C learns policy and state-value functions simultaneously, and the reward function incentivizes the agent to collect as much reward as possible. For sequential anomalies an attention-based NN is proposed as the value-function estimator of the actor, while CNN-based models are used for video/image tasks. A DNN is employed to represent the reward function that decides whether a new observation follows a normal pattern. The objective is to establish appropriate RL models to capture objects or instances whose properties deviate from normal behavior. The proposed method can adapt to different domain tasks for better performance.
RL differs from supervised and unsupervised learning in that (1) the data are typically not independent and identically distributed, (2) agents affect the data they will receive, and (3) the feedback can be sparse and is usually delayed. The modelling process and the samples for training and testing differ from those of supervised learning, where labelled datasets are available. Further, the data distributions are often non-stationary and depend on how agents take actions; the agents essentially decide which data they will see, which increases the complexity of training.
Local minima problems persist in RL algorithms when the policy converges early to some deterministic points. One possible solution is to balance exploration and exploitation, which our model realizes using an entropy term. Deep RL methods face the challenge of solving large control and prediction problems, where the number of states may exceed 10^170 for the game of Go and is even larger for infinite continuous state spaces. Value functions have evolved from lookup tables, in which every state or state-action pair has an entry, to NN approximators: there are too many states and/or actions to store in memory, it is too slow to learn the value of each state individually, and states are often not fully observable. The proposed asynchronous global/worker model attempts to improve efficiency by exploiting parallel computing.

2. Related Research

We surveyed three types of papers: reviews, general-purpose RL, and RL applied to anomaly detection. Comprehensive surveys on network anomaly detection, covering algorithms, experiments and analyses, were given in [10,11,12,13]. Deep learning for anomaly detection was surveyed in [14,15]. Sequential anomaly detection using RL was studied in [16,17].
In [18,19,20], states, actions and action-based rewards were defined using RL algorithms such as dueling DQN and A3C, with the goal of optimally assigning resources. To find falsifying inputs for cyber-physical systems (CPS), states, actions and rewards were defined based on inputs, outputs and a function of the past-dependent output signal, respectively, using Double DQN and A3C algorithms [21]. The authors of [22] proposed a PMU-RL method to balance energy consumption and efficiency, which showed promising results for heterogeneous computing platforms. The papers [23,24,25,26] can be summarized as follows: states were defined according to system status, agents took certain actions, and rewards were fed back to the agents; the goal was to maximize rewards for better performance using RL algorithms such as DQN, A3C and TRPO. To detect malicious websites (URLs), states were defined as a vector-space representation of website features such as HTTPS protocol, IP address, and prefix or suffix in URLs, and actions were 0 or 1 (0 for a benign, 1 for a phishing URL) [27]; the goal was to obtain the most reward based on the actions using the DQN algorithm. Resource optimization of smart grids, motor anomalies and biological data with RL was addressed in [28,29,30], respectively. Network intrusion detection based on RL was proposed in [31,32,33,34]. Although all these papers claimed the effectiveness of their general-purpose RL models, independent replications and extensive comparisons with SOTA models would be more persuasive.
Methods or frameworks were proposed for anomaly detection in [35,36,37]. Auto-encoders were used for anomaly detection and achieved promising performance [38,39,40]; more extensive comparisons with SOTA models could further support the efficacy of these models. Anomaly detection based mainly on LSTM models was proposed in [41,42,43,44]; research suggests LSTM may be replaced by attention models, so these authors could consider modelling with attention mechanisms such as transformers. Generative methods such as GAN were proposed for anomaly detection in [45,46]. An emphatic approach to the problem of off-policy temporal-difference learning was given in [47]. Comparisons with existing models could be carried out more extensively, especially for novel approaches. The authors of [48] proposed a method tested on the CICIDS dataset with a precision of 99.1%. A weighted Extreme Learning Machine (ELM) method was proposed for intrusion detection with a reported precision of around 99% in [49,50]; tests on more benchmark datasets and additional metrics would corroborate these conclusions. A2C was proposed for the sequential anomaly detection task in [51,52]. That the DQN algorithm suffers from previously unrecognized overestimation was confirmed in [53], and the proposed adaptation reduced overestimation and improved performance. Since A3C may improve performance over A2C, papers based on A2C should provide more evidence that their models are superior to A3C.
All authors claimed that the test results of their proposed approaches outperformed existing models in their respective domains. However, at least two problems were not addressed satisfactorily by these solutions. First, the NN structures within the RL framework (especially the NNs for state values and actions) were either unclear or simply the original fully connected NNs. Second, comparisons between A3C-based RL and GAN models were insufficient. This paper therefore aims to propose appropriate NN structures of RL for specific domains and to compare the performance of the state-of-the-art A3C with GAN in anomaly detection.
Contributions of this paper are summarized as follows:
1. We elaborated the basic principles and modelling processes of multiple related DL and RL methods and gave a comparative introduction to the strengths and weaknesses of the individual methods.
2. We proposed an improved RL architecture based on the A3C framework using appropriate policy and value DNNs for specific domains: for sequential tasks, a recurrent/TCN structure with an attention mechanism for the state value, and for images, a CNN.
3. Performance of the proposed anomaly detection models was compared with machine learning models, GAN and RL model variants. Empirical tests on three benchmark datasets showed that our proposed method outperformed, or at least was comparable to, the SOTA models in detecting anomalies.
In Section 1, we briefly introduced the background information. Related research is reviewed in Section 2. Methodologies are presented in Section 3. We test our models with experiments in Section 4. In Section 5, we present conclusions and discuss possible future developments.

3. Preliminaries

3.1. Reinforcement Learning

We briefly introduced some key concepts of agent, environment, state, policy, reward and state/action value function.
The key success factors of RL applied to anomaly detection are the problem definitions (the environments, states, actions and rewards), appropriate modelling architectures and the training samples for the respective domains. In RL, agents sense the environment in discrete time steps and map the inputs to state information. RL agents execute actions and observe the feedback from the environment in the form of positive or negative rewards. Agents first act based on states and receive a reward, then observe state changes in the environment and update the policy to optimize the potential reward (see Figure 1 for the interaction between agent and environment). The goal is to find the optimal policy, i.e., the one that maximizes the rewards obtained over time.
Agents can be classified as value-based without a policy, policy-based without a value function, and actor-critic with both policy and value function. State-of-the-art agents include Deep Deterministic Policy Gradient (DDPG), DQN with memory replay and a target network, A2C and A3C. Policies define the behavior of agents and can be viewed as mapping functions from states to actions. The policy is modeled as p(a|s), the probability of the agent taking action a in a given state s. Policies can be realized as NNs whose inputs are states and whose outputs are actions; the NN models map from states/observations and actions to future states/observations. The state-value function of an MDP (Equation (1) [54]) is the expected return starting from state s and then following policy π (G_t is the total reward from state s to the end of the episode, γ is the decaying factor):
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]
An MDP under a fixed policy becomes an MRP, which can be solved using matrix computation; "fixed" means the reward is a function of the (state, action) pair rather than a probability distribution. The value function is expressed as in Equation (2) [54]. Given an MDP M = (S, A, P, R, γ) and a policy π, the state sequence S_1, S_2, … is a Markov process (S, p_π), and the state-reward sequence S_1, R_2, S_2, … is an MRP (S, p_π, R_π, γ).
v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big\{ r(s,a) + \gamma \sum_{s' \in S} T(s' \mid s,a)\, v_\pi(s') \Big\} = r_s^{\pi} + \gamma \sum_{s' \in S} T_{ss'}^{\pi}\, v_\pi(s')
where r_s^π stands for the reward at state s under policy π, and T_{ss'}^π is the state-transition matrix from state s to s' under policy π. The value function can be expressed in matrix form as v_π = r_π + γ T_π v_π, where v_π is a vector and T_π is the transition matrix. The iterative method is used to solve this linear system; a small numerical sketch of the solve is given below.
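As an illustration of the matrix form above, the following sketch solves a toy fixed-policy MRP both in closed form and iteratively; the transition matrix, rewards and discount are invented for the example.

```python
# Illustrative sketch (not from the paper): solving v = r + gamma * T v for a
# small fixed-policy MRP, both in closed form and with the iterative method.
import numpy as np

gamma = 0.9
T = np.array([[0.7, 0.3, 0.0],      # state-transition matrix under the policy
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 0.5, 0.0])       # expected reward in each state

# Closed form: v = (I - gamma * T)^(-1) r
v_exact = np.linalg.solve(np.eye(3) - gamma * T, r)

# Iterative (Bellman) evaluation
v = np.zeros(3)
for _ in range(500):
    v = r + gamma * T @ v

print(v_exact)   # the two solutions agree
print(v)
```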
The state-action value function q_π(s, a) (Equation (3) [54]) is the expected return starting from state s, taking action a, and then following policy π:
q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]
Figure 2 illustrated the computation process of v_π(s) and q_π(s, a). In practice the expectation is often simplified using the sampled random reward, which gives Equation (4) [54].
q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big]
Policy gradient is employed to minimize the loss of policy using stochastic gradient and is expressed in Equation (5) [54].
\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big( \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^{n} - b \Big) \nabla \log P_\theta(a_t^n \mid s_t^n)
where γ is the time-decaying factor, b is a baseline represented by V_π(s_t^n), and the inner sum of discounted rewards r_{t'}^n runs from time step t to the end of the episode. Randomness is introduced by omitting the expectation E in Q_π(s_t^n, a_t^n) = E[r_t^n + V_π(s_{t+1}^n)]. The gradient then becomes the following approximation (Equation (6) [54]), in which sampling methods are used.
\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big( r_t^n + V_\pi(s_{t+1}^n) - V_\pi(s_t^n) \Big) \nabla \log P_\theta(a_t^n \mid s_t^n)
The target network (Figure 3) in our model consists of two networks whose parameters are initially the same. The parameters of the target network are fixed to generate stable targets for training the other network by regression; after training, the updated parameters are copied back to the target network. Since sampling consumes much time, we used a replay buffer in which a large number of historical samples are stored to counter this challenge.
The DDQN network structure differs from DQN in that a target network is introduced to avoid the problem of a moving training target. Initially the parameters of the target network Q̂ are copied from the current network Q, and the target network parameters are then frozen to generate the expected q value r_t + Q_π(s_{t+1}, π(s_{t+1})). The training objective is to minimize the difference between Q_π(s_t, π(s_t)) and r_t + Q_π(s_{t+1}, π(s_{t+1})) using stochastic gradient descent.
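The frozen-target mechanism can be sketched as follows with a hypothetical linear Q-function in NumPy; the feature dimension, reward scheme and update period are placeholders, not the settings used later in the paper.

```python
# Sketch of the frozen-target idea described above, using a hypothetical linear
# Q-function Q(s, a) = s @ W[:, a]; not the paper's actual network.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 8, 5, 0.95, 0.01
W = rng.normal(scale=0.1, size=(n_features, n_actions))   # current network
W_target = W.copy()                                        # frozen target copy

def q_values(W, s):
    return s @ W

for step in range(1, 2001):
    s, s_next = rng.normal(size=n_features), rng.normal(size=n_features)
    a = int(np.argmax(q_values(W, s)))                     # greedy action
    r = rng.choice([1.0, 0.0])                             # toy reward
    # next action chosen by the current net, evaluated by the frozen target net
    a_next = int(np.argmax(q_values(W, s_next)))
    y = r + gamma * q_values(W_target, s_next)[a_next]     # r + Q_target(s', pi(s'))
    td_error = y - q_values(W, s)[a]
    W[:, a] += lr * td_error * s                           # SGD step on (y - Q(s,a))^2
    if step % 500 == 0:
        W_target = W.copy()                                # periodic parameter copy
```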
Actor-critic network structures combine the DDQN (actor) and the state-value function (critic). The critic, as in Q-learning, does not directly determine the action; it evaluates the goodness of the actor using the cumulative reward after visiting state s. When state s is input to the critic V_π, the output scalar V_π(s) represents the value of that state. The main methods of evaluating V_π(s) are the Monte Carlo (MC) and temporal-difference (TD) approaches. Their results differ because of different sampling: MC samples complete trajectories and their returns, while TD looks only a few steps ahead; MC has an unbiased mean and large variance, while TD has small variance and a biased mean. Importance sampling enables the change from on-policy to off-policy learning when the reacting actor's policy is hard to estimate: the agent that learns and the agent that interacts with the environment are the same for on-policy methods and different for off-policy methods. Proximal Policy Optimization [18] and Trust Region Policy Optimization [18] ensure that the parameters of the demonstration policy are not far from those of the reacting actor in the off-policy scenario. Model-free RL such as Q-learning uses a target network to train and approximate the q function, whereas model-based approaches include Bayesian model-based reinforcement learning and Bayes-adaptive RL. The policy gradient objective is expressed in Equation (7) [54].
J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}(\theta, \theta'), \quad \text{where } J^{\theta'}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}} \Big[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t) \Big]
where the KL divergence between θ and θ′ measures the difference in behavior and β acts as a regularizer to control that difference.
The difference between A2C and A3C is that A3C is asynchronous. A3C consists of several independent workers with their own weights that interact with copies of the environment in parallel; in theory they can therefore explore a larger state-action space in much less time. The workers are trained in parallel and periodically update the global network, which holds the shared parameters, at "asynchronous" times. Each worker explores the environment, computes its own update and sends it back to the global network. When a worker sends back an update, the global network updates itself and then the worker: the workers align their parameters with the freshly updated global network after each update. Parameter information thus flows from the workers to the global network and indirectly between workers, since each worker resets its weights according to the global network.
We propose using A3C as our architecture, where the actor and critic NNs are analogous to the generator and discriminator in GAN, respectively. The actor realizes the policy function and the critic NN realizes the state-value function. The training objective is to minimize the sum of the losses of these two functions, and both are trained in a single iteration.
Different from existing designs that use fully connected NNs for the actor and critic, we propose more appropriate NNs for these roles. Traditionally, CNNs fit image tasks and RNNs fit sequential tasks, so LSTM is a natural candidate for the actor. However, NNs with attention mechanisms have recently achieved remarkable success, and some research [18] even claims that LSTM should be replaced by attention. In this work, therefore, we designed a TCN with an attention mechanism as the anomaly detector for sequential tasks, trained with the actor-critic algorithm of RL. The NN architecture of the detector can be adapted to the corresponding task for better performance: CNNs were employed as the detector for image-related anomalies, and a TCN with attention was employed for sequential tasks.

3.2. Generative Adversarial Learning

Since previous anomaly detection studies have been based on GAN and our proposed model shares similarities with GAN [55], a GAN architecture (see Figure 4) for anomaly detection was designed to compare performance with the RL methods. A CNN + LSTM architecture was constructed for the generator and an MLP for the discriminator. The generator (G) and discriminator (D) were trained using the two-player minimax loss defined in Equation (8).
L(D, G) = \mathbb{E}_{Z \sim p_z(Z)} \big[ \log(1 - D(G(Z))) \big] + \mathbb{E}_{X \sim p_x(X)} \big[ \log D(X) \big]
G's objective is to learn the distribution of the real data by increasing the error rate of D, fooling D into judging the generated data as real. G tries to minimize E_{z~p_z(z)}[log(1 - D(G(z)))] while D tries to maximize E_{x~p_x(x)}[log D(x)].
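The minimax loss of Equation (8) can be evaluated numerically with stand-in generator and discriminator functions, as in the toy sketch below; G and D here are simple closed-form functions, not the CNN + LSTM and MLP networks described above.

```python
# Toy evaluation (assumption-laden sketch) of the minimax loss in Equation (8)
# with stand-in generator/discriminator functions instead of real networks.
import numpy as np

rng = np.random.default_rng(0)

def G(z):                 # stand-in generator: maps noise to "data"
    return 2.0 * z + 1.0

def D(x):                 # stand-in discriminator: probability that x is real
    return 1.0 / (1.0 + np.exp(-x))

z = rng.normal(size=10000)            # Z ~ p_z
x = rng.normal(loc=3.0, size=10000)   # X ~ p_x (the "real" data)

# L(D, G) = E_z[log(1 - D(G(z)))] + E_x[log D(x)]
loss = np.mean(np.log(1.0 - D(G(z)))) + np.mean(np.log(D(x)))
print(loss)   # D tries to maximize this quantity, G minimizes the first term
```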

4. Materials and Methods

4.1. Definitions

The anomaly detector d observes one state Y_t at time t, and detection can be modeled as a partially observable Markov decision process (POMDP) in which not all states can be observed. In our methods the environment is sensed by agents through observation spaces. The anomaly detector d is the agent in RL, and the action a which d takes under policy π is defined by the probability distribution (stochastic policy) π := p(A|S) given state s, where S and A denote the sets of states and actions, respectively. The policy is modeled as an NN with parameters θ, whose inputs are the states and whose outputs are the action distribution. In our task a binary classification {0, 1} is used, in which 1 indicates an anomaly and 0 otherwise. The action space is the set of possible actions d can take and be rewarded for; for example, the action space in the KDD intrusion environment is {'Probe', 'DoS', 'U2R', 'R2L', 'Normal'}. The goal is to obtain the highest reward in finite time. At time step t, the selected action â_t is compared with the ground-truth action a_t by the reward function: if â_t matches a_t, an instant reward r_t of 1 is given, otherwise the reward is 0. The performance of V_π (the value function in RL) is formalized as:
V^{\pi} = \sum_{s \in S} w^{\pi}(s) \sum_{a \in A} R(s, a)\, \pi(s, a)
where w_π(s) is the probability of the system being in state s and R(s, a) represents the cumulative reward starting from state s and action a. The performance is the expected cumulative reward following π. V_π is realized as an NN to which the current state and action are input and which returns the value of the state from the perspective of the whole episode. The optimal policy π* satisfies π* = argmax_π V_π, i.e., it achieves the best performance. π is improved consistently by learning from experience with the goal of gaining better rewards.
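The instant-reward definition above can be written directly as a small function over the KDD action space; the episode in the sketch is a made-up example, not actual dataset records.

```python
# Sketch of the instant-reward definition above for the KDD action space;
# the labels and episode below are illustrative, not real dataset records.
ACTIONS = ['Probe', 'DoS', 'U2R', 'R2L', 'Normal']

def instant_reward(selected_action: str, ground_truth: str) -> int:
    """Return 1 when the detector's action matches the labelled class, else 0."""
    return 1 if selected_action == ground_truth else 0

episode = [('DoS', 'DoS'), ('Normal', 'Probe'), ('R2L', 'R2L')]
total = sum(instant_reward(a_hat, a) for a_hat, a in episode)
print(total)   # cumulative reward over this toy episode: 2
```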
For time-series anomaly detection (see Figure 10), a sliding-window mechanism (t_i, t_{i+1}, …, t_{i+n}) was used to compute the rewards instead of the complete time series. The detector acted according to the policy and was rewarded for every window; after the sliding window swept through the series, the rewards were summed. The red rectangle in the figure represents the sliding window over which the detector acted and received rewards. If an anomaly was correctly detected according to the label, a larger reward (five points in our tests) was given, and the goal of maximizing the summed rewards then guided the training of the parameters of the policy and value networks.

4.2. Anomaly Detector Architecture

A2C consists of two independent NNs whose neurons are not shared, parameterized by θ. A2C improves value-based and policy-based algorithms by having the critic learn advantage values, which can be interpreted not only as the goodness of an action but also as the room for improvement. A3C, proposed by DeepMind [56], further improves A2C by providing asynchrony: each worker interacts with a different copy of the environment in parallel, making training faster and more flexible. A3C consists of a global agent and several individual workers, as illustrated in Figure 5. Both the global agent and the individual workers are modeled as DNNs, each having two outputs: one for the critic and another for the actor. The first output is a scalar representing the expected reward of a given state, V(s), and the second is a vector representing a probability distribution π(s, a) over all possible actions.
The proposed A3C framework as the anomaly detector can increase detection efficiency by utilizing the computing capability of multiple processors. Most current work employs basic NNs (for example, feedforward neural networks) for the actor and critic structures. We propose that the design of the NN architectures should adapt to the targeted problems: for image-related anomaly detection a CNN-based structure is more appropriate, while for sequential tasks an attention mechanism is used (see Figure 6 for the two NN architectures). The anomaly detector architecture consists of one global network and several worker agents (12 workers in our tests). The cyclic training process involves five steps: (1) each worker initially resets to the global network, (2) interacts with its respective environment, (3) estimates value and policy losses, (4) calculates gradients from the losses, with the gradients of all workers averaged to update the global network weights, and (5) returns to step (1) for another round of training until convergence or the time limit is reached.
Problem definitions, such as the environment, system states, action spaces, rewards, state transitions and state/value functions, play a vital role in a specific RL modelling task and generally depend on the task at hand. For the test datasets we defined the environment, states, actions, the dynamics of the state transitions and rewards, and we propose using an appropriate state/action value NN under the A3C framework for each type of task. The algorithm for training the A3C-based anomaly detector is presented below.
For each worker, the neural net samples batches (s_{t_1}, a_{t_1}, s_{t_1+1}), …, (s_{t_i}, a_{t_i}, s_{t_i+1}), … from the database, and the states are fed into the policy NN (parameters θ′ for the worker, θ for the global network) and the state-value NN (θ′_v for the worker, θ_v for the global network). The difference between the estimated return R_t and the value function V_t (the advantage A_t = R_t - V_t) is used to train the NN approximator of the policy function: the probability of action a_t^n is increased when the advantage is positive and decreased when it is negative. Each worker updates its own θ′ and θ′_v, and these parameters are then submitted to the global network. The derivatives with respect to the global parameters, dθ and dθ_v, are computed from the average over the workers' θ′ and θ′_v to prevent overestimation of some actions. The entropy of the policy H(π(a_j|s_j; θ′)) is added to encourage exploration and avoid early suboptimal convergence.
The neural sampler provides batches of training samples (s_{t_1}, a_{t_1}, s_{t_1+1}), …, (s_{t_i}, a_{t_i}, s_{t_i+1}), … (see Figure 7 for the training process). The detector executes action a_t under state s_t and obtains the current reward r_t and the next state s_{t+1} from the environment. The policy loss function takes the form A_t log π(â_t), the log probability of the selected action â_t under the probability distribution, weighted by the corresponding advantage value A_t. The actor and critic NNs are constructed and trained simultaneously by minimizing both the value and policy loss functions using stochastic gradient descent (SGD) (see Algorithm 1). For prediction we use the NN that implements the policy function, selecting the action with the highest probability. Common elements and differences among RL, GAN and the proposed model are summarized in Table 2.
Algorithm 1 Algorithm for anomaly detection based on A3C.
1: Initialize global shared parameters θ, θ_v and counter N := 0
2: Initialize worker thread parameters θ′, θ′_v and time step t := 1
3: while N < N_max do
4:    dθ := 0, dθ_v := 0
5:    θ′ := θ, θ′_v := θ_v
6:    t_start := t, get state s_t
7:    while (s_t is NOT terminal) and (t - t_start <= t_max) do
8:       perform action a_t = π(a_t | s_t; θ′)
9:       get reward r_t and transition to s_{t+1}
10:      t := t + 1, N := N + 1
11:   end while
12:   if s_t is a terminal state then R := 0 else R := V(s_t; θ′_v) end if
13:   for j in {t - 1, …, t_start} do
14:      R := r_j + γR
15:      dθ := dθ + ∂[log π(a_j | s_j; θ′)(R - V(s_j; θ′_v))]/∂θ′ + β ∂H(π(a_j | s_j; θ′))/∂θ′
16:      dθ_v := dθ_v + ∂(R - V(s_j; θ′_v))²/∂θ′_v
17:   end for
18:   update the global θ and θ_v with dθ and dθ_v asynchronously
19: end while
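A minimal single-worker sketch of the accumulation loop in Algorithm 1 is given below, using a linear softmax policy and a linear value function in place of the DNNs, and an invented toy environment; it mirrors the advantage, entropy and value-loss terms of the inner for loop but is not the implementation used in our experiments.

```python
# Single-worker NumPy sketch of Algorithm 1 with linear stand-ins for the DNNs;
# the toy environment and all hyperparameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 6, 2
gamma, beta, lr = 0.9, 0.01, 0.05
theta = np.zeros((n_actions, n_features))     # policy parameters
theta_v = np.zeros(n_features)                # value-function parameters

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def toy_env_step(s, a):
    """Hypothetical environment: reward 1 when the action matches a hidden rule."""
    r = 1.0 if a == int(s[0] > 0) else 0.0
    return rng.normal(size=n_features), r

for episode in range(200):
    d_theta = np.zeros_like(theta)
    d_theta_v = np.zeros_like(theta_v)
    s = rng.normal(size=n_features)
    trajectory = []
    for t in range(20):                              # roll out t_max steps
        p = policy(s)
        a = rng.choice(n_actions, p=p)
        s_next, r = toy_env_step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    R = theta_v @ s                                  # bootstrap from V(s_t)
    for s_j, a_j, r_j in reversed(trajectory):
        R = r_j + gamma * R
        p = policy(s_j)
        adv = R - theta_v @ s_j                      # advantage R - V(s_j)
        grad_log = -np.outer(p, s_j)                 # d log pi(a_j|s_j) / d theta
        grad_log[a_j] += s_j
        logp = np.log(p + 1e-8)
        dH_dz = -p * (logp - np.dot(p, logp))        # d H(pi) / d logits
        entropy_grad = np.outer(dH_dz, s_j)          # chain rule: logits = theta @ s_j
        d_theta += adv * grad_log + beta * entropy_grad
        d_theta_v += 2.0 * adv * s_j                 # negative gradient of (R - V)^2
    theta += lr * d_theta                            # "update the global asynchronously"
    theta_v += lr * d_theta_v
```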
We propose replacing the original MLP with a CNN for anomaly detection in the image/video domain and using an attention mechanism for sequential tasks. The inputs to the CNN are three-dimensional tensors representing images or video frames. The network has three convolutional layers which convolve the inputs with numbers of filters and strides that depend on the task, followed by a fully connected hidden layer with a fixed number of units projecting to the Q-values. All layers are separated by Rectified Linear Unit (ReLU) activations, and the RMSProp optimizer with momentum is employed to train the network. Using an attention mechanism instead of LSTM for sequential tasks improves efficiency while retaining effectiveness.
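A sketch of such a three-convolution Q-value network in tf.keras follows; the paper does not specify exact filter counts, strides, input size or learning rate, so the numbers below are assumptions chosen only to illustrate the layer pattern (three convolutions, a fully connected layer, ReLU activations and RMSProp with momentum).

```python
# Sketch (assumed hyperparameters) of the three-convolution Q-value network
# described above, written with tf.keras; filter counts, strides and input
# size are placeholders, not values taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers

n_actions = 5          # e.g. the five classes of the NSL-KDD action space

model = tf.keras.Sequential([
    layers.Input(shape=(84, 84, 3)),                       # 3-D tensor: image/frame
    layers.Conv2D(32, kernel_size=8, strides=4, activation="relu"),
    layers.Conv2D(64, kernel_size=4, strides=2, activation="relu"),
    layers.Conv2D(64, kernel_size=3, strides=1, activation="relu"),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),                  # fully connected hidden layer
    layers.Dense(n_actions)                                # projects to the Q-values
])

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=2.5e-4, momentum=0.95),
    loss="mse",
)
model.summary()
```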

5. Results

We conducted evaluation tests on three benchmark datasets (AWID, NSL-KDD and a time-series dataset) and compared with other methods to demonstrate the effectiveness of our method. We also tested the recent datasets 'CIC-IDS-2017' [57] (https://www.unb.ca/cic/datasets/ids-2017.html, accessed on 1 February 2021) and 'CIRA-CIC-DoHBrw-2020' [58] (https://www.unb.ca/cic/datasets/dohbrw-2020.html, accessed on 1 February 2021) from the Canadian Institute for Cybersecurity. The number of samples for DDoS was 225,745 with 78 attributes; 1 redundant attribute was deleted during preprocessing. Although the NSL-KDD dataset is old and some research papers even argue against using it, it remains the benchmark for network traffic detection and drove the development of network intrusion detection. Furthermore, the issue of low detection rates for U2R and R2L remains: the detection rates for these two anomalies were very low even for the NSL-KDD cup winner. A few conference papers reported greatly improved detection rates for the two anomalies, but the implementation details were missing and could not be replicated. NSL-KDD and several newer datasets, including AWID and real-world time series, were therefore included in the experiments to address this deficiency.
Metrics of accuracy, precision, recall and F1 were defined in Equation (10). P, N, TP, FP, TN, FN stand for positive, negative, true positive, false positive, true negative, and false negative, respectively. Precision (also Positive Predictive Value) is the fraction of relevant instances among the retrieved instances, while recall (or sensitivity) is the fraction of the total amount of relevant instances that were actually retrieved.
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
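A direct implementation of Equation (10) is sketched below as a quick check of the definitions; the confusion counts are arbitrary numbers used only to exercise the formulas.

```python
# Direct implementation of the metrics in Equation (10); the confusion counts
# below are arbitrary numbers used only to exercise the formulas.
def metrics(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(metrics(tp=90, fp=10, tn=880, fn=20))
# (0.97, 0.9, 0.8181..., 0.8571...)
```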

5.1. AWID Datasets Test

The AWID datasets focus on intrusion detection for wireless networks (http://icsdweb.aegean.gr/awid/index.html, accessed on 1 February 2021). We used the reduced "AWID-CLS-R-Trn" and "AWID-CLS-R-Tst" as the training and test datasets, respectively. Each original record includes 154 attributes and 1 class attribute which denotes whether the record is normal or under attack. The complete feature set should be analyzed to separate useful features from noise, which improves prediction capability and reduces complexity. We preprocessed the complete feature set into 46 features and 1 class attribute, excluding invalid and misleading features. The attacks are organized into normal traffic and three attack types, namely flooding, impersonation and injection. The numbers of the four kinds of samples are listed in Table 3.
Ten-fold cross-validation was used for training and testing. The dataset was reshuffled randomly to avoid sample similarity within groups and divided into 10 groups, with 9 of them used for training and the remaining group for testing; the process was repeated 10 times. The Keras early stopping callback was used during training to avoid overfitting. Five optimizers (Adam, RMSprop, Adagrad, Adadelta and SGD) were tested; the best-performing optimizer, Adam, the categorical cross-entropy loss for multi-class classification and a learning rate of 10^-4 were chosen. The MLP design was straightforward, with three fully connected layers, dropout, the Adam optimizer, MSE loss, a batch size of 100 and around ten thousand weights (with early stopping) back-propagated to reach the minimum loss rate.
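The cross-validation and early-stopping procedure can be sketched as follows with scikit-learn and tf.keras; the MLP layer sizes, dropout rate and synthetic data are stand-ins, not the exact network or the AWID records.

```python
# Sketch of the ten-fold procedure with the Keras early-stopping callback; the
# MLP below (layer sizes, dropout rate) and the random data are placeholders.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

X = np.random.rand(1000, 46).astype("float32")                        # 46 preprocessed features
y = tf.keras.utils.to_categorical(np.random.randint(0, 4, 1000), 4)   # 4 classes

def build_mlp():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(46,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = build_mlp()
    model.fit(X[train_idx], y[train_idx], validation_split=0.1,
              epochs=100, batch_size=100, callbacks=[early_stop], verbose=0)
    scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
print("mean 10-fold accuracy:", np.mean(scores))
```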
We compared the detection performance of the proposed model with MLP, SMOTE and adversarial RL models; the results are listed in Table 4.

5.2. Time Series Anomaly Test

Sequential anomaly detection tasks require learning from experience on the fly. Rewards and discounts are in theory part of the problem definition, but in practice they are well-designed parameters that can be tweaked. Two datasets were used for this test. The first contains 367 Yahoo time series, each labeled as normal or anomalous, which were used as environments for training the anomaly detector; the action space is {anomaly, normal}. The test set consists of 58 simulated and real-world time series (such as traffic, exchange rates and tweets) containing anomalies.
The sliding-window mechanism was used to detect anomalies in a given time series. The window size for the state and reward functions was set to 25; the reward was 5 points for a true positive, -1 point for a false positive (false alarm), 1 point for a true negative and -5 for a false negative (missed alarm). The red and green 3-D plots in Figure 8 represent the different action values when the detector adopted different actions. The discount factor, exploration epsilon and number of episodes were set to 0.8, 0.1 and 1000, respectively. The test results showed that the loss rates for the two actions were below 3%.
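The window-level reward scheme stated above can be written as a small lookup, as sketched below; the toy episode of (decision, ground truth) pairs is invented for illustration.

```python
# Sketch of the window-level reward scheme stated above (5 / -1 / 1 / -5);
# 'windows' below is a made-up toy episode.
WINDOW_SIZE = 25
REWARD = {"TP": 5, "FP": -1, "TN": 1, "FN": -5}

def window_reward(predicted_anomaly: bool, window_has_anomaly: bool) -> int:
    if predicted_anomaly and window_has_anomaly:
        return REWARD["TP"]
    if predicted_anomaly and not window_has_anomaly:
        return REWARD["FP"]          # false alarm
    if not predicted_anomaly and window_has_anomaly:
        return REWARD["FN"]          # missed alarm
    return REWARD["TN"]

# toy episode: (detector decision, ground truth) per sliding window
windows = [(False, False), (True, True), (True, False), (False, True)]
print(sum(window_reward(p, t) for p, t in windows))   # 1 + 5 - 1 - 5 = 0
```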
We compared the performance of the proposed anomaly detector with a time-series anomaly detector using an RNN and Q-learning. The results for the second dataset are shown in Figure 9, where the red and blue curves in (a) represent the real and forecasted values for training and testing; the blue-shaded area around 4500 indicates anomalies, and curve (b) shows the smoothed error between real and forecasted values. The best test score using the RNN with Q-learning was precision = 52.3%, recall = 100% and F1 = 68.9% over the 58 labeled real-world and artificial time-series files, as demonstrated in Figure 10; the precision over the whole test set was unsatisfactorily low. The same test set was run on the proposed anomaly detector, and the best score was precision = 75.3%, recall = 95.8% and F1 = 84.3%. The blue, green and red curves in Figure 10 represent the targeted time series with the best scores, the actions and the rewards based on the actions, respectively.

5.3. NSL-KDD Network Anomaly Test

This dataset organizes different attacks into four types, namely DoS, Probe, R2L and U2R. For example, 'neptune', 'teardrop', 'land', etc. are grouped under DoS; 'portsweep', 'nmap', etc. under Probe; 'guess_passwd', 'named', etc. under R2L; and 'buffer_overflow', 'rootkit', etc. under U2R.
Generally, a dataset needs to be preprocessed before the RL modelling process: exploratory data analysis provides statistics (such as the number of features and records), data cleaning removes invalid features, numeric features are standardized and the high-dimensional data are visualized. We applied the t-SNE approach to the NSL-KDD training dataset, which consists of 125,974 data points with 122 features each, classified as either 1 or 0. The t-SNE plot in Figure 11 shows a relatively clear boundary for the binary classification (red stands for 0, black for 1). Nine traditional anomaly detection methods were compared on NSL-KDD training samples after dimensionality reduction; the red circles in Figure 12 show the rather unclear decision boundary of each detector.
The datasets are used as the environments in the RL tasks and are split into training and test sets. The NSL-KDD dataset has 122 features and 5 classes (4 anomaly classes and 1 normal), and the 122 observed features are used as the system states. The four anomaly classes (Probe, DoS, U2R, R2L) and the normal class are defined as the action space. Correctly recognized anomalies are rewarded with 3 points and -3 points otherwise.
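A minimal gym-style sketch of this environment definition follows; the placeholder feature and label arrays, and the random agent in the usage example, are illustrative assumptions.

```python
# Minimal gym-style environment sketch for the setup above: 122-feature states,
# a 5-way action space and a +3/-3 reward; the data arrays are placeholders.
import numpy as np

ACTIONS = ["Probe", "DoS", "U2R", "R2L", "Normal"]

class NSLKDDEnv:
    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self.features, self.labels = features, labels    # shapes (N, 122), (N,)
        self.i = 0

    def reset(self):
        self.i = 0
        return self.features[self.i]

    def step(self, action: int):
        reward = 3 if ACTIONS[action] == self.labels[self.i] else -3
        self.i += 1
        done = self.i >= len(self.features)
        next_state = None if done else self.features[self.i]
        return next_state, reward, done

# usage with random placeholder data and a random agent
env = NSLKDDEnv(np.random.rand(100, 122), np.random.choice(ACTIONS, 100))
state, total, done = env.reset(), 0, False
while not done:
    state, r, done = env.step(np.random.randint(len(ACTIONS)))
    total += r
print(total)
```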
Machine learning algorithms such as Random Forest (RF), Bayes, AdaBoost, etc. and our proposed method all achieved around 99 percent precision for the 'DDoS' and 'Portscan' anomalies. For the imbalanced classes (with far fewer negative samples) 'Bot', 'web attack' and 'infiltration', the precision for benign traffic was around 99 percent, and the precisions for the three web attacks (brute force, XSS, SQL injection) were around 70%, 45% and 99% with RF. Our proposed method achieved the same precision for benign traffic and slightly better accuracy for the three web attack classes. Due to page limits we do not display all the result data. Classification results using RF are displayed in Figure 13; our proposed method also achieved nearly the same precision of 99%.
Detection rates for Probe and DoS using KNN and C4.5 were acceptable, while those for U2R and R2L were unsatisfactorily low, as shown in the confusion matrix (Figure 14b), where U2R was often misclassified as Probe and R2L as normal. The test scores were low for attacks such as "warezmaster", "saint" and "processtable" in Figure 14a. Although the detection rates of these anomalies using RL have already increased to around 20 percent, surpassing the results of the KDD-CUP winner (13.2% for U2R, 8.4% for R2L in [59,60]), there is still much room for improvement. The relatively low number of training samples, and the fact that these anomalies embed their data in the payload of normal or Probe traffic, might explain the poor performance of the anomaly detector. Some papers [61,62] reported greatly improved detection rates, but the implementation details supporting those conclusions were not very clear; even if improved performance were confirmed, potential overfitting behind the high detection rates and low generalization ability to other tasks would remain concerns.
A statistical one-sample sign test was applied to the results. Each method was repeated 10 times (with the same parameters and random seed), the raw data were checked to rule out overfitting errors, and the average was taken to represent the final result. The average became the null hypothesis: we counted how many runs scored above the average, with the expected frequency being 5 (five below and five above the average). The chi-square value, degrees of freedom and an alpha level of 0.05 determined whether the hypothesis should be rejected. The average hypotheses were accepted, and the ranges were ±0.5% for the metrics of accuracy, precision and recall.
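The counting test described above can be sketched with a chi-square comparison of the above/below-average frequencies against the expected 5/5 split; the ten scores below are invented values, not our measured results.

```python
# Sketch of the counting test described above: compare how many of the 10 runs
# fall above the run average against the expected 5/5 split; the scores are
# invented numbers used only to illustrate the procedure.
from scipy.stats import chisquare

precision_runs = [0.990, 0.993, 0.988, 0.991, 0.992, 0.989, 0.990, 0.994, 0.987, 0.991]
mean = sum(precision_runs) / len(precision_runs)          # null hypothesis value

above = sum(1 for x in precision_runs if x > mean)
below = len(precision_runs) - above
stat, p_value = chisquare([above, below], f_exp=[5, 5])   # 1 degree of freedom

alpha = 0.05
print(f"chi-square={stat:.3f}, p={p_value:.3f}, reject mean hypothesis: {p_value < alpha}")
```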
A comparison of the rewards and losses of the proposed model with other RL methods is illustrated in Figure 15; the blue curve with circles represents the proposed model, which surpassed the existing RL models in terms of rewards and loss rates. The training time for the RL models was around 25 s per epoch. We trained for 500 epochs and found that performance stopped improving after 100 epochs (at which point we considered the model converged), so the average time for training one model was around 2500 s. We compared the training time for the whole 300 epochs between the proposed model and the others (see Figure 16). The detection (test) time was about 15 s for 22,544 samples. Accuracy, precision, recall and F1 results are listed in Table 5, where numbers in bold mark the best score for each metric. The proposed model outperformed the other methods in terms of accuracy, precision and F1 score.
Inspired by [55], we compared anomaly detection performance between GAN and the proposed model on the grounds that (1) the actor and critic NNs function similarly to the generator and discriminator NNs of GAN, respectively, and (2) both employed the same techniques of "freezing learning" and "batch normalization". In the proposed RL models the discount γ was set to 0.95, the learning rate α = 0.002 and the exploration rate to 0.85. The number of steps between target-network updates was τ = 1000, and the overall training comprised 10 M steps. The agent was evaluated every 100 K steps, and the best policy across these evaluations was kept as the output of the learning process. The size of the experience replay memory was 100 K tuples; the memory was sampled to update the network every four steps with minibatches of size 32. The exploration policy was an ε-greedy policy with ε decreasing linearly from 1 to 0.1 over 100 K steps.

6. Discussion

The proposed adaptable RL architecture based on the A3C framework employs appropriate policy and value DNNs for specific domains: a TCN structure with attention for sequences and a CNN for images or video. In theory, for a traditional RNN or its LSTM variant, the hidden-state matrix is O(d²), where d is the hidden dimension and the input length n is often the sequence length or sliding-window size, so the time complexity is O(n × d²); typically d = 1000 and n is of order less than 100 (the sliding window we used in tests is 25), giving O(25 × 1000²). The complexity is thus dominated by the input length and the hidden-state matrix. In contrast, for the proposed TCN with attention the complexity is O(n² × d), i.e., O(25² × 1000). Since the input length is far smaller than the hidden dimension, the TCN with attention is much more time-efficient than the traditional counterparts. In practice, we compared the performance of LSTM and attention mechanisms and found that the attention model reduced training time to around 80 percent of the LSTM's while accuracy remained comparable [63]. That was one reason we chose a TCN with attention instead of LSTM for anomaly detection in this paper.
Previous anomaly detection approaches were challenged by many problems, especially data labeling and sequence prediction. Applying RL provides a new angle on anomaly detection tasks, and the test results indicated the effectiveness of this method. Actor-critic is a promising architecture in that the actor can select actions with little computation, especially for continuous actions, and a stochastic policy can be learned that yields the best probabilities over all kinds of actions.
Unlike supervised learning, whose development was accelerated by large labeled datasets such as ImageNet, RL lacks standardized environments (the equivalent of labeled datasets), which might hinder its application to the anomaly detection field and also makes it difficult to reproduce published research and compare results.

7. Conclusions

The proposed anomaly detection model consists of one global network and twelve workers, each provided with a copy of the environment, trained independently and then updating the global network asynchronously. An attention mechanism was proposed for the actor network of the global agent and of each worker when detecting sequential anomalies, while a CNN architecture was used for detecting image and video anomalies. The critic was constructed as a feed-forward NN with three layers connected by ReLU activation functions. A dynamic learning rate was used to update the weights of the actor under the guidance of the critic: a larger learning rate was set initially and decayed over time to make the training process fast and stable.
Test results based on three benchmark datasets showed our model's effectiveness in detecting anomalies, with performance surpassing, or at least comparable to, many existing counterparts.

Author Contributions

Conceptualization, K.Z. and W.W.; methodology, K.Z.; software, K.Z., T.H.; validation, K.Z., T.H. and K.D.; formal analysis, K.Z.; investigation, K.Z., T.H.; resources, W.W.; data curation, K.Z., T.H.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z., T.H.; visualization, K.Z., T.H.; supervision, W.W.; project administration, W.W.; funding acquisition, K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by Institute of Computer Application, China Academy of Engineering Physics(CAEP), grant number XH35 and SJ2019A05.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
A3C: Asynchronous Advantage Actor-Critic
A2C: Advantage Actor-Critic
(D)RL: (Deep) Reinforcement Learning
MDP: Markov Decision Process
MRP: Markov Reward Process
LSTM: Long Short-Term Memory
TCN: Temporal Convolutional Network
GAN: Generative Adversarial Network
NN: Neural Network
RNN: Recurrent Neural Network
MLP: Multi-Layer Perceptron
FCN: Fully Convolutional Network
(D)DQN: (Double) Deep Q-Network
MC: Monte Carlo
TD: Temporal Difference
CNN: Convolutional Neural Network

References

  1. Zimek, A.; Schubert, E. Outlier Detection. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: New York, NY, USA, 2017; pp. 1–5. [Google Scholar] [CrossRef]
  2. Moya, M.M.; Hush, D.R. Network constraints and multi-objective optimization for one-class classification. Neural Netw. 1996, 9, 463–474. [Google Scholar] [CrossRef]
  3. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185. [Google Scholar] [CrossRef] [Green Version]
  4. Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. 2000, 29, 93–104. [Google Scholar] [CrossRef]
  5. Goldstein, M.; Uchida, S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 2016, 11, e0152173. [Google Scholar] [CrossRef] [Green Version]
  6. Liu, F.T.; Ting, K.M.; Zhou, Z. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422. [Google Scholar] [CrossRef]
  7. Lazarevic, A.; Kumar, V. Feature Bagging for Outlier Detection. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining; Association for Computing Machinery, Chicago, IL, USA, 21–24 August 2005; KDD’05. pp. 157–166. [Google Scholar] [CrossRef]
  8. Said Elsayed, M.; Le-Khac, N.A.; Dev, S.; Jurcut, A.D. Network Anomaly Detection Using LSTM Based Autoencoder. In Proceedings of the 16th ACM Symposium on QoS and Security for Wireless and Mobile Networks, Alicante, Spain, 16–20 November 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 37–45. [Google Scholar]
  9. Di Mattia, F.; Galeone, P.; De Simoni, M.; Ghelfi, E. A survey on gans for anomaly detection. arXiv 2019, arXiv:1906.11632. [Google Scholar]
  10. Moustafa, N.; Hu, J.; Slay, J. A holistic review of Network Anomaly Detection Systems: A comprehensive survey. J. Netw. Comput. Appl. 2019, 128, 33–55. [Google Scholar] [CrossRef]
  11. Fernandes, G.; Rodrigues, J.J.; Carvalho, L.F.; Al-Muhtadi, J.F.; Proença, M.L. A comprehensive survey on network anomaly detection. Telecommun. Syst. 2019, 70, 447–489. [Google Scholar] [CrossRef]
  12. Domingues, R.; Filippone, M.; Michiardi, P.; Zouaoui, J. A comparative evaluation of outlier detection algorithms: Experiments and analyses. Pattern Recognit. 2018, 74, 406–421. [Google Scholar] [CrossRef]
  13. Wang, H.; Bah, M.J.; Hammad, M. Progress in Outlier Detection Techniques: A Survey. IEEE Access 2019, 7, 107964–108000. [Google Scholar] [CrossRef]
  14. Chalapathy, R.; Chawla, S. Deep Learning for Anomaly Detection: A Survey. arXiv 2019, arXiv:1901.03407. [Google Scholar]
  15. Bulusu, S.; Kailkhura, B.; Li, B.; Varshney, P.K.; Song, D. Anomalous Example Detection in Deep Learning: A Survey. IEEE Access 2020, 8, 132330–132347. [Google Scholar] [CrossRef]
  16. Xu, X. Sequential anomaly detection based on temporal-difference learning: Principles, models and case studies. Appl. Soft Comput. J. 2010, 10, 859–867. [Google Scholar] [CrossRef]
  17. Oh, M.H.; Iyengar, G. Sequential anomaly detection using inverse reinforcement learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 1480–1490. [Google Scholar] [CrossRef]
  18. He, Y.; Richard Yu, F.; Zhao, N.; Leung, V.C.; Yin, H. Software-Defined Networks with Mobile Edge Computing and Caching for Smart Cities: A Big Data Deep Reinforcement Learning Approach. IEEE Commun. Mag. 2017, 55, 31–37. [Google Scholar] [CrossRef]
  19. Zhu, H.; Cao, Y.; Wang, W.; Jiang, T.; Jin, S. Deep Reinforcement Learning for Mobile Edge Caching: Review, New Features, and Open Issues. IEEE Netw. 2018, 32, 50–57. [Google Scholar] [CrossRef]
  20. Xiao, L.; Wan, X.; Dai, C.; Du, X.; Chen, X.; Guizani, M. Security in Mobile Edge Caching with Reinforcement Learning. IEEE Wirel. Commun. 2018, 25, 116–122. [Google Scholar] [CrossRef] [Green Version]
  21. Akazaki, T.; Liu, S.; Yamagata, Y.; Duan, Y. Falsification of Cyber-Physical Systems; Springer: Cham, Switzerland, 2018; Volume 2, pp. 456–465. [Google Scholar]
  22. Yu, Z.; Machado, P.; Zahid, A.; Abdulghani, A.M.; Dashtipour, K.; Heidari, H.; Imran, M.A.; Abbasi, Q.H. Energy and Performance Trade-Off Optimization in Heterogeneous Computing via Reinforcement Learning. Electronics 2020, 9, 1812. [Google Scholar] [CrossRef]
  23. Han, G.; Xiao, L.; Poor, H.V. Two-dimensional anti-jamming communication based on deep reinforcement learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2087–2091. [Google Scholar] [CrossRef]
  24. Xiao, L.; Li, Y.; Han, G.; Liu, G.; Zhuang, W. PHY-Layer Spoofing Detection with Reinforcement Learning in Wireless Networks. IEEE Trans. Veh. Technol. 2016, 65, 10037–10047. [Google Scholar] [CrossRef]
  25. Wan, X.; Sheng, G.; Li, Y.; Xiao, L.; Du, X. Reinforcement Learning Based Mobile Offloading for Cloud-Based Malware Detection. In Proceedings of the 2017 IEEE Global Communications Conference, GLOBECOM, Singapore, 4–8 December 2017; Volume 2018, pp. 1–6. [Google Scholar] [CrossRef]
  26. Xiao, L.; Li, Y.; Han, G.; Dai, H.; Poor, H.V. A Secure Mobile Crowdsensing Game with Deep Reinforcement Learning. IEEE Trans. Inf. Forensics Secur. 2018, 13, 35–47. [Google Scholar] [CrossRef]
  27. Chatterjee, M.; Namin, A.S. Detecting phishing websites through deep reinforcement learning. In Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA, 15–19 July 2019; Volume 2, pp. 227–232. [Google Scholar] [CrossRef]
  28. Kurt, M.N.; Ogundijo, O.; Li, C.; Wang, X. Online Cyber-Attack Detection in Smart Grid: A Reinforcement Learning Approach. IEEE Trans. Smart Grid 2018, 10, 5174–5185. [Google Scholar] [CrossRef] [Green Version]
  29. Lu, H.; Li, Y.; Mu, S.; Wang, D.; Kim, H.; Serikawa, S. Vehicles Using Reinforcement Learning. IEEE Internet Things J. 2018, 5, 2315–2322. [Google Scholar] [CrossRef]
  30. Mahmud, M.; Kaiser, M.S.; Hussain, A.; Vassanelli, S. Applications of Deep Learning and Reinforcement Learning to Biological Data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2063–2079. [Google Scholar] [CrossRef] [Green Version]
  31. Otoum, S.; Kantarci, B.; Mouftah, H. Empowering Reinforcement Learning on Big Sensed Data for Intrusion Detection. In Proceedings of the 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019. [Google Scholar] [CrossRef]
  32. Lopez-Martin, M.; Carro, B.; Sanchez-Esguevillas, A. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl. 2020, 141, 112963. [Google Scholar] [CrossRef]
  33. Alauthman, M.; Aslam, N.; Al-kasassbeh, M.; Khan, S.; Al-Qerem, A.; Raymond Choo, K.K. An efficient reinforcement learning-based Botnet detection approach. J. Netw. Comput. Appl. 2020, 150, 102479. [Google Scholar] [CrossRef]
  34. Malialis, K.; Kudenko, D. Distributed response to network intrusions using multiagent reinforcement learning. Eng. Appl. Artif. Intell. 2015, 41, 270–284. [Google Scholar] [CrossRef]
  35. Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep anomaly detection with outlier exposure. In Proceedings of the 7th International Conference on Learning Representations, (ICLR), New Orleans, LA, USA, 6–9 May 2019; pp. 1–18. [Google Scholar]
  36. Wang, S.; Zeng, Y.; Liu, X.; Zhu, E.; Yin, J.; Xu, C.; Kloft, M. Effective End-to-end Unsupervised Outlier Detection via Inlier Priority of Discriminative Network. NeurIPS 2019, 1–14. Available online: https://ml.informatik.uni-kl.de/publications/2019/NeurIPS19_final.pdf (accessed on 1 February 2021).
  37. Zhu, Y.; Yang, K. Tripartite Active Learning for Interactive Anomaly Discovery. IEEE Access 2019, 7, 63195–63203. [Google Scholar] [CrossRef]
  38. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In Proceedings of the 6th International Conference on Learning Representations, (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–19. [Google Scholar]
  39. Zhou, C.; Paffenroth, R.C. Anomaly detection with robust deep autoencoders. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; ACM: New York, NY, USA, 2017. Part F1296. pp. 665–674. [Google Scholar] [CrossRef]
  40. Chen, J.; Sathe, S.; Aggarwal, C.; Turaga, D. Outlier detection with autoencoder ensembles. In Proceedings of the 17th SIAM International Conference on Data Mining, (SDM 2017), Houston, TX, USA, 27–29 April 2017; pp. 90–98. [Google Scholar] [CrossRef] [Green Version]
  41. Hundman, K.; Constantinou, V.; Laporte, C.; Colwell, I.; Soderstrom, T. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 19–23 August 2018; ACM: New York, NY, USA, 2018; pp. 387–395. [Google Scholar] [CrossRef] [Green Version]
  42. Lee, T.J.; Gottschlich, J.; Tatbul, N.; Metcalf, E.; Zdonik, S. Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection. arXiv 2018, arXiv:1801.03168. [Google Scholar]
  43. Habler, E.; Shabtai, A. Using LSTM encoder-decoder algorithm for detecting anomalous ADS-B messages. Comput. Secur. 2018, 78, 155–173. [Google Scholar] [CrossRef] [Green Version]
  44. Ergen, T.; Kozat, S.S. Unsupervised Anomaly Detection With LSTM Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3127–3141. [Google Scholar] [CrossRef] [Green Version]
  45. Li, C.; Qiu, M.; Li, C. Reinforcement Learning for Cybersecurity. Reinf. Learn. Cyber-Phys. Syst. 2019, 155–168. [Google Scholar] [CrossRef]
  46. Liu, Y.; Li, Z.; Zhou, C.; Jiang, Y.; Sun, J.; Wang, M.; He, X. Generative Adversarial Active Learning for Unsupervised Outlier Detection. IEEE Trans. Knowl. Data Eng. 2020, 32, 1517–1528. [Google Scholar] [CrossRef] [Green Version]
  47. Sutton, R.S.; Mahmood, A.R.; White, M. An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 2016, 17, 1–29. [Google Scholar]
  48. Kasim, Ö. An efficient and robust deep learning based network anomaly detection against distributed denial of service attacks. Comput. Netw. 2020, 180, 107390. [Google Scholar] [CrossRef]
  49. Awad, M.; Alabdallah, A. Addressing Imbalanced Classes Problem of Intrusion Detection System Using Weighted Extreme Learning Machine. Int. J. Comput. Netw. Commun. 2019, 11, 5. [Google Scholar] [CrossRef]
  50. Sharma, J.; Giri, C.; Granmo, O.C.; Goodwin, M. Multi-layer intrusion detection system with ExtraTrees feature selection, extreme learning machine ensemble, and softmax aggregation. EURASIP J. Inf. Secur. 2019, 2019, 15. [Google Scholar] [CrossRef] [Green Version]
  51. Bahdanau, D.; Brakel, P.; Xu, K.; Goyal, A.; Lowe, R.; Pineau, J.; Courville, A.; Bengio, Y. An Actor-Critic Algorithm for Sequence Prediction. arXiv 2016, arXiv:1607.07086. [Google Scholar]
  52. Zhong, C.; Gursoy, M.C.; Velipasalar, S. Deep actor-critic reinforcement learning for anomaly detection. In Proceedings of the 2019 IEEE Global Communications Conference, GLOBECOM 2019, Waikoloa, HI, USA, 9–13 December 2019. [Google Scholar] [CrossRef] [Green Version]
53. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-Learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. Available online: https://dl.acm.org/doi/10.5555/3016100.3016191 (accessed on 21 February 2021). [Google Scholar]
  54. Deep Reinforcement Learning and Control Lectures. Available online: http://www.andrew.cmu.edu/course/10-403/ (accessed on 21 February 2021).
  55. Pfau, D.; Vinyals, O. Connecting generative adversarial networks and actor-critic methods. arXiv 2016, arXiv:1610.01945. [Google Scholar]
  56. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  57. Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In Proceedings of the International Conference on Information Systems Security and Privacy (ICISSP), Funchal, Portugal, 22–24 January 2018; pp. 108–116. [Google Scholar]
  58. MontazeriShatoori, M.; Davidson, L.; Kaur, G.; Lashkari, A.H. Detection of DoH Tunnels using Time-series Classification of Encrypted Traffic. In Proceedings of the 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 63–70. [Google Scholar]
  59. Sabhnani, M.; Serpen, G. Application of Machine Learning Algorithms to KDD Intrusion Detection Dataset within Misuse Detection Context. In Proceedings of the MLMTA, Las Vegas, NV, USA, 23–26 June 2003; pp. 209–215. [Google Scholar]
  60. Yang, Y.; Zheng, K.; Wu, C.; Yang, Y. Improving the classification effectiveness of intrusion detection by using improved conditional variational autoencoder and deep neural network. Sensors 2019, 19, 2528. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  61. Lin, Z.; Shi, Y.; Xue, Z. Idsgan: Generative adversarial networks for attack generation against intrusion detection. arXiv 2018, arXiv:1809.02077. [Google Scholar]
  62. Yin, H.; Xue, M.; Xiao, Y.; Xia, K.; Yu, G. Intrusion Detection Classification Model on an Improved k-Dependence Bayesian Network. IEEE Access 2019, 7, 157555–157563. [Google Scholar] [CrossRef]
  63. Zhou, K.; Wang, W.; Hu, T.; Deng, K. Time Series Forecasting and Classification Models Based on Recurrent with Attention Mechanism and Generative Adversarial Networks. Sensors 2020, 20, 7211. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Agent-environment interaction (left); action and reward sequences (right).
Figure 2. (a) Illustration of state value computation. (b) Action value.
Figure 3. Target network illustration.
Figure 4. Anomaly detection architecture based on GAN.
Figure 5. Anomaly detection architecture based on A3C.
Figure 6. Proposed network for the actor.
Figure 7. Training process for the detector and evaluator.
Figure 8. Action-value function for the actions of the time series anomaly.
Figure 9. (a) Time series anomaly test results without smoothing. (b) Smoothed error for the time series anomalies.
Figure 10. Time series anomaly detection using the RNN and Q-learning of RL.
Figure 11. t-SNE illustration using the NSL-KDD dataset.
Figure 12. Outlier decision boundary comparisons using the NSL-KDD dataset.
Figure 13. Detailed results for the CIC-IDS-2017 dataset using the RF algorithm.
Figure 14. (a) Test scores for different attacks. (b) Confusion matrix for the five anomalies.
Figure 15. Reward and loss comparisons among different models by epoch.
Figure 16. Training time comparisons among different models by epoch.
Table 2. Comparisons among RL, GAN and the proposed models.

Methods | Algorithms | Application | Features
Value-based | Q-learning, DDQN, etc. | Simple environments | Sample efficient, steady
Policy-based | Policy gradient, REINFORCE | Continuous, stochastic environments | Unstable, time consuming
Actor-Critic (AC) | Combines value and policy: the actor computes actions and the critic scores those actions with Q-values; outperforms value-based or policy-based methods used separately | Complex environments (such as 2D/3D video gaming) | Batch normalization, target network, replay buffer, entropy regularization, compatibility, etc.
GAN | Generator and discriminator | Image, video/audio, etc. | Batch normalization, label smoothing, etc.
A2C | Introduces the advantage on top of AC to evaluate how good an action is and how much room it leaves for improvement | Complex environments | Reduces the high variance of policy networks; stable
A3C | One global network and several workers sharing the same environment; introduces asynchrony on top of A2C | Complex environments | Further improves efficiency over A2C; robustness, speed
Proposed method | Similar to A3C | Complex environments | Adaptable; image: CNN, sequence: TCN
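To make the A2C/A3C rows in Table 2 concrete, the sketch below shows how a single worker could compute the advantage A(s, a) = r + γV(s') - V(s) and the combined actor-critic loss. This is only a minimal illustration assuming a PyTorch backend, a discrete action space and toy network sizes; the names `ActorCritic` and `a2c_losses` and all hyperparameters are hypothetical and are not taken from the paper's implementation.

```python
# Minimal single-step advantage actor-critic update (illustrative sketch only).
# Assumptions: PyTorch, a discrete action space, and toy tensor shapes;
# none of these details come from the paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical


class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)  # policy logits
        self.critic = nn.Linear(hidden, 1)         # state value V(s)

    def forward(self, obs: torch.Tensor):
        h = self.shared(obs)
        return self.actor(h), self.critic(h).squeeze(-1)


def a2c_losses(model, obs, action, reward, next_obs, done,
               gamma=0.99, entropy_coef=0.01):
    """Actor and critic losses for a batch of one-step transitions."""
    logits, value = model(obs)
    with torch.no_grad():
        _, next_value = model(next_obs)
        target = reward + gamma * next_value * (1.0 - done)
    advantage = target - value                     # A(s, a) = r + gamma*V(s') - V(s)
    dist = Categorical(logits=logits)
    actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
    critic_loss = advantage.pow(2).mean()
    entropy_bonus = dist.entropy().mean()          # encourages exploration
    return actor_loss + 0.5 * critic_loss - entropy_coef * entropy_bonus


# Example usage with random tensors standing in for environment transitions.
model = ActorCritic(obs_dim=8, n_actions=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, next_obs = torch.randn(4, 8), torch.randn(4, 8)
action = torch.randint(0, 2, (4,))
reward, done = torch.randn(4), torch.zeros(4)
loss = a2c_losses(model, obs, action, reward, next_obs, done)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In the asynchronous A3C setting, several such workers would evaluate this loss on local copies of the network and push the resulting gradients to one shared global model, which is what the "one global network and several workers" row above summarizes.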
Table 3. Numbers of records in the AWID reduced training and test datasets.

Class | AWID-CLS-R-Trn | AWID-CLS-R-Tst
Flooding | 48,484 | 8097
Impersonation | 48,522 | 20,079
Injection | 65,379 | 16,682
Normal | 1,633,190 | 530,785
Table 4. Test results comparisons among proposed method and others.

Method | Type | Accuracy (±0.5%) | Precision (±0.5%) | Recall (±0.5%) | F1 (±0.5%)
MLP | Flooding | 0.9488 | 0.9195 | 0.6288 | 0.7469
MLP | Impersonation | 0.9326 | 0.6195 | 0.5491 | 0.5822
MLP | Injection | 0.9496 | 0.9202 | 1.00 | 0.9584
MLP | Normal | 0.9872 | 0.9498 | 1.00 | 0.9732
Adversarial RL model | Flooding | 0.9941 | 0.9452 | 0.6183 | 0.7476
Adversarial RL model | Impersonation | 0.9479 | 0.3201 | 0.4392 | 0.3703
Adversarial RL model | Injection | 0.9980 | 0.9358 | 0.9999 | 0.9668
Adversarial RL model | Normal | 0.9400 | 0.9727 | 0.9620 | 0.9673
SMOTE | Flooding | 0.9924 | 0.6001 | 0.6211 | 0.6104
SMOTE | Impersonation | 0.9521 | 0.3398 | 0.9303 | 0.4978
SMOTE | Injection | 0.9720 | 0.4105 | 1.00 | 0.5821
SMOTE | Normal | 0.9396 | 0.9897 | 0.8769 | 0.9299
Proposed Anomaly Detector | Flooding | 0.9930 | 0.9847 | 0.6430 | 0.7780
Proposed Anomaly Detector | Impersonation | 0.9571 | 0.4736 | 0.9402 | 0.6299
Proposed Anomaly Detector | Injection | 0.9834 | 0.9848 | 0.8679 | 0.9227
Proposed Anomaly Detector | Normal | 0.9917 | 0.9903 | 0.9851 | 0.9877
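As a reading aid for the per-class precision, recall and F1 values in Table 4, the short snippet below shows how such metrics can be computed with scikit-learn. It is a hedged illustration only: the label lists are placeholders and do not reproduce the AWID experiments or the paper's evaluation code.

```python
# Illustrative per-class precision/recall/F1 computation (placeholder labels only).
from sklearn.metrics import precision_recall_fscore_support

classes = ["Flooding", "Impersonation", "Injection", "Normal"]
y_true = ["Normal", "Flooding", "Injection", "Normal", "Impersonation", "Normal"]
y_pred = ["Normal", "Flooding", "Injection", "Flooding", "Impersonation", "Normal"]

# average=None (the default here) returns one score per class, as in Table 4.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0
)
for cls, p, r, f in zip(classes, precision, recall, f1):
    print(f"{cls:14s} precision={p:.4f} recall={r:.4f} F1={f:.4f}")
```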
Table 5. Metrics comparisons among proposed model and others.

Method | Algorithm | Accuracy (±0.5%) | Precision (±0.5%) | Recall (±0.5%) | F1 (±0.5%)
Linear model | OCSVM | 0.6542 | 0.6953 | 0.6512 | 0.6725
Ensemble | IF | 0.7911 | 0.8483 | 0.7192 | 0.7784
Proximity | LOF | 0.6834 | 0.7923 | 0.6597 | 0.7189
Proximity | KNN | 0.7808 | 0.8933 | 0.6706 | 0.7661
NN methods | VAE | 0.7912 | 0.8311 | 0.7018 | 0.7610
NN methods | MLP | 0.7966 | 0.8679 | 0.6647 | 0.7529
GAN | WGAN | 0.6149 | 0.8279 | 0.6650 | 0.7376
RL methods | Plain-AD | 0.7630 | 0.8902 | 0.7829 | 0.8331
RL methods | Multi-AD | 0.7265 | 0.8882 | 0.7265 | 0.7987
RL methods | A2C | 0.7905 | 0.8930 | 0.7905 | 0.8386
Proposed Anomaly Detector | | 0.7920 | 0.9110 | 0.7901 | 0.8463
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
