A Novel Virus Capable of Intelligent Program Infection through Software Framework Function Recognition

Guo, Wang; Shu, Hui; Gu, Yeming; Huang, Yuyao; Zhao, Hao; Li, Yang

doi:10.3390/electronics12020460


by Wang Guo, Hui Shu *, Yeming Gu, Yuyao Huang, Hao Zhao and Yang Li

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China

* Author to whom correspondence should be addressed.

Electronics 2023, 12(2), 460; https://doi.org/10.3390/electronics12020460

Submission received: 19 December 2022 / Revised: 13 January 2023 / Accepted: 13 January 2023 / Published: 16 January 2023

(This article belongs to the Special Issue AI in Cybersecurity)


Abstract: Viruses are one of the main threats to the security of today’s cyberspace. With the continuous development of virus and artificial intelligence technologies in recent years, making viruses intelligent has become a trend, and it is urgent to study and combat intelligent viruses. In this paper, we design a new type of confirmatory virus from the attacker’s perspective that can intelligently infect software frameworks. We take software built on a software framework as the target and use BCSD (binary code similarity detection) to identify the framework. By incorporating a software framework functional structure recognition model in the virus, the virus is enabled to intelligently recognize software framework functions in executable files. This paper evaluates which BCSD model is suitable for a virus to carry and constructs a lightweight BCSD model with a knowledge distillation technique. It also proposes a software framework functional structure recognition algorithm, which effectively reduces the dependence of recognition precision on the BCSD model. Finally, this study discusses future research directions for intelligent viruses. This paper aims to provide a reference for research on detection technology for possible intelligent viruses, so that focused and effective defense strategies can be proposed and the technical system of malware detection can be reinforced.

Keywords: virus infection technology; binary code similarity detection technology; software framework function recognition; malware detection

1. Introduction

The infection technique is a crucial part of an infecting virus that enables the virus to self-replicate and infect other software. The activation mechanism of the virus is the key to infection, which has evolved from tampering with the entry point of the targeted executable, to hooking key API functions called by the executable, and to modifying key instructions with a high level of covertness and triggering flexibility in the executable. Nowadays, viruses usually combine obfuscation and encryption techniques to achieve infection. However, today’s viruses have difficulty fighting against increasingly powerful dynamic and static virus detection techniques such as signature detection, calibration and testing, as well as behavioral surveillance [1]. In order to circumvent detection, virus developers will continue to refine virus infection techniques to improve the covertness and flexibility of virus infection. As artificial intelligence technology develops, some virus developers begin to empower the attack chain and enhance the precision of the attack, so as to improve the efficiency and success rate of the attack [2]. This paper studies new virus infection techniques from the virus developers’ perspective, aiming to provide a reference for the research of detection technology for possible intelligent viruses. It will facilitate the investigation of the corresponding detecting and combating methods, and contribute to the development of virus detection technology.

The target of the proposed intelligent virus is the executable based on the software framework, as it contains framework functional modules, whose structure is relatively fixed and is thus easier for the virus to identify and infect. However, as software frameworks are constantly iterating and updating, if one uses feature codes and pattern matching to identify and locate the functional modules of software frameworks, the results can be easily invalidated due to changes in framework versions and compilation environments, etc. Therefore, artificial intelligence techniques with generalized learning capabilities are introduced as the basis for framework functional module identification.

Currently, BCSD technology has been widely used in the field of software analysis, and many BCSD models with excellent performance have emerged. In recent years, artificial intelligence models such as Asm2Vec, SAFE, PalmTree and UniASM are important achievements in the application of artificial intelligence technology in the field of binary analysis. They all achieve binary code similarity detection by converting binary functions into embedding. Based on this mechanism and taking into account the virus’ requirements for generality, this paper optimizes the model in terms of efficiency and effectiveness, and designs a software framework functional structure recognition algorithm to realize the perception of framework functional structure in the target software. Unlike traditional viruses that use PE file structure, pattern matching feature location or other infection methods, virus developers can use intelligent viruses carrying functional structure recognition models to achieve accurate infection, and such a design is completely feasible in theory. The study takes the virus developers’ stance, empowers virus infection techniques with the BCSD model and enables viruses to identify and infect the software framework’s functional structure.

In summary, virus designers will focus on the development of virus technology that uses intelligent techniques to achieve infection. The study of artificial intelligence models for binary code similarity analysis is one of the research hotspots, but is only applied to the field of binary code software analysis, while virus developers are likely to combine the similarity model with virus technology to develop a new type of virus capable of intelligent infection.

This paper analyzes the development direction of virus infection technology according to the current status of virus attack, and discusses the feasibility and advantage of a virus carrying an intelligent model from the attacker’s perspective. It comprehensively analyzes the difficulties in designing new intelligent virus technology in terms of software framework functional structure extraction, BCSD model downsizing and the functional structure identification algorithm design of the software framework. This paper also puts forward possible solutions and proposes an experimental-based virus infection model that is capable of software framework functional intelligent recognition.

The main contributions of the work in this paper include:

(i) Proposing a functional structure recognition algorithm for software frameworks. Within the algorithm, the similarity results of the individual functions will generate the framework function’s comprehensive function similarity through a relatively tolerant subgraph matching method, thus greatly reducing the dependence of framework function recognition on the precision of intelligent models.

(ii) Proposing a virus infection technique carrying a lightweight BCSD model that can precisely infect the framework function. Thanks to the generalization ability of the intelligent model, the intelligent virus technology can achieve not only the infection of known versions of software frameworks, but also, to a certain extent, the identification and infection of unknown versions of frameworks.

(iii) Formally presenting the model proposed in this paper, outlining the overall workflow of virus technology carrying a lightweight BCSD model, describing in detail the algorithm for identifying the functional structure of software frameworks in viruses based on the similarity detection model and providing a specific infection example.

The remainder of this paper is structured as follows: Section 2 presents the research background and related work. Section 3 analyzes the difficulties of the work in this paper and the corresponding solutions. Section 4 describes the proposed intelligent virus technique carrying a structure-aware model in terms of neural network model selection for function similarity detection, model downsizing and the framework functional structure recognition algorithm. Section 5 introduces, through examples, how the intelligent virus infects framework software. Section 6 evaluates the infection effectiveness and operational efficiency of the intelligent virus. Section 7 discusses the possible development directions of virus technologies and Section 8 concludes the paper.

2. Background and Related Work

Nowadays, intelligent technology-enabled network attack and defense research is a hot issue in the field of cyber security, as shown in Table 1, which contains some of the major current research results on both offensive and defensive aspects of intelligent technologies.

This paper studies the infection techniques of the novel virus that carries BCSD models, and presents the background and related work from the aspects of software frameworks, the development of virus techniques, and artificial intelligence models for binary code similarity detection. Firstly, we describe the definition and characteristics of software frameworks and explain the reasons why the virus is targeting the software framework; then, we describe the virus technology by layer and demonstrate the specific innovations of the proposed novel technology; finally, we introduce the currently popular artificial intelligence model for binary code similarity detection and analyze its value in the localization of the virus activation point.

2.1. Software Framework

A software framework often refers to a software component architecture designed to support and enable the integration and interoperability of components. It provides users with a set of software processing flows and supportive libraries [9]. In object-oriented programming, software frameworks consist of abstract and concrete classes, and instantiation includes the combination and subclassing of existing classes [10]. Currently, there are a large number of software frameworks based on different platforms and different high-level languages, for example, MFC, Qt and other graphical interface software frameworks, as well as OpenSSL and other software frameworks with a certain framework control structure. Software frameworks generally have features such as Common Libraries and extensibility. A Common Library is a collection of functional modules of the software framework. These modules have a relatively fixed structure and interoperate with other modules in the library or their subsets through interfaces [11]. Extensibility means that users can extend the framework by reloading the functional modules in the Common Library. At the same time, software developed by reloading the software framework gains credibility more easily [11,12]. From the attacker’s perspective, the relatively fixed functional structure of software frameworks can provide a large number of stable virus activation points. Thus, they are very suitable as targets for viruses to achieve covert infection. When a virus infects software developed on a software framework, the virus code can be triggered covertly in the standardized execution process of the framework. This is the main reason why the software framework is chosen as the infection target for the proposed intelligent virus.

2.2. Virus Infection Technology

Virus infection technologies are constantly evolving and iterating [13]. A large number of published virus analyses have shown that the covertness of virus infection technology is very important [14,15]. Infection was initially implemented by changing the entry point address of an executable to the address of virus code appended to the file. Jumping to the virus code by modifying the entry point instruction was also a major infection method. As infection techniques developed, IAT hooking became widely used; it executes the virus code attached to the file by hijacking the address of a function in the import table of an executable. Bundled infection takes a different route: it bundles virus files in an executable to acquire control of the program. These traditional infection methods can no longer escape current virus detection techniques. Therefore, with the development of artificial intelligence technology, virus developers are also empowering viruses with it, as can be seen in DeepLocker, an artificial intelligence-enabled piece of malicious code launched in 2018 [16]. Following this trend in virus infection technology, the new virus infects control nodes in the targeted software that are covert and can control activation. This gives the virus high invisibility and activation controllability, and greatly strengthens its viability. Accordingly, this paper proposes a new intelligent virus targeted at software frameworks and empowers its infection with the BCSD model, aiming to provide a new reference for the defense against new intelligent viruses and thus contribute to the development of cybersecurity.

2.3. Binary code Similarity Detection Models

In recent years, numerous deep learning models have emerged to solve problems related to binary function similarity detection [17]. To achieve higher effectiveness, one of the current advanced solutions is to compare function similarity through embeddings, by transforming functions into code embeddings or graph-structure embeddings. This study focuses on binary code similarity detection models based on code embedding. Asm2Vec [18] is a similarity detection model that treats assembly code as text and adapts the CBOW model [19] of the existing natural language processing technology Word2Vec. It does not require prior knowledge such as compiler optimization settings as input, but it is weak at detecting the similarity of code across different architectures. Compared with Asm2Vec, SAFE, a model based on the self-attentive mechanism [20], specializes in cross-architecture function similarity identification. The Transformer-based PalmTree [21] is currently an advanced pretrained model. With a comparatively simpler neural network structure than that of SAFE, it can achieve high precision in binary similarity detection. The proposed intelligent virus uses the BCSD model to recognize the framework functional structure in the targeted software and thereby infect the target’s framework control process.

3. Challenges and Solutions

Figure 1 shows the operating flow of the proposed confirmatory virus, which carries a lightweight software framework function-aware model. The confirmatory virus targets a software framework, identifies its functional structure through an artificial intelligence model and a structure recognition algorithm, and then achieves precise infection based on the available virus activation point information of the functional structure. The virus model consists of three main components: Micro-Disassembler, a virus activation point localization model, and Infector. Micro-Disassembler is a disassembly module; the virus activation point localization model is a software framework functional structure recognition model based on a lightweight BCSD model; and Infector is a virus infection module. The virus activation point localization model can be represented by Equation (1), which consists of a lightweight BCSD model and a framework functional structure recognition algorithm.

$$\text{Virus activation point localization model} = \{\text{Portable similarity model}, \text{Structure Recognition Algorithm}\} \tag{1}$$

The designed intelligent virus first disassembles the target file, which is developed on a software framework, through Micro-Disassembler. It then inputs the disassembled functions into the function similarity detection model to derive their embeddings and the cosine similarity between them and the function embeddings in the FrameFuncSet. Next, it recognizes the structure with the functional structure recognition algorithm and outputs the available virus activation points according to the infectable instructions in the framework function set. Finally, the Infector completes the infection of the target at these activation points.

3.1. Challenges

Based on the requirements of virus technology and the characteristics of the artificial intelligence model, this paper analyzes the difficulties in implementing the infection from the virus developers’ perspective, taking into account three different aspects: the abstraction of the software framework functional structure; the optimization of the artificial intelligence model; and the precise location strategy of virus infection points.

(i) Software framework functional structure abstraction: In the process of locating the software framework functional structure, a framework functional structure standard needs to be formed as a reference for the artificial intelligence model’s recognizing work. Therefore, the abstraction of the corresponding functional modules of the targeted software framework is a problem that needs to be solved.

(ii) Artificial intelligence model selection and optimization: The current neural network model for similarity detection pursues higher effectiveness, resulting in higher level of model complexity and parameter precision. Thus, the resulted model is not able to meet the demand of virus software, which requires a small size and covert execution, in terms of structure, volume and computational complexity. Therefore, another challenge emerges: transforming artificial intelligence models into lightweight models suitable for viruses, while satisfying the precision of localization and adapting to the sensitivity of viruses to volume.

(iii) Precise recognition of framework functional structure: In software developed from software frameworks, user code is usually deeply coupled with the framework structure in the software, so simple feature code matching can hardly achieve the localization of the framework functional structure interwoven in the software. In addition, for infection methods based on analysis of the target’s functional framework, to achieve accurate localization of framework functional code under multiple uncertainties (e.g., different compilation optimization levels, software framework versions, etc.) is a difficult task.

3.2. Solutions

According to the analysis in the previous section, when designing a virus carrying a software framework functional structure recognition model, it is necessary to consider both the precision with which the functional modules are recognized, so that the virus can lurk in the software framework’s native control flow and achieve a more covert infection than traditional infection methods, and the optimization of the artificial intelligence model in terms of size and computing overhead. To address the above challenges, the following possible solutions are proposed from the attacker’s perspective.

(i) By analyzing the operation mechanism and main functions of the targeted software framework, a framework function set is proposed to represent the framework functions. Feature extraction is performed on the established framework functional structure of the targeted framework. Function information, structure information and virus available activation point information, etc., in the framework functional structure are extracted separately, which together form a framework function’s function set. A software framework can be composed of several framework function sets.

(ii) Since covertness is very important for viruses, the intelligent model carried by the virus must be both highly precise and of small size. Firstly, by evaluating different binary similarity models in terms of model complexity and detection precision, we select a model suitable for lightweight processing; then, we implement lightweight framework structure-aware models through the knowledge distillation method.

(iii) A localization algorithm is proposed aimed at the software framework functional structure to realize the precise location of infectable functional nodes in the target. Taking the similarity detection results of individual functions output by the intelligent model as the basis, and through recognizing the subgraph of the software framework functional module in the targeted software, the algorithm achieves the precise localization of functional modules in the target.

As stated above, this paper designs a recognition algorithm for a software framework functional structure by improving the binary code similarity detection model, and develops a confirmatory virus capable of software framework functional structure recognition.

4. Software Framework Function-Aware Virus Infection Technique

4.1. Software Framework Functional Structure Set

Software frameworks increase their modularity by encapsulating unstable implementation details behind stable interfaces. These encapsulated modular functions are the unchangeable features of software frameworks, called frozen spots: stable code structures with relatively fixed functions and control flow that provide the framework functionality required in the development process [22]. Meanwhile, frameworks generally provide interfaces to existing framework features, called hot spots, which can be used to extend the framework to meet specific application requirements [23]. In Figure 2, the left side shows the main call structure of the function CDialog::DoModal of the MFC software framework, which mainly contains framework class functions. This structure is abstracted into (a), which represents the frozen spots of the framework. The open interfaces of the framework functions and the structures that users generate through custom subclasses are hot spots, where developers add custom functions according to actual needs. The red nodes in (b) are custom functions; the developer can customize functionality by calling the framework functional structure in (a).

For a software framework, $FrameFuncSet$ can be proposed as the set of functional structural features of that framework, whose formal expression is given by Equation (2).

$$FrameFuncSet = \{Func_{1}Set, Func_{2}Set, \dots, Func_{i}Set\} \quad (i = \text{total number of functions}) \tag{2}$$

The $FrameFuncSet$ corresponding to a software framework is composed of $i$ function sets $(Func_{i}Set)$. An individual function set is formally represented by Equation (3).

$$Func_{i}Set = \{FrmFunc, \Gamma, \mathrm{P}, \Theta\} \tag{3}$$

$$FrmFunc = \{FrmFunc_{R}, FrmFunc_{1}, FrmFunc_{2}, \dots, FrmFunc_{u}\} \tag{4}$$

$$\Gamma = \{\Gamma_{R}, \Gamma_{1}, \Gamma_{2}, \dots, \Gamma_{u}\} \tag{5}$$

$$\Gamma_{k} = \beth(FrmFunc_{k}\ Assembly) \tag{6}$$

$FrmFunc$ is the set of functions contained in the functional structure of the framework, which consists of one root function $FrmFunc_{R}$ and $u$ related functions, as shown in Equation (4). $\Gamma$ is the set of embeddings of the functions in $FrmFunc$, and this set is the control group in similarity detection. As shown in Equation (5), $\Gamma$ is in one-to-one correspondence with $FrmFunc$. Equation (6) gives the transformation in $\Gamma$ from function assembly to embedding: $\beth()$ is the computational procedure by which the similarity model transforms a function assembly into an embedding, where $k \in [1, u]$.

$\mathrm{P}$ records the virus activation point instruction information in $FrmFunc_{R}$, treating the instructions related to the jump as available virus activation instructions. By modifying the jump address of the infectable instruction during infection, it enables the software to trigger the corresponding virus function when calling the framework function.

$\Theta$ is a directed functional structure graph starting with $FrmFunc_{R}$. The edges of the graph are direct or indirect call relationships. There are several possible cases of call relationships between functions in the software framework function set. Figure 3 shows the possible call paths between two functions, denoted by $e(src, via, dest)$, where $src$ denotes the caller S, $dest$ denotes the called party D in the call path, and $via$ denotes the intermediate function V in the call path. $e$ denotes a directed call path from $src$ to $dest$. To compose $\Theta$, irrelevant functions in the function call path of the framework function set are ignored. For example, there are two paths from $FrmFunc_{S}$ to $FrmFunc_{D}$ in $Func_{i}Set$: $e_{1}(FrmFunc_{S}, VIA_{1}, FrmFunc_{D})$ and $e_{2}(FrmFunc_{S}, VIA_{2}, FrmFunc_{D})$, ignoring the irrelevant functions in the process $VIA$ and $FrmFunc_{S}$.
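The idea of keeping only direct or indirect call relationships between the functions of interest, while dropping irrelevant intermediate functions, can be illustrated on a generic call graph. The following is a minimal sketch in Python, not the authors' implementation: the dictionary representation of the call graph, the function name `contract_call_graph` and the example node names are all assumptions made for illustration.

```python
from collections import deque

def contract_call_graph(call_graph, relevant):
    """Return the set of (src, dest) edges between relevant functions.

    call_graph: dict mapping a function name to the functions it calls
                (hypothetical representation chosen for this sketch).
    relevant:   set of function names to keep.
    An edge (src, dest) is produced when dest is called by src directly,
    or through a chain made only of non-relevant intermediate functions.
    """
    edges = set()
    for src in relevant:
        seen = {src}
        queue = deque(call_graph.get(src, ()))
        while queue:
            fn = queue.popleft()
            if fn in seen:
                continue
            seen.add(fn)
            if fn in relevant:
                # The path ends at the first relevant function it reaches.
                edges.add((src, fn))
            else:
                # Irrelevant intermediate: keep following its callees.
                queue.extend(call_graph.get(fn, ()))
    return edges

# Two paths from S to D through different intermediates collapse to one edge.
example_graph = {"S": ["V1", "V2"], "V1": ["D"], "V2": ["X"], "X": ["D"], "D": []}
example_edges = contract_call_graph(example_graph, {"S", "D"})
```

In this sketch both paths from `S` to `D` are reduced to the single edge `("S", "D")`, which mirrors how the intermediate functions are ignored when the structure graph is composed.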

4.2. Software Framework Functional Structure Recognition Model

The software framework functional structure recognition model consists of a lightweight BCSD model and a framework functional structure recognition algorithm. According to the technical difficulties and solution presented in the previous section, it is known that the similarity detection model needs to have ➀ a certain similarity precision rate, ➁ a simple model structure and the lowest possible operational overhead. Therefore, firstly, a binary similarity model with high precision for software framework recognition is selected, which is followed by further reduction in the model’s size and complexity. Meanwhile, the framework functional structure recognition algorithm based on the results of the similarity detection model is designed according to the characteristics of the software framework functional structure to achieve accurate recognition.

4.2.1. Evaluation of Binary Code Similarity Detection Models

In this section, we analyze the structure, principles and characteristics of the models; select a model suitable for the intelligent virus to carry; and then downsize the performance parameters of the model, including structural complexity, parameter size and operation time, through model knowledge distillation to construct a lightweight BCSD model that meets the needs of intelligent viruses. The binary code similarity models evaluated in this paper include Asm2Vec, SAFE, PalmTree, UniASM [24], etc., each of which adopts different neural network structures and implementation methods to accomplish BCSD.

Asm2Vec is a BCSD model based on the PV-DM model, which generates instruction vectors through neural networks to quantitatively measure the relationship between instructions to form the embedding of functions. It is good at excavating the lexical semantics of assembly instructions; however, the operation process treats a single instruction as three tokens, and data such as addresses as constants. Moreover, it relies on the name information of key functions. SAFE is a model built on Skip-Gram method and Self-Attention network, which contains a multilayer RNN neural network. Its operation process realizes the vectorization of instructions with the help of a large scale of vocabulary, so it is difficult to be used for viruses. PalmTree is the first model that applies BERT to instruction embedding, achieving BCSD through neural networks built by a multilayer Transformer Encoder. However, PalmTree by default uses a 12-layer Transformer network, so the computational efficiency is low and not suitable for viruses to carry. UniASM is the first UniLM-based binary function similarity detection model, which achieves function embedding with a self-attention network composed of a four-layer bidirectional Transformer. It uses vocabulary to vectorize instructions in the operation process, and then achieves function embedding through a neural network. In the disassembly preprocessing phase, UniASM can complete instruction standardization by simple principles and is more suitable as a model for viruses to carry.

According to the above comparison, UniASM is the most suitable similarity detection model in this group for viruses to carry. As shown in Figure 4, the instructions of the function are first normalized to remove noise words and mitigate the out-of-vocabulary problem. Instruction vectorization is then completed based on the principle that one instruction produces one token. Next, a function is converted into a sequence of tokens using a simple linear serialization method. Finally, this sequence is input into UniASM, which outputs the embedding corresponding to the function. The cosine similarity between two function embeddings is calculated as the function similarity predicted by the model. The model is constructed by combining SimBERT and UniLM.
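The final comparison step, cosine similarity between two function embeddings, is a standard vector operation. The following minimal Python sketch shows it on plain lists of floats; the helper names `cosine_similarity` and `best_match`, the zero-vector convention and the toy vectors are assumptions made for illustration and are not taken from UniASM.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)

def best_match(query, references):
    """Return (name, score) of the reference embedding closest to query.

    references: dict mapping a label to an embedding (hypothetical layout).
    """
    scored = ((name, cosine_similarity(query, emb))
              for name, emb in references.items())
    return max(scored, key=lambda pair: pair[1])

# Toy 2-dimensional embeddings, purely illustrative.
refs = {"func_a": [0.0, 1.0], "func_b": [2.0, 0.1]}
name, score = best_match([1.0, 0.0], refs)
```

Because cosine similarity depends only on direction, embeddings that differ by a positive scale factor are treated as identical, which is why the score is commonly used to compare learned embeddings.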

4.2.2. Lightweight Binary Code Similarity Detection Model

To achieve a higher level of precision, current work on binary code similarity detection based on artificial intelligence usually requires highly complex neural network models with large parameter matrices and complex network structures. According to the results reported for Raphael Tang’s method, the BiLSTM model obtained by knowledge distillation of the BERT model has 349 times fewer parameters (0.96 M instead of 335 M) and runs 434 times faster. In this paper, we use the knowledge distillation [25] compression method to implement a lightweight BCSD model by transferring the knowledge of the function similarity detection model from UniASM to a shallow neural network, a BiLSTM.

The basic principle of the knowledge distillation compression method is to treat the original model as the Teacher Model (Net-T) and to build a simple neural network as the Student Model (Net-S). Ground-Truth is the true label of the training data. Net-T is the result after complete training by Ground-Truth. Net-S learns both the logit of Net-T and Ground-Truth, and finally obtains Net-S, which is a simple model inheriting the generalization ability of Net-T [26]. The process of knowledge distillation is shown in Figure 5.

For Net-T, we use the UniASM model that is pretrained and without fine-tuning, which consists of multiple transformer layers stacked on top of each other, to achieve an advanced level in function similarity detection tasks. After being input to UniASM, the function assembly first goes through normalization and tokenization to generate tokenized assembly instructions, and then through the token embedding layer, self-attention layer and function embedding layer sequentially to finally output the function embedding.

In contrast, our student model is a single-layer BiLSTM with a nonlinear classifier. After feeding the function instruction vector into the BiLSTM, the hidden states of the last step in each direction are concatenated and fed into a fully connected layer with rectified linear units (ReLUs), whose output is then passed to a softmax layer for classification.

In order to control the smoothness of the softmax output during knowledge distillation, the distillation temperature T (an integer greater than 1) is used to amplify the differences between pieces of knowledge that differ only slightly and to improve the precision of the model after knowledge distillation. The softmax formula with the distillation temperature T is shown in Equation (7).

$$q_{i} = \frac{\exp\left(\frac{z_{i}}{T}\right)}{\sum_{j} \exp\left(\frac{z_{j}}{T}\right)} \tag{7}$$

$z_{i}$ is the logit output by Net-T, and $T$ is the distillation temperature. When $T = 1$, Equation (7) is the standard softmax formula. The higher $T$ is, the smoother the output probability distribution of softmax and the greater its entropy; the information carried by the negative labels is then relatively amplified, and the model training pays more attention to the negative labels.
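The smoothing effect of the temperature in Equation (7) can be checked numerically. The sketch below is a plain-Python illustration; the function names `softmax_with_temperature` and `entropy` and the sample logits are assumptions, and the maximum is subtracted before exponentiation only for numerical stability, which does not change the result.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Equation (7): softmax of logits divided by the temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

sample_logits = [4.0, 1.0, 0.5]  # illustrative values only
sharp = softmax_with_temperature(sample_logits, temperature=1.0)
smooth = softmax_with_temperature(sample_logits, temperature=4.0)
```

With these sample logits, `smooth` assigns more probability to the two smaller logits than `sharp` does and has higher entropy, while the ranking of the classes is unchanged.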

$q_{i}$ is the probability distribution of the predicted outcome. The training process of Net-S is shown in Equation (8).

L = α L_{soft} \cdot T^{2} + (1 - α) L_{hard}

(8)

This loss function is a weighted sum of $L_{soft}$ (the difference between the output of Net-S and the output of Net-T) and $L_{hard}$ (the difference between the output of Net-S and the Ground-Truth), where $α$ is the distillation loss weight. The $L_{soft}$ term needs to be multiplied by $T^{2}$ to keep the gradients consistent. $L_{soft}$ is represented as the squared L2 distance between the logits, and $L_{hard}$ is represented in the form of cross-entropy, as shown in Equations (9) and (10).

L_{soft} = {||z^{(T)} - z^{(S)}||}_{2}^{2}

(9)

L_{hard} = - \sum_{i} t_{i} \log q_{i}^{(S)}

(10)

In Equation (9), $z^{(T)}$ and $z^{(S)}$ are the output logits of Net-T and Net-S, respectively. In Equation (10), $t$ is the Ground-Truth label and $q_{i}^{(S)}$ is the predicted probability of Net-S. The final form of the loss function is Equation (11).

L = α {||z^{(T)} - z^{(S)}||}_{2}^{2} \cdot T^{2} - (1 - α) \sum_{i} t_{i} \log q_{i}^{(S)}

(11)
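As a worked illustration of Equation (11), the following sketch evaluates the loss for a single training example. It assumes a one-hot Ground-Truth label (given as an index) and a standard softmax for $q^{(S)}$; the function names and the default values of α and T are hypothetical and are not the settings used in this paper.

```python
import math

def softmax(logits):
    """Standard softmax, used here for the student probabilities q^(S)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(z_teacher, z_student, target_index, alpha=0.5, T=2.0):
    """Equation (11): alpha * ||z_T - z_S||^2 * T^2 - (1 - alpha) * sum_i t_i log q_i^(S).

    Assumes a one-hot label t, so the cross-entropy reduces to
    -log q^(S)[target_index].
    """
    l_soft = sum((zt - zs) ** 2 for zt, zs in zip(z_teacher, z_student))
    l_hard = -math.log(softmax(z_student)[target_index])
    return alpha * l_soft * T ** 2 + (1.0 - alpha) * l_hard

# Hypothetical logits for a three-class example.
loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], target_index=0)
print(loss)
```

When the student logits equal the teacher logits, the soft term vanishes and only the weighted cross-entropy against the Ground-Truth remains.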

Finally, this paper constructs the model Net-S through knowledge distillation. Compared with the original model, Net-S has about 270 times fewer parameters and is about 40 times faster at inference for the same task. These results suggest that the BiLSTM-based Net-S can meet the virus’s requirements for an intelligent model.

4.3. Software Framework Structure Recognition Algorithm

Although knowledge distillation simplifies the lightweight intelligent model and reduces its computational complexity, it also lowers the precision of similarity detection. Therefore, a software framework structure recognition algorithm is proposed to improve the recognition precision of framework functions. First, the principle of the recognition algorithm is explained from a mathematical point of view.

Suppose there are $J$ sets of tasks of calculating the similarity of different functions. The original model generates a similarity result of $S_{j}$ $(S_{j} \in (\frac{1}{2}, 1), j = 1, 2, \dots, J)$, and the similarity obtained by the model after knowledge distillation for the same function pairs is $N_{j} = λ S_{j}$ $(λ \in (0, 1), j = 1, 2, \dots, J)$. The combined similarity for the $J$ sets of different functions is then given by Equation (12).

Similarity = 1 - \prod_{j = 1}^{J} (1 - N_{j})

(12)

As shown in Figure 6, for $N_{j} \in (0.5, 1)$, when the mean value of $N_{j}$ is 0.6 or 0.7, $Similarity$ gradually converges to 100% as $J$ increases. Accordingly, when the precision of similarity detection for a single function is not high, the combined similarity can be improved by performing similarity detection on a series of functions. This is the principle of the similarity recognition algorithm for the set of functional structures in the software framework proposed in this paper.
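The convergence described by Equation (12) can be checked numerically. The sketch below uses a constant per-function similarity of 0.6 purely as an illustrative assumption; the function name is hypothetical.

```python
def combined_similarity(similarities):
    """Equation (12): Similarity = 1 - prod_j (1 - N_j)."""
    remaining = 1.0
    for n in similarities:
        remaining *= (1.0 - n)
    return 1.0 - remaining

# Hypothetical case: every per-function similarity N_j equals 0.6.
for J in (1, 3, 5, 10):
    print(J, combined_similarity([0.6] * J))
```

For example, with five functions each at 0.6, the combined value is 1 - 0.4^5 = 0.98976, which shows how several moderately similar functions together give a high combined similarity.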

As shown in Figure 7, based on the results of the similarity detection model, the algorithm extracts from the control flow graph (a) the functions that have certain similarity with the frame functional structure’s control flow graph (c) as nodes. With a tolerance of κ, if these nodes exhibit a subgraph structure similar to (c), as demonstrated in subgraph (b), the structure can be considered a framework functional structure of the target.

The software framework structure recognition Algorithm 1 is as follows.

Algorithm 1: Framework Structure Recognition Algorithm

Input: $G$, $FrameFuncSet$
Output: InfectionInfo = {InfectableIns_1, InfectableIns_2, …}
Definition:
    $G$: control flow graph of the software; the nodes are functions and the edges are function call relations. Entry is the entry point of the CFG
    $κ$: the tolerance of function call path depth
    $g_{r}^{d}$: the subgraph whose root function is $r$ and depth is $d$
    $ξ$: function similarity threshold
    $χ$: threshold of function nodes with similar subgraph structure. When the number of similar functions is greater than or equal to $χ$, it is considered that the functional structure of the corresponding software framework is identified
procedure RecInFrmFunc($func$, $FrameFuncSet$)
    $Embedding_{func}$ = Generate_Emb($func$) // generate embedding
    For $FuncSet$ in $FrameFuncSet$
        For $Embedding_{frmfunc}$ in $FuncSet \to FunctionEmbeddings$
            IF Cos($Embedding_{func}$, $Embedding_{frmfunc}$) > $ξ$ // compare function similarity
                return TRUE
    return FALSE
end procedure
Entry ← G // get the entry point of the software control flow graph
Foreach func in G // using DFS from Entry
    MatchedFunc = 0 // number of similar functions in the subgraph
    IF RecInFrmFunc(func, $FrameFuncSet.{FrmFunc}_{R}$)
        r = func
        MatchedFunc += 1
        IF κ > 0
            Foreach sub_func in G // using DFS from func with depth κ
                IF RecInFrmFunc(sub_func, $FrameFuncSet.{FrmFunc}_{u}$)
                    MatchedFunc += 1
                END IF
        END IF
    END IF
    IF MatchedFunc $\geq χ$
        InfectionInfo.Append($FuncSet \to P$)
    END IF
Return InfectionInfo

Algorithm 1 traverses all functions $func$ in the target software by DFS and calculates the similarity between $func$ and ${FrmFunc}_{R}$ in $FrameFuncSet$. When the similarity reaches the threshold $ξ$, the algorithm uses $func$ as the root function, traverses the subgraph $g_{func}^{κ}$ in the control flow graph of the target software and identifies the framework functional structure subgraph in the target software by the subgraph structure matching algorithm. If the number of similar functions in the subgraph is greater than or equal to $χ$, the functional structure corresponding to ${FrmFunc}_{R}$ in $Func_{i}Set$ is considered identified. The algorithm then returns the $P$ of this functional structure as the available activation points of the virus.

5. Case Study

Taking the MFC software framework as the target framework, this section describes an infection case of the sample “A” through the proposed intelligent virus technology. The information of A is outlined in Table 2. The intelligent virus carries a framework structure recognition algorithm based on a lightweight BCSD model; this model is obtained by knowledge distillation of UniASM. The virus function code is a Shellcode capable of popping up a MessageBox, which can show clearly whether the virus function is triggered.

First, the virus disassembles the target software with the mini-disassembly engine it carries. The engine disassembles the binary code with functions as the basic unit. Then, the virus traverses the control flow graph of the target software and feeds each disassembled function to the similarity model to obtain its embedding. Subsequently, it calculates the cosine similarity between the function embeddings of the target software and the framework function embeddings in the $FrameFuncSet$ that corresponds to the MFC framework. A target software function whose similarity is greater than ξ = 60% is taken as a similar function. In this example, the similarity detection results of the functions related to the CMFCMenuButton::OnKillFocus function structure are shown in Table 3. The similarity detection model identifies five functions with similarity greater than 60%. These five functions are all part of one framework functional structure, as shown in Figure 8.

According to the framework function structure recognition algorithm with κ = 4, the corresponding MFC framework functional structure is identified among the similar functions detected above, and the available virus activation points are obtained according to P. In this example, the virus identifies the “KillFocus” functional structure of the MFC framework in the target and obtains the available activation point information from the corresponding P in the $FuncSet$. The infector then tampers with these activation points. Figure 9 shows the change in the execution flow of the software after infection. The call 0x41792D instruction of the original framework function OnKillFocus has been modified so that the VirusCode stored in a gap in the target is triggered when the software executes the “KillFocus” function. During the execution of the infected software, the components that invoke these framework functions can be triggered selectively to execute the virus code. In this case, when the software GUI window loses focus, the virus code is triggered, i.e., a MessageBox pops up.

6. Experiments and Evaluation

This section validates the proposed intelligent virus model and evaluates its performance. The machine used for the experiments runs Windows 10 with an Intel Core i7-9700 CPU @ 3.00 GHz and 32 GB of RAM. The software frameworks selected for the test samples are the Windows GUI software frameworks MFC and QT, and the framework set data used by the similarity model in the experiments are from these two frameworks. The confirmatory virus consists of 0.956 kLoC of C++ code and is 0.73 MB in size; an infection is marked successful if the virus code is triggered. The experiments cover three main aspects: (1) evaluating the performance of the lightweight similarity detection model by comparing model attributes such as structure, parameter scale and similarity detection precision before and after knowledge distillation; (2) testing the intelligent virus’s ability to recognize the software framework functional structure in the target, with sample groups set by different criteria, and evaluating the precision of the proposed method on known and unknown versions of software frameworks from multiple dimensions; (3) comparing the accuracy of software framework functional module recognition between the intelligent infection method and the feature-code-based infection method.

6.1. Datasets and Indicators

6.1.1. Datasets

All experimental samples are built from the source code of various MFC and QT programs on GitHub and CodeProject. They cover different categories of functionality, including Password Manager, Word Catcher, Stream Analyzer, etc., and are compiled with different compilation options. There are 300 samples in total. The MFC framework samples combine framework version (9.0, 10.0, 14.2), compilation optimization level (Od, O1, O2, Ox) and framework type (single document, multidocument, dialog box), and are compiled with Visual Studio; there are 180 MFC framework samples in total. The QT framework samples combine version (4.8, 5.7, 6.0), compilation optimization level (O0, O1, O2, O3) and framework type (GUI, Non-GUI), and are compiled with QTCreator; there are 120 QT framework samples in total. The samples are summarized in Table 4.

6.1.2. Evaluation Metrics

For a $FrameFuncSet$ containing $k$ $Func_{i}Set$, the precision, recall and F1-score of $Func_{i}Set$ are:

prc(i) = \frac{TP_{i}}{TP_{i} + FP_{i}}, rec(i) = \frac{TP_{i}}{TP_{i} + FN_{i}}, f1(i) = \frac{2 \cdot prc(i) \cdot rec(i)}{prc(i) + rec(i)}

(13)

Here $TP_{i}$ (true positives) is the number of functions assigned correctly to $Func_{i}Set$; $FP_{i}$ (false positives) is the number of functions that do not belong to $Func_{i}Set$ but are assigned incorrectly to it; and $FN_{i}$ (false negatives) is the number of functions that are not assigned to $Func_{i}Set$ but actually belong to it. The average over all $k$ $Func_{i}Set$ is calculated to derive the whole program’s precision, recall and F1-score as

prc = \frac{1}{k} \sum_{i = 1}^{k} prc(i), rec = \frac{1}{k} \sum_{i = 1}^{k} rec(i), f1 = \frac{1}{k} \sum_{i = 1}^{k} f1(i)

(14)
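The metrics in Equations (13) and (14) can be computed as in the following sketch. The (TP, FP, FN) counts are made-up illustrative values, and returning 0.0 for an empty denominator is an assumption of this sketch rather than a convention stated in the paper.

```python
def prf(tp, fp, fn):
    """Equation (13): precision, recall and F1-score for one Func_i Set."""
    prc = tp / (tp + fp) if (tp + fp) else 0.0  # assumption: 0.0 if undefined
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prc * rec / (prc + rec) if (prc + rec) else 0.0
    return prc, rec, f1

def program_scores(counts):
    """Equation (14): average the per-set scores over all k sets."""
    k = len(counts)
    scores = [prf(tp, fp, fn) for tp, fp, fn in counts]
    return tuple(sum(s[i] for s in scores) / k for i in range(3))

# Hypothetical (TP, FP, FN) counts for k = 2 function sets.
example_counts = [(8, 2, 2), (5, 5, 0)]
prc, rec, f1 = program_scores(example_counts)
print(prc, rec, f1)
```

For these counts the per-set scores are (0.8, 0.8, 0.8) and (0.5, 1.0, 2/3), so the averaged precision is 0.65 and the averaged recall is 0.9.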

6.2. Comparison of Similarity Models before and after Knowledge Distillation

For the lightweight function similarity detection model, UniASM is taken as Net-T and combined with the Ground-Truth of the software framework functions to derive a BiLSTM-structured Net-S by knowledge distillation. To check whether the model obtained by knowledge distillation meets the requirements of the virus, the UniASM model before distillation and the BiLSTM network model after distillation are compared in terms of model composition, number of parameters, average precision of similarity detection for functions of different optimization levels and average time to generate embeddings for 50 functions. The results are shown in Table 5.

The Net-S model obtained by knowledge distillation is significantly smaller: the number of parameters is reduced from 27 M to 0.1 M, a reduction of about 270 times. The average detection precision for functions of different optimization levels drops from 82.3% to 70.7% (Table 5), but the operational efficiency is improved by about 42 times. These results show that knowledge distillation significantly reduces the model’s size and improves its computing efficiency, while the model retains a considerable level of precision in function similarity detection. Therefore, it can meet the requirements of the software framework functional structure recognition algorithm.

6.3. Ability to Recognize the Functional Structure of the Software Framework

This section tests the precision of the lightweight software framework function recognition model for known and unknown versions of software frameworks.

6.3.1. Ability to Recognize Known Versions of the Framework

In this set of experiments, the MFC framework samples are divided into three groups according to the framework version: 9.0, 10.0 and 14.2; similarly, the QT framework samples are divided into three groups: 4.8, 5.7 and 6.0. $FrameFuncSet$ is constructed with the functional structures of MFC and QT under the current version at the Od/O0 optimization level, and 100 functional structures are randomly selected from them as the functional structures to be identified. Functional structure identification is performed, and the precision and recall for each sample are calculated. The recognition results are shown in Table 6.

According to the data in Table 6, for the two software frameworks, the intelligent virus is strong in identifying known versions, with an average precision of at least 88%. For the groups whose precision is lower than 85%, manual analysis shows that changes in optimization levels and template classes lead to excessive changes in the functional structure, resulting in lower precision. In the virus infection process, it is enough to identify a small number of framework functional structures in the target, so an average identification precision of about 90% is satisfactory.

6.3.2. Ability to Recognize Unknown Versions of the Framework

In this experiment, the $FrameFuncSet$ is composed of all the functional structures of $M_{T}$ and $Q_{T}$, and 100 functional structures are randomly selected as the functional structures to be identified. Samples $M_{P}$ and $Q_{P}$ represent the unknown versions of the MFC and QT frameworks, respectively. The intelligent virus’ ability to recognize the framework structure is tested, and the recognition precision is shown in Table 7.

Table 7 shows that the virus’ average recognition precision for MFC 10.0 samples is at least 92%. The precision for version 14.2 samples is lower but remains above 80%; the precision for QT v5.7 samples is at least 90%; and the average precision for QT v6.0 samples also reaches 85%. These results show that the intelligent virus is still able to recognize most of the functional structures when faced with unknown versions of software frameworks. This is due to the intelligent model’s generalization ability, which enables it to recognize the functional structures even if the version is unknown.

6.4. Smart Infection Method vs. Traditional Infection Methods

To further verify the intelligent virus’ infection capability, the performance of the intelligent infection method is compared with that of the feature-code-matching infection method. The functions of MFC 10.0 are used to compose $FrameFuncSet$. The feature-code-matching infection method serves as the comparison group: the first 32 bytes of the 100 functions in the $FrameFuncSet$ are used as feature codes, and the feature codes are matched against the functions in the targeted software. All the MFC framework samples are taken as infection objects, and the numbers of functional structures located by the intelligent framework structure recognition method and by the feature-code-matching method are compared. The results are shown in Figure 10.

According to the results in Figure 10a, the intelligent virus is able to successfully infect all samples of versions 9.0 and 10.0, and it also completes the infection for more than 90% of the samples of version 14.2. The feature-code-matching method achieves a high success rate for version 10.0 samples and a success rate of more than 60% for the older version 9.0, while its success rate drops to less than 20% for the unknown version 14.2. In Figure 10b,c, the intelligent virus’ infection success rates for different optimization levels and sample types are significantly higher than those of the feature-code-matching method. Combining the results with the infection mechanisms, we conclude that because feature code matching can only identify established functions, it is difficult for it to maintain its success rate on unfamiliar framework versions. The structure recognition algorithm of the intelligent virus carrying the BCSD model has a certain generalization ability, so it can still complete the infection of unknown versions in most cases, even if only the information of a single software framework version is available.

7. Discussion

In this paper, we focus on the virus’ recognition and infection of the functional modules in software frameworks, which have relatively fixed structures. The extraction of frameworks in software is achieved by locating these functional structures. Only functions are extracted as elements of frameworks in this paper, although numerous other elements in frameworks could further improve the recognition precision of software frameworks. We will consider integrating data structures such as virtual function tables into the framework functional structure to improve the precision of framework structure recognition.

The BCSD technique is the current mainstream approach in the field of binary semantic analysis, and its powerful generalization ability empowers the virus to infect unknown versions of the framework. The lightweight similarity model obtained by knowledge distillation meets the virus’ requirements in terms of size and computing efficiency, and achieves high recognition precision when combined with the framework functional structure recognition algorithm. The intelligent method proposed in this paper may suffer from low confidence and poor interpretability of its detection results; however, compared with pattern matching, it can greatly reduce the rate of missed detections. From the industrial point of view, the involvement of intelligent models can make the detection results poorly interpretable, which may cause the intelligent virus to produce unintended problems in practical use. Therefore, improving the credibility of the model detection results and reducing the negative effects of poor interpretability is a focus of the next study.

In cyberspace, the offensive and defensive parties are always developing and checking each other. The technology itself is neutral. Both the offensive and defensive sides may apply advanced technology in their own fields, and both virus technology and virus detection technology are becoming more intelligent. For the confirmatory virus proposed in this paper, which can intelligently infect software framework functional modules, the following detection perspective can be considered: binary code similarity detection technology can be used to detect cases where a framework functional module in an executable has been tampered with by the virus. If unknown modifications to the software framework are detected, they may have been caused by infection.
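The detection perspective above can be sketched from the defender's side. The sketch assumes that embeddings of the framework functions are already available both for a known-good reference build and for the executable under inspection; the function names, the toy embedding vectors and the 0.9 threshold are all hypothetical and are not values from this paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def flag_modified_functions(reference, observed, threshold=0.9):
    """Return framework functions whose observed embedding is missing or
    deviates from the known-good reference embedding (possible tampering)."""
    suspicious = []
    for name, ref_emb in reference.items():
        obs_emb = observed.get(name)
        if obs_emb is None or cosine_similarity(ref_emb, obs_emb) < threshold:
            suspicious.append(name)
    return sorted(suspicious)

# Toy, hypothetical embeddings: FuncB differs from its reference.
reference = {"FuncA": [1.0, 0.0, 0.0], "FuncB": [0.0, 1.0, 0.0]}
observed = {"FuncA": [1.0, 0.0, 0.0], "FuncB": [1.0, 0.0, 0.0]}
flagged = flag_modified_functions(reference, observed)
print(flagged)
```

A flagged function is only a candidate for manual review; legitimate differences such as compiler optimization or framework updates can also lower the similarity.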

8. Conclusions

In this paper, we summarize the challenges of developing intelligent viruses and propose solutions accordingly. In order to realize viruses carrying intelligent models, we propose to implement the recognition of software framework functional structures by means of a lightweight binary similarity model. The conventional BCSD model is relatively large in size and inefficient in computing time, so the model downsizing is achieved by knowledge distillation. As the lightweight similarity detection model shows a low level of recognition precision, we design a comprehensive recognition algorithm based on the similarity of the framework functional structure, and thus enable the virus to complete precise infection of the software framework functional structure. In this paper, we design a complete confirmatory virus carrying a lightweight BCSD model, which can realize precise identification and infection of software framework functions. Compared with traditional infection methods, it has stronger generality and generalization ability, and is even able to complete the infection of partially unknown versions of the framework. The confirmatory virus proposed in this paper provides a new reference for the research of detection technology for possible intelligent viruses, which can help improve the study of detection techniques against such viruses.

Author Contributions

Conceptualization, W.G.; data curation, W.G.; formal analysis, W.G. and H.S.; investigation, Y.L.; methodology, W.G., H.S., Y.G., Y.H. and H.Z.; project administration, W.G. and H.S.; resources, W.G., Y.G., Y.H. and H.Z.; software, W.G. and Y.G.; supervision, H.S.; validation, W.G. and Y.H.; visualization, W.G.; writing – original draft, W.G.; writing – review and editing, W.G., H.S., Y.G., Y.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, Y.; Li, S.; Yuan, Y.; Liu, C.; Feng, W.; Yu, M. Research on Virus Detection Technology Based on Heuristic Model. In Proceedings of the 6th International Conference on Advanced Design and Manufacturing Engineering (ICADME 2017), Zhuhai, China, 23–24 July 2017; pp. 560–564.
2. Ling, X.; Wu, L.; Zhang, J.; Qu, Z.; Deng, W.; Chen, X.; Wu, Y. Adversarial Attacks against Windows PE Malware Detection: A Survey of the State-of-the-Art. arXiv 2021, in press.
3. Hu, W.; Tan, Y. Black-box attacks against RNN based malware detection algorithms. In Proceedings of the Workshops at the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 245–251.
4. Das, A.; Verma, R. Automated email generation for targeted attacks using natural language. arXiv 2019, in press.
5. Seymour, J.; Tully, P. Weaponizing data science for social engineering: Automated E2E spear phishing on Twitter. In Proceedings of the Black Hat, Las Vegas, NV, USA, 30 July–4 August 2016; pp. 1–8.
6. Homayoun, S.; Ahmadzadeh, M.; Hashemi, S.; Dehghantanha, A.; Khayami, R. BoTShark: A deep learning approach for botnet traffic detection. In Cyber Threat Intelligence; Springer: Berlin/Heidelberg, Germany, 2018; pp. 137–153.
7. Tobiyama, S.; Yamaguchi, Y.; Shimada, H.; Ikuse, T.; Yagi, T. Malware detection with deep neural network using process behavior. In Proceedings of the 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), Atlanta, GA, USA, 10–14 June 2016; Volume 2, pp. 577–582.
8. Wang, Y.; Stokes, J.W.; Marinescu, M. Neural malware control with deep reinforcement learning. In Proceedings of the IEEE Military Communications Conference (MILCOM 2019), 12–14 November 2019; pp. 1–8.
9. Petty, M.D.; Kim, J.; Barbosa, S.E.; Pyun, J.J. Software frameworks for model composition. Model. Simul. Eng. 2014, 2014, 492737.
10. Jacobson, I.; Griss, M.; Jonsson, P. Software Reuse: Architecture, Process and Organization for Business Success; ACM Press/Addison-Wesley Publishing Co.: New York, NY, USA, 1997; pp. 86–89.
11. Petty, M.D.; Morse, K.L.; Riggs, W.C. A reuse lexicon: Terms, units, and modes in M&S asset reuse. In Proceedings of the Simulation Interoperability Workshop, Orlando, FL, USA, 20–24 September 2010.
12. Allen, G.; Daly, J.J.; Heaphy, M. Making modeling and simulation reuse attractive. In Proceedings of the Interservice/Industry Training, Simulation, and Education Conference, Orlando, FL, USA, 2–5 December 2013; pp. 2161–2171.
13. Priya, B.; Rahul, C.; Ankesh, G. A Review Study on Computer Virus. World J. Res. Rev. 2022, 14, 39–44.
14. Skoudis, E.; Zeltser, L. Malware: Fighting Malicious Code; Prentice Hall Professional: Hoboken, NJ, USA, 2004.
15. Blogs, C. A Brief History of Malware Obfuscation: Part 1 of 2. Available online: http://blogs.cisco.com/security/a_brief_history_of_malware_obfuscation_part_1_of_2 (accessed on 8 July 2022).
16. Kirat, D.; Jang, J.Y.; Stoecklin, M. DeepLocker-concealing targeted attacks with AI locksmithing. In Proceedings of the Black Hat, Las Vegas, NV, USA, 8–9 August 2018.
17. Andrea, M.; Mariano, G.; Xabier, U.P. How Machine Learning Is Solving the Binary Function Similarity Problem. USENIX Security Symposium. 2022, pp. 2099–2116. Available online: https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli (accessed on 8 July 2022).
18. Steven, H.; Ding, H.; Benjamin, C.; Fung, M.; Philippe, C. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 472–489.
19. Kenter, T.; Borisov, A.; Rijke, M. Siamese CBOW: Optimizing word embeddings for sentence representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 August 2016; pp. 941–951.
20. Massarelli, L.; Luna, G.A.D.; Petroni, F.; Baldoni, R.; Querzoni, L. SAFE: Self-attentive function embeddings for binary similarity. In Proceedings of the Conference on Detection of Intrusions and Malware & Vulnerability Assessment, Gothenburg, Sweden, 19–20 June 2019; pp. 309–329.
21. Li, X.; Qu, Y.; Yin, H. PalmTree: Learning an assembly language model for instruction embedding. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 3236–3251.
22. Mili, H.; Mili, A.; Yacoub, S.; Addy, E. Reuse Based Software Engineering: Techniques, Organization, and Controls; John Wiley & Sons: New York, NY, USA, 2001.
23. Oliveira, T.C.; Alencar, P.; Cowan, D. ReuseTool-An extensible tool support for object-oriented framework reuse. J. Syst. Softw. 2011, 84, 2234–2252.
24. Gu, Y.; Shu, H.; Hu, F. UniASM: Binary Code Similarity Detection without Fine-tuning. arXiv 2022, in press.
25. Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; Lin, J. Distilling task-specific knowledge from BERT into simple neural networks. arXiv 2019, in press.
26. Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; Zhou, M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Proc. Adv. Neural Inf. Process. Syst. 2020, 33, 5776–5788.

Figure 1. The workflow of the novel virus model.

Figure 2. Frozen spots and hot spots in the framework. (a) is an abstraction of the function structure on the left, (b) is a case of frozen points and hot spots.

Figure 3. The call path between two functions.

Figure 4. UniASM backbone network.

Figure 5. Distillation of Net-T to Net-S.

Figure 6. Integrated similarity function.

Figure 7. Principles of framework functional structure recognition algorithms. (a) is the CFG with tolerance κ in Target. (b) is the relational graph of software framework similarity function abstracted from (a), and (b) is also a subgraph of (c). (c) is the CFG of a functional structure in Framefunc.

Figure 8. The frame structure corresponding to CMFCMenuButton::OnKillFocus.

Figure 9. Execution of infected software.

Figure 10. Infection success rate of samples by two infection methods. (a) is the infection success rate of the two methods for different framework versions, (b) is the infection success rate of the two methods for different compilation optimization levels, and (c) is the infection success rate of the two methods for different framework template types.

Table 1. The application of intelligent technology in network attack and defense.

Year	Literature	Application Scene	Validation Engine	Core Methodology
2017	[3]	Automated evasion	RNN, GAN	Unrelated API sequence insertion
2019	[4]	Automated phishing	RNN	Email insertion of malicious data
2016	[5]	Automated phishing	LSTM	Strongly targeted phishing post generation
2018	[6]	Malicious behavior detection	CNN	Traffic detection
2016	[7]	Malicious behavior detection	RNN, CNN	Process behavior detection
2019	[8]	Malicious code detection	DRN	Learn the best time to pause file execution

Table 2. Basic information of sample A.

MD5	d4d8311428c4cee8b1876562233a5e59
File size	3039.5 KB
Framework version	MFC v14.2
Optimization level	O1
Type	SingleDoc

Table 3. Part of similarity detection results.

Function	Address	Similarity Detection Results
CMFCMenuButton::OnKillFocus	0x5136A8	77%
CWnd::Default	0x41792D	69%
CThreadLocal::CreateObject	0x40FF3F	75%
CMFCButton::OnKillFocus	0x4D4989	69%
CThreadLocalObject::GetData	0x41D178	62%
…	…	…

Table 4. Software framework sample information.

Label	Framework	Quantity	Average Size	Version	Optimization Level	Type
$M_{T}$	MFC	60	3162 KB	9.0	Od/O1/O2/Ox	SingleDoc/MultiDoc/Dialog
$M_{P}$	MFC	120	4378 KB	10.0/14.2	Od/O1/O2/Ox	SingleDoc/MultiDoc/Dialog
$Q_{T}$	QT	40	1625 KB	4.8	O0/O1/O2/O3	GUI/Non-GUI
$Q_{P}$	QT	80	1852 KB	5.7/6.0	O0/O1/O2/O3	GUI/Non-GUI

Table 5. Comparison of model parameters before and after knowledge distillation.

Model	Structure	Parameter Size	Average Inference Time	Precision	F1
Net-T	Transformer	27 M	0.42 s	82.3%	72.1%
Net-S	BiLSTM	0.1 M	0.10 s	70.7%	64.4%

Table 6. Recognition accuracy rate of intelligent virus of framework function of different versions of samples. The left four columns of data are the precision of each group of tests, and the right two columns are the average precision and F1.

Framework	Version	Type	Optimization Level				Precision	F1
Framework	Version	Type	Od/O0	O1	O2	Ox/O3	Precision	F1
MFC	9.0	SingleDoc	0.97	0.93	0.94	0.88	0.93	0.73
		MultiDoc	0.95	0.94	0.91	0.84	0.91	0.56
		Dialog	0.96	0.91	0.87	0.82	0.89	0.60
	10.0	SingleDoc	0.97	0.92	0.89	0.82	0.90	0.68
		MultiDoc	0.97	0.90	0.87	0.85	0.90	0.72
		Dialog	0.96	0.93	0.90	0.84	0.91	0.85
	14.2	SingleDoc	0.95	0.92	0.90	0.83	0.90	0.60
		MultiDoc	0.96	0.90	0.86	0.86	0.90	0.68
		Dialog	0.97	0.93	0.88	0.86	0.91	0.56
QT	4.8	GUI	0.96	0.93	0.90	0.89	0.92	0.80
	4.8	Non-GUI	0.94	0.93	0.90	0.84	0.90	0.64
	5.7	GUI	0.94	0.92	0.90	0.86	0.91	0.64
	5.7	Non-GUI	0.96	0.90	0.81	0.84	0.88	0.59
	6.0	GUI	0.92	0.90	0.87	0.84	0.88	0.71
	6.0	Non-GUI	0.95	0.92	0.88	0.84	0.90	0.00
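
The average precision column above appears to be the arithmetic mean of the four per-optimization-level values, rounded to two decimals. That is an assumption inferred from the numbers, not a formula stated with the table. The sketch below checks it on three rows copied from Table 6; the names `mean_precision` and `consistent` are illustrative.

```python
# (framework, version, type) -> ([Od/O0, O1, O2, Ox/O3], reported average precision)
rows = {
    ("MFC", "9.0", "SingleDoc"): ([0.97, 0.93, 0.94, 0.88], 0.93),
    ("MFC", "14.2", "MultiDoc"): ([0.96, 0.90, 0.86, 0.86], 0.90),
    ("QT", "5.7", "Non-GUI"): ([0.96, 0.90, 0.81, 0.84], 0.88),
}


def mean_precision(per_level):
    """Arithmetic mean over the four optimization-level test groups."""
    return sum(per_level) / len(per_level)


def consistent(per_level, reported, half_unit=0.005):
    """True if the reported value matches the mean up to two-decimal rounding."""
    return abs(mean_precision(per_level) - reported) <= half_unit + 1e-9


for key, (per_level, reported) in rows.items():
    status = "ok" if consistent(per_level, reported) else "mismatch"
    print(key, f"mean={mean_precision(per_level):.4f}", f"reported={reported:.2f}", status)
```

The tolerance of half a unit in the second decimal is used instead of `round` so that borderline means such as 0.895 are not affected by floating-point rounding.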

Table 7. Precision of the intelligent virus in recognizing the framework structure in samples with unknown framework versions. The four optimization-level columns give the precision of each test group, and the two rightmost columns give the average precision and F1.

Framework	Version	Type	Precision (O0/Od)	Precision (O1)	Precision (O2)	Precision (O3/Ox)	Average Precision	F1
MFC	10.0	SingleDoc	0.97	0.92	0.93	0.92	0.94	0.73
		MultiDoc	0.91	0.90	0.94	0.92	0.92	0.85
		Dialog	0.95	0.96	0.90	0.98	0.95	0.65
	14.2	SingleDoc	0.89	0.83	0.84	0.82	0.85	0.77
		MultiDoc	0.88	0.85	0.89	0.83	0.86	0.63
		Dialog	0.87	0.85	0.85	0.85	0.86	0.63
QT	5.7	GUI	0.92	0.92	0.90	0.91	0.91	0.60
	5.7	Non-GUI	0.95	0.95	0.91	0.94	0.94	0.69
	6.0	GUI	0.85	0.85	0.86	0.85	0.85	0.74
	6.0	Non-GUI	0.86	0.87	0.88	0.84	0.86	0.70

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Guo, W.; Shu, H.; Gu, Y.; Huang, Y.; Zhao, H.; Li, Y. A Novel Virus Capable of Intelligent Program Infection through Software Framework Function Recognition. Electronics 2023, 12, 460. https://doi.org/10.3390/electronics12020460

