Article

E-SAWM: A Semantic Analysis-Based ODF Watermarking Algorithm for Edge Cloud Scenarios

Lijun Zu, Hongyi Li, Liang Zhang, Zhihui Lu, Jiawei Ye, Xiaoxia Zhao and Shijing Hu
1 School of Computer Science, Fudan University, Shanghai 200433, China
2 China UnionPay Co., Ltd., Shanghai 201210, China
3 Institute of Financial Technology, Fudan University, Shanghai 200433, China
4 Huawei Technologies Co., Ltd., Nanjing 210012, China
* Authors to whom correspondence should be addressed.
Future Internet 2023, 15(9), 283; https://doi.org/10.3390/fi15090283
Submission received: 12 July 2023 / Revised: 11 August 2023 / Accepted: 18 August 2023 / Published: 22 August 2023

Abstract:
With the growing demand for data-sharing file formats in financial applications driven by open banking, the use of the OFD (Open Fixed-layout Document) format has become widespread. However, ensuring data security, traceability, and accountability poses significant challenges. To address these concerns, we propose E-SAWM, a dynamic watermarking service framework designed for edge cloud scenarios. This framework incorporates dynamic watermark information at the edge, allowing data leakage to be tracked precisely throughout the data-sharing process. Using semantic analysis, E-SAWM generates highly realistic pseudostatements that exploit the structural characteristics of the documents inside OFD files. These pseudostatements are distributed across the structural documents and embedded in their redundant bits, ensuring that the watermark resists removal or complete destruction. Experimental results demonstrate that our algorithm has a minimal impact on the original file size, with the embedded watermark text accounting for less than 15% of the file, indicating a high watermark-carrying capacity. Additionally, compared with existing explicit watermarking schemes for OFD files based on the annotation structure, the proposed scheme meets the technical requirements of complex dynamic watermarking deployed in edge cloud scenarios. It effectively overcomes vulnerabilities associated with easy deletion and tampering, providing high concealment and robustness.

1. Introduction

In the dynamic landscape of the digital economy, commercial banks face the challenge of sharing a significant amount of financial data with clients' designated digital applications in electronic file format [1]. As a novel file format, the open fixed-layout document (OFD) format has gained popularity within the financial industry [2]. It offers unique advantages for various financial processes, including electronic receipts and financial statements, which are in increasing demand and leverage the capabilities of the OFD format in the domain of financial management.
In addition to the advantages of data sharing, the prevention of data leakage has emerged as a growing concern [3]. Currently, most banks rely on contractual agreements to enforce compliance and security measures during the transmission and use of data by application parties, with little supporting technology. When data are leaked within an application scenario, banks struggle to promptly and accurately assign responsibility to the relevant application parties, with detrimental consequences for customers, banks, and the overall financial system. Given the increasing openness of the scenario ecosystem, it becomes increasingly difficult for banking institutions to mitigate data-sharing risks by relying solely on contractual agreements. It is imperative to incorporate additional technical support to fortify data security measures and enable effective prevention and monitoring of data security risks. Embedding watermarks in OFD files plays a crucial role in ensuring timely traceability and accountability after instances of data leakage. Currently, the financial industry lacks a comprehensive security framework that effectively addresses data leakage prevention and tracking during the transmission and processing of financial data between cloud and edge. This gap is particularly evident for the OFD file format, whose application is still in its early stages. Furthermore, there is a dearth of dynamic watermarking algorithms that offer high transparency, concealment, and robustness together with the capacity to carry substantial financial antileakage tracking information.
Banking and financial institutions heavily rely on data centers to facilitate financial services in conjunction with cloud-based scenarios. Within this service framework, the banking system is responsible for processing the entire collection of the bank’s financial documents in the OFD format before transmitting them to the service scenario side for subsequent business processing. Ensuring the security of financial data during this processing stage is of utmost importance. We present E-SAWM, an implicit watermarking service framework for OFD files based on semantic analysis in an edge cloud computing scenario. Scenario-side edge cloud computing, an extension of the banking institution-side cloud computing center, is positioned closer to the user scenarios. In financial data-sharing scenarios, deploying data protection edge services on the scenario side enables accelerated and secure data processing. By leveraging the close proximity of edge computing to the data and utilizing its real-time capabilities, the scenario side allows for the application of more advanced security algorithms to meet diverse and higher-level financial data protection requirements, ensuring enhanced data security processing.
In this paper, we propose an OFD implicit watermarking framework, E-SAWM, based on semantic analysis in edge cloud scenarios. To ensure the security of embedded watermarks, we leverage the inherent semantic properties of the internal structured files of OFD. By using semantic analysis techniques, we generate highly authentic pseudostatements that closely resemble genuine content. These pseudostatements are then distributed efficiently and seamlessly integrated into the redundant bits of the OFD structured files. The proposed method offers the following significant advantages:
1. Transparency: E-SAWM ensures zero interference with the structure and display of the OFD file, preserving its original integrity;
2. Concealment: E-SAWM utilizes transformations and realistic pseudosentences to effectively conceal the watermark, impeding detection by potential attackers;
3. Robustness: E-SAWM employs distributed embedding of the watermark across multiple structural files and selects distributed redundant bits within the same file. This approach enhances the robustness of the watermark and hinders attackers from destroying the watermark information in the OFD file;
4. High capacity: E-SAWM supports unlimited watermark information in terms of length and quantity, enabling the embedding of a substantial amount of watermark data.
The rest of this paper is structured as follows. In Section 2, we present an overview of the related work in this field. Section 3 introduces the architecture of the open bank data service based on edge cloud and presents the OFD implicit watermarking algorithm scheme that relies on semantic analysis. In Section 4, we present the experimental results and analyze the outcomes in the context of real-world scenarios in the financial industry. Finally, in Section 5, we conclude the paper and provide an outlook for future research in this domain.

2. Related Work

2.1. Application of Edge Computing in the Domain of Financial Data Protection

Since 2015, edge cloud computing has emerged as a prominent technology, positioned on the Gartner technology maturity curve and experiencing rapid industrialization and growth. Edge computing represents a distributed computing paradigm that positions primary processing and data storage at the edge nodes of the network. According to the Edge Computing Industry Alliance [4], it is an open platform integrating network, computing, storage, and application core capabilities at the edge of the network, in close proximity to the data source. This setup enables the provision of intelligent edge services to meet crucial requirements for industrial digitization, including agile connectivity, real-time services, data optimization, application intelligence, and security and privacy protection. International standards organization ETSI [5] defines edge computing as the provisioning of IT service environments and computing capabilities at the network edge, aiming to reduce latency in network operations and service delivery, ultimately enhancing the user experience. Infrastructure for edge cloud computing encompasses various elements, such as distributed IDCs, carrier communication network edge infrastructure, and edge devices like edge-side client nodes, along with their corresponding network environments.
Serving as an extension of cloud computing, edge cloud computing provides localized computing capabilities and excels in small-scale, real-time intelligent analytics [6]. These inherent characteristics make it highly suitable for smart applications, where it can effectively support small-scale smart analytics and deliver localized services. In terms of network resources, edge cloud computing assumes the responsibility for data in close proximity to the information source. By facilitating local storage and processing of data, it eliminates the need to upload all data to the cloud [7]. Consequently, this technology significantly reduces the network burden and substantially improves the efficiency of network bandwidth utilization. In application scenarios that prioritize data security, especially in sectors such as finance, edge clouds offer enhanced compliance with stringent security requirements. By enabling the storage and processing of sensitive data locally, edge clouds effectively mitigate the heightened risks of data leakage associated with placing such critical information in uncontrollable cloud environments.
In the evolving landscape of the financial industry, there is a paradigm shift toward open banking, often referred to as banking 4.0. Departing from the traditional customer-centric approach, open banking places emphasis on user centricity and advocates for data sharing facilitated by technical channels such as APIs and SDKs. Its primary goal is to foster deeper collaboration and forge stronger business connections between banks and third-party institutions, which enables the seamless integration of financial services into customers’ daily lives and production scenarios. The overarching objective is to optimize the allocation of financial resources, enhance service efficiency, and cultivate mutually beneficial partnerships among multiple stakeholders. An illustrative example of this paradigm shift is evident in bank card electronic payment systems, where the deployment of secure and encrypted POS machines at the edge enables convenient electronic payments [8].
Extensive research has been conducted to address security challenges in edge cloud environments. M. Ati et al. [9] proposed an enhanced cloud security solution to strengthen data protection against attacks. Similarly, L. Chen et al. [10] proposed a heterogeneous endpoint access authentication mechanism for a three-tier ("cloud-edge-end") system in edge computing scenarios, aiming to support a large number of endpoint authentication requests while preserving the privacy of endpoint devices. Building upon this, Z. Song et al. [11] introduced a novel attribute-based proxy re-encryption approach (COAB-PRE) that enables data privacy, controlled delegation, bilateral access control, and distributed access control for data sharing in cloud edge computing. G. Cui et al. [6] developed ICL-EDI, a scheme for efficient data integrity checking and corruption location for edge data. Additionally, Z. Wang et al. [12] introduced a flexible time-ordered threshold ring signature scheme based on blockchain technology to secure collected data in edge computing scenarios, ensuring a secure and tamper-resistant environment. However, to the best of our knowledge, existing research has not extensively addressed leakage tracking techniques for sensitive data in edge computing scenarios.

2.2. Edge Cloud-Based Financial Regulatory Outpost Technology

The open sharing of data brings inherent risks of personal privacy data leakage. In the financial industry, it is crucial to ensure compliance with regulations such as the Data Security Law and the Personal Information Protection Law while conducting business operations. To tackle this challenge, we propose deploying a regulatory outpost at the edge of the data application side, with a specific focus on third-party institutions, to enhance the security and compliance of open banking data within the application side of the ecosystem.
The regulatory outpost is a standalone software system designed to monitor data operations on the application side, aiming to prevent data violations and mitigate the risk of data leakage. The system offers comprehensive monitoring capabilities throughout different stages of the application's data operations, including data storage, reading, and sharing, as well as intermediate processing tasks, such as sensitive data identification, desensitization, and watermarking. In addition, the regulatory outpost maintains meticulous records of all user data operation logs, facilitating log audits, leak detection, and generation of data flow maps and enabling situational awareness regarding data security.
In light of the above considerations, the regulatory outpost operates at the edge of data processing and plays a significant role in the processing pipeline. To ensure optimal efficiency and cost-effectiveness, the deployment of regulatory outposts should satisfy the following requirements in the context of data operations:
1. Elastic and scalable resource allocation: Data processing applications necessitate computational resources, but the overall data volume tends to vary. For instance, during certain periods, the data volume processed by the application side may increase, requiring more CPU performance, memory, hard disk space, and network throughput capacity. Conversely, when the processing data volume decreases, these hardware resources remain underutilized, leading to wastage. Therefore, it is essential for regulatory outposts to support the elastic scaling of resources to minimize input costs associated with data processing operations;
2. Low bandwidth consumption cost and data processing latency: The application's data traffic is directed through the regulatory outpost, which can lead to increased bandwidth consumption costs and higher network latency, especially if the outpost is deployed in a remote location like another city. The current backbone network, which is responsible for interconnecting cities, incurs higher egress bandwidth prices, and its latency is relatively higher compared to the metropolitan area network and local area network. To minimize the impact on the application experience, it is essential to maintain low bandwidth utilization costs and minimize data processing latency;
3. Data compliance: Due to concerns about open banking data leakage, the application side tends to prefer localized storage of open banking data to the greatest extent possible, which enables the application side to more conveniently monitor the adequacy of security devices and the effectiveness of security management protocols.
Edge clouds provide significant advantages due to their proximity to data endpoints, including cost savings in network bandwidth, low latency in data processing, and improved data security. Moreover, they offer the scalability, elasticity, and resource-sharing benefits commonly associated with centralized cloud computing. Hence, deploying regulatory outposts in the edge cloud is a logical decision. Figure 1 showcases an example deployment scenario.
The regulatory outpost consists of two components: “regulatory outpost—data input processing” and “regulatory outpost—data export processing”. The specific data processing work flow is illustrated in Figure 2.

2.2.1. Regulatory Outpost—Data Input Processing

This component automatically identifies sensitive data among inflowing data and generates a data asset map, a data desensitization policy, a permission control policy for the zero trust module, and a data destruction policy based on the identified sensitive data. To cater to the frequent viewing of short-term data, such as logs, by application-side users, a two-tier data storage approach is employed. The desensitized data are saved in a short-term database, while a full-volume database retains all the data. In cases in which the data contain highly confidential information, they are encrypted prior to being written into the full-volume database.

2.2.2. Regulatory Outpost—Data Export Processing

In the data access scenario, the zero trust module of the regulatory outpost plays a critical role in verifying access privileges for data users. When accessing data from a short-term database, open banking data are transmitted to the data user after incorporating watermark information, such as the data user’s identity, data release date, and usage details. However, if the data are retrieved from the full-volume database, they must undergo desensitization based on the desensitization policy before the inclusion of watermark information and subsequent transmission to the data user. To ensure accountability, the log auditing module captures and logs all data operations for auditing purposes. The audit results are then utilized to generate data flow maps, detect instances of data leakage, and provide valuable insights into data security situational awareness. These insights facilitate the identification of existing data security risks and offer suggestions for improvement measures.
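The short Python sketch below mirrors the export flow just described; it is purely illustrative, and every helper is a hypothetical stub standing in for the real zero trust, desensitization, watermarking, and audit components of the regulatory outpost.

```python
# Illustrative sketch of the data export path; all helpers are stubs, not a real API.
def verify_access(user: dict, record: dict) -> bool:
    return record.get("owner") in user.get("granted", [])      # stub zero trust policy

def desensitize(record: dict) -> dict:
    return {**record, "card_no": "**** **** **** " + record["card_no"][-4:]}

def add_watermark(record: dict, **fields) -> dict:
    return {**record, "watermark": fields}                     # stub watermark step

def audit_log(action: str, **details) -> None:
    print("AUDIT", action, details)                            # stub log auditing module

def export_data(record: dict, user: dict, source: str) -> dict:
    if not verify_access(user, record):                        # zero trust check
        raise PermissionError("access denied by zero trust policy")
    if source == "full_volume":                                # full-volume DB data must be
        record = desensitize(record)                           # desensitized before release
    record = add_watermark(record, user=user["id"],
                           date=user["date"], purpose=user["purpose"])
    audit_log("export", user=user["id"])
    return record
```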

2.3. Document Watermarking Techniques

Files are a prominent data format for data sharing. In the process of sharing files from the cloud (bank side) to the edge cloud (application side), it becomes crucial to monitor potential data leakage at each step. This concern is particularly relevant for the edge side, where the development of a watermarking algorithm with high levels of transparency, concealment, robustness, and capacity has become a subject of significant academic interest.
Electronic document formats can be categorized into two types: streaming documents and fixed-layout documents. Streaming documents, such as Word and TXT files, support editing, and their display may vary depending on the operating system and reader version. In contrast, fixed-layout documents have a fixed layout that remains consistent across different operating systems and readers.
OFD is an innovative electronic document format that conforms to the “GB/T 33190-2016 Electronic Document Storage and Exchange Format—Layout Documents” standard [13]. OFD was specifically developed to fulfill the demands of effectively managing and controlling layout documents while ensuring their long-term preservation. By offering a dependable and standardized format, OFD facilitates the maintenance of consistent layouts and supports the preservation of electronic documents. Our work primarily concentrates on the watermarking technology for OFD files, which serves as the prevalent file format utilized in the financial sector.
The OFD file format adopts XML (Extensible Markup Language) to define document layout, employing a “container + document” structure to store and describe data. The content of a document is represented by multiple files contained within a zip package, as illustrated in Figure 3. A detailed analysis and explanation of the internal structure components of an OFD file are provided in Table 1.
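As a quick illustration of this container layout, the standard-library sketch below opens an OFD package as a zip archive and lists its XML structural files; "statement.ofd" is a hypothetical sample file name.

```python
# Minimal sketch: inspect the "container + document" structure of an OFD file.
import zipfile
import xml.etree.ElementTree as ET

def list_ofd_structure(path: str) -> None:
    """Print every entry inside an OFD container; for XML entries, show the root tag."""
    with zipfile.ZipFile(path) as ofd:
        for name in ofd.namelist():
            if name.lower().endswith(".xml"):
                root = ET.fromstring(ofd.read(name))
                print(f"{name}: root element <{root.tag}>")
            else:
                print(f"{name}: binary resource")

list_ofd_structure("statement.ofd")  # hypothetical sample file
```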
In the realm of layout document formats, OFD and PDF are widely utilized. Watermarking techniques for layout documents can be categorized into several methods:
1. Syntax- or semantics-based approaches: leveraging natural language processing techniques to replace equivalent information, perform morphological conversions, and adjust statement structures to facilitate watermark embedding [6,14];
2. Format-based approaches: encompassing techniques such as line shift coding, word shift coding, space coding, modification of character colors, and adjustment of glyph structures [15];
3. Document structure-based approaches: leveraging PDF structures like PageObject, imageObject, and cross-reference tables, enabling the embedding of watermarks while preserving the original explicit location [16].
The field of PDF watermarking has reached a relatively mature stage of development. However, watermarking algorithms that rely on syntax and format modifications may alter the original text content, which conflicts with the requirement of preserving the originality of digital products. Consequently, watermarking algorithms based on the document structure are commonly employed to add watermarks to PDF files. Zhong et al. [17] presented a novel method for watermarking PDF documents that embeds watermarks in the redundant identifier found at the end of the PDF cross-reference table. With this technique, the original text content and display of the PDF remain unaltered, achieving complete transparency when the file is viewed in PDF readers. Khadam et al. [18] added watermarks based on the PageObject structure within the PDF, which offers resistance against attacks such as adding or deleting text to manipulate the page content. Using these document structure-based techniques, PDF files can be effectively watermarked without compromising the original content, while maintaining transparency and integrity in PDF readers.
The field of watermarking in the context of OFD has received limited attention in both academia and industry. In academia, there is a noticeable dearth of research studies and published papers specifically dedicated to OFD watermarking. On the industry front, existing OFD watermarking techniques primarily rely on explicit watermarks, which are implemented based on the following principles:
The watermark text content, along with relevant information such as position, transparency, size, and color, is defined within the annotation structure file named Annotation.xml. This file is an integral part of the internal structure of the OFD file and is typically located in the Annots/Page_n folder. The details of watermark addition are depicted in Figure 4 and Figure 5.
Although the structure of the watermark may seem clear and straightforward, it is susceptible to various attacks. Adversaries can manipulate the Annotation.xml file, compromising the watermark's integrity, decryption, and identification, and potentially removing it maliciously. Consequently, tracing compromised data becomes significantly challenging.

3. Model and Algorithm

3.1. Dynamic Watermarking Implementation

In compliance with regulatory requirements, data users in open banking must possess data traceability capabilities to effectively trace and determine data leakage incidents. Watermarking is a widely used technical approach to trace and assign responsibility in such scenarios.
Based on the aforementioned service architecture, an effective approach for tracking data and mitigating the risk of data leakage involves leveraging the data proximity processing capabilities of the edge cloud, which requires the utilization of the edge cloud’s computing power to implement data leakage tracking technology. Furthermore, it is essential to employ a highly efficient and flexible watermarking algorithm on the edge cloud side to support the tracking of financially sensitive data. Specifically, in the context of OFD file applications, the edge cloud watermarking service facilitates the dynamic addition of timely watermarks after file processing on the edge cloud to ensure effective tracking in the event of data leakage.
Data watermarking in the financial industry encompasses two main approaches: static watermarking and dynamic watermarking. Static watermarking involves adding a large number of watermarks to the data during the pre-preparation phase, which is done once and remains unchanged. On the other hand, dynamic watermarking is performed in real time during the data access process, including data querying, accessing, real-time exchange, and dynamic release. This approach ensures that the watermarks are dynamic and updated in real time.
Unlike static watermarking, which can be pre-processed in batches on the central cloud without real-time requirements, dynamic watermarking is primarily deployed in the edge cloud to meet the demands of real-time data processing. Specifically, in financial business scenarios, where sensitive data need to be accessed by various platforms via API interfaces, the bank’s data documents are appended with a unified static watermark before being shared. Upon reaching the third-party application, a dynamic watermark is added by the “regulatory outpost” based on the application’s information and document content. Subsequently, the watermark information is dynamically replaced at each stage of data usage, ensuring accurate tracking of any potential data file leakage. The following examples provide a step-by-step illustration of the watermarking process:
1. Adding watermarks during data reception by the application-side database, as depicted in Figure 6. As the application side receives open banking data from a bank, a dynamic watermark is added, either explicitly or implicitly, while the data traverse a supervisory outpost situated in the edge cloud. This watermarking enables traceability in the event of an open banking data breach, allowing identification of the breaching application side. The standard format typically follows: "Received Data from XXX Bank by XX Organization on xx/xx/xxxx (date). Purpose: XXXX";
2. Adding watermarks during the download of data from the database by application-side employees, as shown in Figure 7. Whenever an application-side employee retrieves data from the application-side database, a dynamic watermark, typically implicit in nature, is embedded. This watermark serves the purpose of identifying the individual responsible for any data leakage when tracing its origin in the context of open banking. The format commonly follows: "On xx/xx/xxxx (date), employee xxx downloaded open banking data from the database. Purpose: XXXX". Remarkably, the newly added watermark can coexist with the original watermark;
3. Adding watermarks when sharing data with external entities on the application side, as shown in Figure 8. In some cases, the application side needs to desensitize the open banking data and then share it with a partner, for example in the course of business cooperation. Hence, it is necessary to add a watermark that identifies the specific partner when the leakage is traced. Typically, the format is "On xx/xx/xxxx (date), xxxx shared open banking data with the collaborator, xxxx. Purpose: XXXX". A minimal sketch of composing these watermark strings follows the list.
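The helpers below compose the three dynamic watermark strings described above. Field names and function signatures are assumptions made for illustration; the framework does not prescribe a specific API.

```python
# Illustrative composition of the dynamic watermark text for the three scenarios.
from datetime import date

def reception_watermark(bank: str, org: str, purpose: str, d: date) -> str:
    return (f"Received Data from {bank} by {org} on {d:%m/%d/%Y}. "
            f"Purpose: {purpose}")

def download_watermark(employee: str, purpose: str, d: date) -> str:
    return (f"On {d:%m/%d/%Y}, employee {employee} downloaded open banking "
            f"data from the database. Purpose: {purpose}")

def sharing_watermark(org: str, partner: str, purpose: str, d: date) -> str:
    return (f"On {d:%m/%d/%Y}, {org} shared open banking data with the "
            f"collaborator, {partner}. Purpose: {purpose}")

print(reception_watermark("XXX Bank", "XX Organization",
                          "statement reconciliation", date(2023, 8, 22)))
```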

3.2. Dynamic Watermarking Algorithm for OFD

To address the aforementioned scenario, we propose E-SAWM, a watermarking algorithm based on semantic analysis. At the file level, we incorporate the watermark into Content.xml, the key structural file of OFD; consequently, deleting Content.xml corrupts the entire OFD page. At the content level, we leverage semantic analysis of the structural statements within Content.xml to generate highly realistic pseudo structural statements. These pseudo structural statements, carrying the watermark, are distributed and embedded within each Content.xml file. This distributed embedding ensures that the watermark remains concealed, making it challenging for attackers to identify its existence, location, and content. Furthermore, E-SAWM is robust against attempts to destroy or tamper with the watermark fields.

3.2.1. Semantic Analysis Model

In the realm of natural language processing, computers often face challenges when dealing with complex text systems. Consequently, converting "words" into a form that computers can easily handle has become a pressing concern. To tackle this challenge, word2vec introduced the concept of mapping "words" to real-valued vectors, known as word embedding, resulting in word vectors. The Word2Vec model encompasses two primary variants: Skip-gram and CBOW (Continuous Bag-of-Words). Intuitively, Skip-gram predicts the context given an input word, whereas CBOW predicts the input word based on its context [19].
  • Skip-Gram model
In the Skip-gram model, every word is associated with two d-dimensional vectors, which are used to calculate conditional probabilities. Specifically, for a word indexed as $i$ in the lexicon, the two vectors are $v_i \in \mathbb{R}^d$ and $u_i \in \mathbb{R}^d$ when it functions as a central word and as a context word, respectively. Given a central word $w_c$ (indexed as $c$ in the dictionary), the conditional probability of generating any context word $w_o$ (indexed as $o$ in the dictionary) can be modeled through a softmax operation on the dot product of the vectors as follows:
$$P(w_o \mid w_c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{i \in \mathcal{V}} \exp(u_i^{\top} v_c)} \tag{1}$$
where the vocabulary index set is $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}| - 1\}$. Given a text sequence of length $T$ in which the word at time step $t$ is denoted $w^{(t)}$, assume that the context words are generated independently given any central word. For a context window of size $m$, the Skip-gram model is trained by maximizing the log-likelihood of generating all context words given each central word:
$$\sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P\!\left(w^{(t+j)} \mid w^{(t)}\right) \tag{2}$$
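For intuition, the short NumPy sketch below evaluates Equation (1) on a toy vocabulary with randomly initialized (untrained) vectors; the vocabulary size and embedding dimension are arbitrary choices for illustration.

```python
# Numerical sketch of Equation (1): softmax over dot products of word vectors.
import numpy as np

rng = np.random.default_rng(0)
V, d = 4, 8                      # toy vocabulary size, embedding dimension
v = rng.normal(size=(V, d))      # central-word vectors v_i
u = rng.normal(size=(V, d))      # context-word vectors u_i

def p_context_given_center(o: int, c: int) -> float:
    scores = u @ v[c]            # u_i^T v_c for every i in the vocabulary
    scores -= scores.max()       # numerical stability before exponentiation
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[o])       # P(w_o | w_c)

print(p_context_given_center(o=2, c=0))
```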
  • CBOW model
CBOW is a variant of the Skip-gram model; the main distinction is that CBOW assumes that the central word is generated based on the surrounding context words within the text sequence.
In CBOW, multiple context words are taken into account, and the context word vectors are averaged to calculate the conditional probabilities. Let $v_i \in \mathbb{R}^d$ and $u_i \in \mathbb{R}^d$ represent the vectors corresponding to the context words and central words, respectively, for any word at index $i$ in the dictionary. The conditional probability of generating a central word $w_c$ (indexed by $c$ in the word list) given the context words $w_{o_1}, \ldots, w_{o_{2m}}$ (indexed by $o_1, \ldots, o_{2m}$ in the word list) can be represented using the following equation:
$$P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\!\left(\frac{1}{2m}\, u_c^{\top}(v_{o_1} + \cdots + v_{o_{2m}})\right)}{\sum_{i \in \mathcal{V}} \exp\!\left(\frac{1}{2m}\, u_i^{\top}(v_{o_1} + \cdots + v_{o_{2m}})\right)} \tag{3}$$
Let $\mathcal{W}_o = \{w_{o_1}, \ldots, w_{o_{2m}}\}$ and $\bar{v}_o = \frac{v_{o_1} + \cdots + v_{o_{2m}}}{2m}$; then, the above equation can be simplified as
$$P(w_c \mid \mathcal{W}_o) = \frac{\exp(u_c^{\top} \bar{v}_o)}{\sum_{i \in \mathcal{V}} \exp(u_i^{\top} \bar{v}_o)} \tag{4}$$
Considering a text sequence of length $T$, in which the word at time step $t$ is denoted $w^{(t)}$, and employing a context window of size $m$, the likelihood function of CBOW is the probability of generating all central words given their respective context words:
$$\prod_{t=1}^{T} P\!\left(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}\right) \tag{5}$$
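A companion sketch for Equations (3) and (4): CBOW simply averages the context vectors before the softmax. The toy vectors are again random and untrained; here $v$ holds the context-word vectors and $u$ the central-word vectors, following the notation of this subsection.

```python
# Numerical sketch of Equation (4): softmax over the averaged context vector.
import numpy as np

rng = np.random.default_rng(1)
V, d = 4, 8
v = rng.normal(size=(V, d))      # context-word vectors v_i
u = rng.normal(size=(V, d))      # central-word vectors u_i

def p_center_given_context(c: int, context: list[int]) -> float:
    v_bar = v[context].mean(axis=0)      # average of the 2m context vectors
    scores = u @ v_bar                   # u_i^T v_bar for every i
    scores -= scores.max()               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[c])               # P(w_c | W_o)

print(p_center_given_context(c=1, context=[0, 2, 3]))
```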
  • Word-embedding model comparison
Assuming a text corpus with $V$ words and a window size of $K$, the CBOW model makes approximately $O(V)$ predictions, one for each word in the corpus. In contrast, Skip-gram performs more predictions than CBOW: each word, when serving as the central word, is used to predict each of its $K$ surrounding context words, resulting in a time complexity of $O(KV)$.
While CBOW trains faster than Skip-gram, the latter produces superior word vector representations. When dealing with a corpus containing many low-frequency words, Skip-gram provides better word vectors for these words but requires more training time; CBOW is more efficient in such cases. The choice between the models depends on specific requirements: if higher prediction accuracy is needed and lower training efficiency is acceptable, the Skip-gram model is preferred; otherwise, the CBOW model can be chosen [20].
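As a concrete illustration, the sketch below trains both variants with the gensim library (one widely used Word2Vec implementation; the paper does not name a specific toolkit) on a tiny, hypothetical corpus of tokenized OFD structural keywords. The `sg` flag switches between CBOW (`sg=0`) and Skip-gram (`sg=1`).

```python
# Hedged sketch: training CBOW and Skip-gram with gensim on toy structural keywords.
from gensim.models import Word2Vec

corpus = [
    ["TextObject", "Boundary", "Font", "Size", "FillColor"],
    ["PathObject", "Boundary", "LineWidth", "StrokeColor"],
    ["ImageObject", "Boundary", "ResourceID", "CTM"],
]  # illustrative keyword lists, not a real OFD corpus

cbow = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=0)
skip_gram = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

# Structural keywords most similar to "Boundary": candidates for the
# pseudo-structure carriers used later in the embedding step.
print(skip_gram.wv.most_similar("Boundary", topn=3))
```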

3.2.2. OFD Watermarking Algorithm Based on Semantic Analysis

Word2vec is a widely utilized concept across various domains. Le and Mikolov introduced doc2vec, an algorithm that represents sentences or short texts as vectors by treating sentences of different lengths as training samples [22]. In the field of biology, Asgari and Mofrad proposed BioVec for the analysis of biological sequence word vectors [21].
For OFD, the structural documents consist of statements that adhere to specific rules. Here, we use the term "structural statements" to refer to the structural information present in OFD files. By treating these statements as natural language, it becomes possible to generate a context specific to each structural document. Consequently, we devised a watermark-embedding algorithm that leverages semantic analysis. For contextual datasets, the process maps segmented words to word vectors distributed in a high-dimensional space using word2vec, which enables the evaluation of word similarity. Given a known context, we use the word2vec model to transform it, obtaining the k words that most closely resemble the original context words. These words are then distributed and embedded within the original structural document, serving as watermark carriers. Our approach capitalizes on semantic analysis to develop a highly covert watermark-embedding algorithm.
The algorithm follows the flow depicted in Figure 9 and is divided into four main modules:
1. Semantic analysis model training. Construct a pseudo structural library based on the original structural library of OFD. Gather n instances of context data from structured documents in OFD format. Using these context data, along with the pseudo structure body library, generate n instances of the context dataset with pseudo structure bodies. These contextual datasets are then trained separately with the CBOW model and the Skip-gram model to develop the semantic analysis model. When conducting semantic analysis on an OFD-structured document, the context is initially extracted. For a context dataset containing a higher proportion of low-frequency structural bodies, the Skip-gram model is preferred for semantic analysis due to its better performance; for a contextual dataset containing a higher proportion of high-frequency structures, CBOW is used for semantic analysis.
Assume that $m$ structural files with embeddable watermarks are extracted from an OFD file that requires watermark addition, that $V_i$ structural keywords are extracted from file $F_i$, and that the training window size is $K$. In such cases, the time complexity for training using the CBOW model can be calculated as follows:
$$T_{\mathrm{CBOW}} = V_0 + V_1 + \cdots + V_{m-1} = O\!\left(\sum_{i=0}^{m-1} V_i\right) \tag{6}$$
The time complexity for training using the Skip-gram model is as follows:
$$T_{\mathrm{Skip\text{-}gram}} = K \cdot V_0 + K \cdot V_1 + \cdots + K \cdot V_{m-1} = O\!\left(K \sum_{i=0}^{m-1} V_i\right) \tag{7}$$
Based on the size of the text, we set a threshold ($\tau$) to select the model with the better training effect:
$$\text{model} = \begin{cases} \text{Skip-gram}, & \sum_{i=0}^{m-1} V_i < \tau \\ \text{CBOW}, & \sum_{i=0}^{m-1} V_i \ge \tau \end{cases} \tag{8}$$
As mentioned in Section 3.2.1, the Skip-gram model exhibits higher accuracy than the CBOW model; thus, when the time consumed by the two models is similar, Skip-gram is preferred. In our experiments, this crossover occurred at $\tau = 1000$.
2. Watermark content processing. Encrypt the watermark text (originalInfo) with the SM4 algorithm, using a key derived from the combination of the file name of the watermarked file (fileName) and a custom string (myString) provided by the watermark adder. This encryption process is represented by Equation (9):
$$\mathrm{SecretWatermarkMessage} = \mathrm{SM4}_{\,\mathrm{fileName}\,\circ\,\mathrm{myString}}\!\left(\mathrm{originalInfo}\right) \tag{9}$$
where $\circ$ denotes the concatenation of fileName and myString used as the SM4 key.
Subsequently, convert the encrypted watermark message into a byte array and perform grouping on the byte array;
3. Watermark embedding. For each structural file within the target OFD file, conduct structure extraction and combine the extracted structures, in their original order, as the context. Use the semantic analysis model trained in step 1 to perform semantic analysis and obtain the pseudo structures with the top-K similarity. Insert these pseudo structures into the structural files of the OFD, and embed the watermark groups obtained in step 2 into the pseudo structures;
4. Watermark extraction. Extract the structure names and contents from all structural files within the watermarked OFD file, compare them with the corpus, and filter out the pseudo structures. Based on the grouping information within each pseudo structure, combine the byte-array groups associated with the same watermark into a complete byte array, which is then parsed into a string and decrypted with the key. Finally, the complete watermark is obtained. A code sketch of steps 2 to 4 is given after this list.

4. Experiments

To evaluate the effectiveness of our proposed OFD watermarking algorithm based on semantic analysis, we performed a series of tests covering steganographic concealment, robustness under various attacks, and watermark capacity [17,23]. Additionally, we compared our results with those obtained by existing OFD watermarking algorithms commonly employed in the industry.

4.1. Steganography

One of the fundamental requirements of an invisible watermark is its imperceptibility. The embedded watermark in the OFD document must remain completely hidden, ensuring that no noticeable alterations are made to the visible display interface of the document. Moreover, users should be unable to detect the presence of the watermark, making it challenging for attackers to identify its location or develop cracking methods.
The most common watermarking algorithm employed in the industry is categorized as an explicit watermarking algorithm, relying on annotated files. When a highly transparent watermark is added, it becomes difficult to visually discern the watermark with the naked eye. Nevertheless, it is still possible to identify the watermark by converting the OFD page into an image and adjusting the image’s contrast. In contrast, a low-transparency watermark is clearly visible to the naked eye. To the best of our knowledge, there is no existing research on watermarking of OFD files. While some studies have focused on watermarking PDF files using syntax- and format-based algorithms, these approaches tend to alter the original text content, which may not comply with the originality requirements for digital products.
We designed an experiment to test the steganographic capability of E-SAWM. Figure 10 shows a comparison of an OFD document before and after watermarking. The two documents appear indistinguishable to the naked eye, and even after converting the OFD pages to images, the watermark information remains hidden. Compared with traditional watermarking algorithms that rely on annotated documents, the proposed OFD watermarking algorithm based on semantic analysis exhibits strong steganographic capability.

4.2. Robustness

To test the algorithm's robustness, we provide users with various editing options, following the approach outlined in reference [18]. These options include highlighting, underlining, strikethrough, wavy lines, handwritten scribbles, and text overlay. Users can apply these edits at randomly selected locations within the watermarked OFD file.
In the industry's OFD watermarking algorithm based on the annotation structure, attacks such as highlighting and underlining can affect the visual appearance of the watermark information, although they do not compromise its integrity.
We conducted robustness testing on E-SAWM. Figure 11 illustrates an example of an attack, while Table 2 presents the results of watermark extraction. E-SAWM introduces the integration of semantically similar pseudo structures into structured files. Notably, any structural changes that may arise in the original file when incorporating features like highlighting or underlining have no impact on the pseudo structural content. Consequently, the results show that our OFD watermarking algorithm demonstrates effective resistance against the attacks listed in the table, achieving a 100% success rate in watermark extraction for OFD files under each attack type.

4.3. Watermark Capacity

Watermark capacity refers to the proportion of the watermark information size to the size of the document being watermarked. It can be calculated as follows:
$$\mathrm{WatermarkCapacity} = \frac{\text{watermark data bits}}{\text{OFD file bits}} \tag{10}$$
With a watermarking algorithm that offers a higher watermark capacity, more watermark information can be embedded within the document. This capability enables the handling of diverse document lengths and multilevel watermarks, facilitating the transmission of larger amounts of information in practical applications.
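The small helpers below compute the capacity metric of Equation (10) and the size-change rate reported in Figure 12; file paths are illustrative, and sizes are read directly from disk.

```python
# Watermark capacity and file-size change rate.
import os

def watermark_capacity(watermark_bytes: int, ofd_path: str) -> float:
    """Ratio of watermark data bits to OFD file bits (the bit factors cancel)."""
    return watermark_bytes / os.path.getsize(ofd_path)

def size_change_rate(original_path: str, watermarked_path: str) -> float:
    before = os.path.getsize(original_path)
    after = os.path.getsize(watermarked_path)
    return (after - before) / before

# Example: a 2 KB watermark in a 100 KB OFD file gives a capacity of 0.02.
```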
In the annotation structure-based watermarking algorithm, adding a watermark involves appending both the annotation structure information and the watermark content itself to the Annotation.xml file of the target watermark page. In contrast, E-SAWM converts and encrypts the watermark information into multiple groups (referred to as k groups). These k groups, along with their corresponding encrypted watermark characters, are then added to the Content.xml file of the designated watermark page. Compared with the annotation structure-based algorithm, our algorithm therefore adds more information to the original OFD file during the watermarking process.
To assess whether the increased information can be accommodated within the acceptable carrying range, we conducted a watermarking capacity test. For this evaluation, we randomly selected twenty OFD files of various sizes as samples. Additionally, we generated twenty watermarks with different information contents. Each watermark was individually matched with a corresponding file and embedded using the watermark algorithm. To measure the impact of watermarking on document size, we calculated the rate of change in the OFD file size by comparing its size before and after the watermark-embedding process, as shown in Figure 12.
The experiments demonstrate that when embedding watermarks of various sizes into each sample OFD document, the document size experiences minimal fluctuations, which suggests that E-SAWM effectively handles the embedding of high-capacity watermark information without significantly impacting the document size. Moreover, it showcases the algorithm’s ability to accommodate a substantial amount of watermark information.

5. Conclusions

With the rapid development of the Internet, ensuring data security has become a critical concern within the financial industry. Tracing leaked data plays a crucial role in safeguarding data integrity. The growing use of OFD documents, particularly in electronic tax returns and statements, emphasizes their importance in the financial sector.
We propose an innovative OFD watermarking framework, E-SAWM, in the edge cloud scenario that utilizes semantic analysis to incorporate implicit watermarks into OFD documents. By encrypting watermarking information into highly simulated structural statements and securely embedding them within the structural components of OFD files, E-SAWM provides a robust solution. Experimental evaluations confirm the effectiveness of the algorithm, demonstrating its high concealment, strong robustness, and substantial watermarking capacity. Consequently, the proposed algorithm enhances data security in the financial industry. Overall, our research contributes to the advancement of data security measures in the financial domain, addressing the pressing need for traceability and protection against data leakage in an era of rapid technological advancements.

Author Contributions

Conceptualization, L.Z. (Lijun Zu), H.L. and L.Z. (Liang Zhang); Methodology, L.Z. (Lijun Zu) and J.Y.; Software, L.Z. (Liang Zhang); Validation, L.Z. (Lijun Zu) and H.L.; Formal Analysis, L.Z. (Lijun Zu) and H.L.; Investigation, L.Z. (Lijun Zu) and X.Z.; Resources, L.Z. (Liang Zhang) and X.Z.; Data Curation, X.Z.; Writing—Original Draft Preparation, L.Z. (Lijun Zu) and L.Z. (Liang Zhang); Writing—Review and Editing, L.Z. (Lijun Zu) and H.L.; Visualization, H.L.; Supervision, Z.L. (Zhihui Lu) and S.H.; Funding Acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2021YFC3300600), the National Natural Science Foundation of China (Grant Nos. 61873309, 92046024, and 92146002), and the Shanghai Science and Technology Project (Grant No. 22510761000).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Premchand, A.; Choudhry, A. Open banking & APIs for Transformation in Banking. In Proceedings of the 2018 International Conference on Communication, Computing and Internet of Things (IC3IoT), Chennai, India, 15–17 February 2018; pp. 25–29.
2. Longlei, H.; Peiliang, Z.; Hua, J. Research and implementation of format document OFD electronic seal module. Inf. Technol. 2016, 40, 76–80.
3. Kassab, M.; Laplante, P. Trust considerations in open banking. IT Prof. 2022, 24, 70–73.
4. Hong, X.; Wang, Y. Edge computing technology: Development and countermeasures. Strateg. Study Chin. Acad. Eng. 2018, 20, 20–26.
5. Giust, F.; Costa-Perez, X.; Reznik, A. Multi-access edge computing: An overview of ETSI MEC ISG. IEEE Tech Focus 2017, 1, 4.
6. Cui, G.; He, Q.; Li, B.; Xia, X.; Chen, F.; Jin, H.; Xiang, Y.; Yang, Y. Efficient verification of edge data integrity in edge computing environment. IEEE Trans. Serv. Comput. 2021, 15, 3233–3244.
7. Gu, L.; Zhang, W.; Wang, Z.; Zeng, D.; Jin, H. Service management and energy scheduling toward low-carbon edge computing. IEEE Trans. Sustain. Comput. 2022, 8, 109–119.
8. Zhang, Z.; Avazov, N.; Liu, J.; Khoussainov, B.; Li, X.; Gai, K.; Zhu, L. WiPOS: A POS terminal password inference system based on wireless signals. IEEE Internet Things J. 2020, 7, 7506–7516.
9. Ati, M.; Al Bostami, R. Protection of data in edge and cloud computing. In Proceedings of the 2022 IEEE International Conference on Computing (ICOCO), Sabah, Malaysia, 14–16 November 2022; pp. 169–173.
10. Chen, L.; Liu, Z.; Wang, Z. Research on heterogeneous terminal security access technology in edge computing scenario. In Proceedings of the 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), Qiqihar, China, 28–29 April 2019; pp. 472–476.
11. Song, Z.; Ma, H.; Zhang, R.; Xu, W.; Li, J. Everything under control: Secure data sharing mechanism for cloud-edge computing. IEEE Trans. Inf. Forensics Secur. 2023, 18, 2234–2249.
12. Wang, Z.; Fan, J. Flexible threshold ring signature in chronological order for privacy protection in edge computing. IEEE Trans. Cloud Comput. 2020, 10, 1253–1261.
13. GB/T 33190-2016; Electronic Files Storage and Exchange Formats—Fixed Layout Documents. China National Standardization Management Committee: Beijing, China, 2016.
14. Yu, Z.; Ting, L.; Yihen, C.; Shiqi, Z.; Sheng, L. Natural language text watermarking. J. Chin. Inf. Process. 2005, 19, 57–63.
15. Wang, X.; Jin, Y. A high-capacity text watermarking method based on geometric micro-distortion. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 1749–1755.
16. Zhao, W.; Guan, H.; Huang, Y.; Zhang, S. Research on double watermarking algorithm based on PDF document structure. In Proceedings of the 2020 International Conference on Culture-oriented Science & Technology (ICCST), Beijing, China, 28–31 October 2020; pp. 298–303.
17. Zhengyan, Z.; Yanhui, G.; Guoai, X. Digital watermarking algorithm based on structure of PDF document. Comput. Appl. 2012, 32, 2776–2778.
18. Khadam, U.; Iqbal, M.M.; Habib, M.A.; Han, K. A watermarking technique based on file page objects for PDF. In Proceedings of the 2019 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), Victoria, BC, Canada, 21–23 August 2019; pp. 1–5.
19. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
20. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 1–9.
21. Asgari, E.; Mofrad, M.R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 2015, 10, e0141287.
22. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
23. Hao, Y.; Chuang, L.; Feng, Q.; Rong, D. A survey of digital watermarking. Comput. Res. Dev. 2005, 42, 1093–1099.
Figure 1. Deployment of Regulatory Outposts on Edge Clouds.
Figure 2. Data processing work flow in regulatory outposts within edge cloud scenarios. Components of the regulatory outpost data process: (1) Data provider: a bank or transit platform responsible for data processing and forwarding. (2) Data storage and destruction: a database provided by the application, subject to audit by regulatory outposts. (3) Data user: terminal equipment or other business systems accessing the database for tasks such as data display, statistical analysis, and external sharing.
Figure 3. Structure of OFD.
Figure 4. OFD annotation file contents for watermarking.
Figure 5. Illustration of OFD page with added explicit watermark.
Figure 6. Adding watermark to data received from banking applications database.
Figure 7. Adding watermark when application-side employees download data from the database.
Figure 8. Adding watermark when the application side shares externally.
Figure 9. Overview of E-SAWM.
Figure 10. Visual comparison of watermark effects. (a) OFD page before adding watermark. (b) OFD page after adding watermark.
Figure 11. Example of each post-attack OFD page. Please note that in the visual representation presented: Colorful lines symbolize distinct attack methods, with yellow representing “highlight”, blue indicating “Wavy line”, a green line representing “underline”, and red signifying “Strikethrough”. Green font signifies “Handwritten graffiti”, while gray font indicates “Text overlay”.
Figure 12. Watermark capacity tests. (a) Watermark data size in various OFD Files. (b) Variation of the corresponding OFD files before and after watermark embedding.
Table 1. Internal Structural file description of OFD.
FILE/FOLDER | Description
OFD.xml | Main entry file of the OFD package; describes the basic OFD file information
Doc_N | The Nth document folder
Document.xml | Description file of the Doc_N folder, including information about the subfiles and subfolders contained under Doc_N
Page_N | The Nth page folder
Content.xml | Content description of page N
PageRes.xml | Resource description of page N
Res | Resource folder
PublicRes.xml | Index of the document's public resources
DocumentRes.xml | Index of the document's own resources
Image_M.png / Font_M.ttf | Resource files
Table 2. Extraction success rate of watermarks following each attack.
Attack Type | Example of Attack Content | Watermark Extraction Success Rate
Highlight | Left Column, Line 1; Right Column, Lines 2–5 | 100%
Underline | Left Column, Line 2; Right Column, Lines 2–3 | 100%
Strikethrough | Left Column, Line 5; Right Column, Lines 2–3 | 100%
Wavy line | Left Column, Line 6; Right Column, Line 4 | 100%
Handwritten graffiti | Right Column, Line 2 | 100%
Text overlay | Full Page | 100%

