1. Introduction
Paper documents convey a great deal of information in human communication. They often contain typical elements such as text, tables, stamps, and signatures. Due to the advancement of digital technologies, a massive number of document images have been digitized in Taiwan, so users can now easily search and access them. For document image searching, several document image retrieval methods [1,2,3] have been developed. In ref. [1], representative features extracted from document images by a Convolutional Neural Network (CNN) were used for a retrieval task. With regard to personal information protection, since property rights may be transferred and spatial attributes may be modified, document images must record these changes. Accordingly, document images can greatly influence real estate value, which in turn raises great concern for people's property rights. However, malicious users may have little difficulty obtaining a property owner's information or even manipulating document images for illegal purposes. This is why document images should be kept secure and private in terms of privacy policies.
Figure 1 illustrates three examples of digitized document images in Taiwan. As we can observe in Figure 1, each document image usually has a page frame, i.e., the smallest rectangle enclosing most of the foreground elements; some Chinese words appear around the page frame, and some noise may exist. The foreground elements in the page frame often include the owner's, designer's, and building's information. To avoid deliberate prying or illegal access, it is important to keep the property owner's information private and secure from the viewpoint of the privacy policy. However, the cost of manually detecting and concealing the property owner's information in document images is very high. This means that automatically detecting the personal information in document images is an important pre-processing step for intelligent and efficient application software. Furthermore, as shown in Figure 1, personal information usually follows specific printed Mandarin Chinese words, such as “業主” (Property Owner), “姓名” (Name), and “住址” (Address). It is expected that correctly detecting suitable, specific, printed Mandarin Chinese words can easily pin down personal information in a document image. These observations have motivated us to develop a scheme for automatically detecting specific printed Mandarin Chinese words that are useful for localizing personal information in document images.
In traditional document processing, image layout analysis, which locates regions of interest (ROIs), is often designed as a pre-processing step for further analysis and applications such as personal information protection, document classification, document image analysis, and document image retrieval [4,5]. In our previous work [4], we proposed a coarse-to-fine scheme to automatically detect ROIs in digitized cadastral images in Taiwan. The existing scheme was composed of four parts: pre-processing, skew correction, noise reduction, and ROI localization. After finding the coordinates of the candidate region in the high-resolution de-skewed image, ROIs can be localized in the high-resolution images by the ROI location algorithm in the fine-detection stage. Although ROIs can be localized in digitized cadastral images, it is still somewhat complicated to find the correct position of personal information from the layout information of cadastral images. In ref. [5], a machine-learning-based patient identification recognition method was proposed for a medical information system. In that method [5], color information is used to identify camera-captured screen images, a bilateral filter is used to reduce the effect of noise in the captured screen images, and then color and spatial information are used to roughly locate the candidate region. After skew correction, a template-matching algorithm is used to find special symbols for locating the ROI. Then, the patient identification information can be recognized in the ROI. Unfortunately, traditional image layout analysis methods may not deal well with noisy document images. Poor image layout analysis or document segmentation is expected to affect the subsequent recognition process [6].
Compared with traditional machine-learning methods based on hand-crafted features, Deep Neural Networks (DNNs) [7,8] have received more and more attention due to their excellent performance in image classification, speech recognition, fraud detection, and so on. Many CNNs, such as AlexNet, VGG, GoogleNet, and ResNet, demonstrate better capabilities to extract multi-level visual features for improving the accuracy of classification models [7,8,9,10,11,12,13]. For document image analysis and scene image analysis, several deep-learning-based methods exist [6,14,15,16,17,18]. For example, a deep-learning approach based on a fully convolutional network (FCN) [9] was proposed for document segmentation [15]. In ref. [16], a fast CNN-based method was proposed to automatically perform layout analysis for document images. In that method [16], a document image is segmented into blocks, and these blocks are classified into three categories, i.e., text, table, and image, by a CNN. In [6], deep CNNs such as AlexNet, VGG-16, GoogleNet, and ResNet-50 are used to classify document images into categories such as emails, news articles, and invoices. In ref. [17], an existing method, EAST, similar to FCN [9], is exploited to detect text regions in scene images. In EAST, multi-level feature maps are extracted, merged, and fed into the output layer for pixel-level predictions of text regions. Unfortunately, the text detection method [17] may not be suitable for analyzing document images such as those in Figure 1, which usually contain many Mandarin Chinese words.
As far as we are concerned, object detection plays an important role in vision-based applications such as face detection, traffic sign detection, character/digit detection, person re-identification, animal detection, and meter reading [5,19,20,21,22]. So far, in addition to traditional machine-learning-based methods, there are several CNN-based object-detection algorithms, including the region-based convolutional neural network (R-CNN), Faster R-CNN, you only look once (YOLO), the single-shot multibox detector (SSD), and RetinaNet [23]. For example, an automatic meter-reading method was proposed for gas meter reading [20]. That method [20] is composed of three steps: meter detection, digit segmentation, and number recognition, which adopt YOLOv3, maximally stable extremal regions (MSER), and a modified VGG network, respectively [20]. In [22], a two-stage method is proposed for automatic meter reading. In the first stage, a small model, called Fast-YOLO, is used for counter detection. After counter detection, digit recognition is performed based on CR-NET, CRNN, and multi-task learning in the second stage. For classifying document images, some Mandarin Chinese words such as “業主” (Property Owner) and “起造人姓名” (Name of Applicant) can be used as special information to distinguish Figure 1a from Figure 1b. This circumstance has inspired us to analyze document images for finding the specific information via CNNs. Therefore, we aim to develop a CNN-based object-detection scheme to directly detect and recognize Mandarin Chinese words in noisy document images for document image analysis.
The remainder of this paper is organized as follows. Section 2 gives the system description of the proposed keyword-detection scheme. Section 3 and Section 4 elaborate the key character detection and lexicon analysis of the proposed scheme, respectively. Section 5 shows experimental results, and Section 6 gives a discussion. Section 7 draws conclusions and outlines future work.
2. System Description
As mentioned in the previous section, some specific Mandarin Chinese words in document images can be considered as objects. Consider an input document image X of size W × H, where X(x, y) denotes the pixel value at the coordinate (x, y), and W and H represent the width and height of X, respectively. The goal of the proposed scheme is to detect keywords, k1, k2, …, kN, in X. As we know, a keyword (i.e., specific word) is often composed of some key characters (i.e., specific characters) in Mandarin Chinese, i.e., ki = (ci,1, ci,2, …), where ci,j denotes the j-th key character of the i-th keyword. For example, the keyword “姓名” (Name) is composed of two key characters, “姓” (Surname) and “名” (First Name), in Mandarin Chinese. This means that a keyword can be found after analyzing these detected key characters. Therefore, it is important to efficiently localize and analyze the key characters in X for the proposed keyword-detection scheme.
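As a minimal sketch of this keyword/key-character relationship, the distinct key characters can be derived directly from a keyword list; the three-keyword set below is a small illustrative assumption, not the full set used in the proposed scheme:

```python
# Illustrative sketch: a keyword is an ordered sequence of key characters.
# The keyword list below is an assumed subset for demonstration only.
KEYWORDS = ["姓名", "住址", "業主"]  # Name, Address, Property Owner

def key_characters(keywords):
    """Collect the set of individual key characters composing the keywords."""
    return {ch for word in keywords for ch in word}

# The detector is trained on these characters; the lexicon analysis later
# recombines detections into whole keywords.
chars = key_characters(KEYWORDS)
```

This separation mirrors the two-part scheme: the character set drives key character detection, while the keyword list drives the lexicon analysis.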
As shown in Figure 1, detecting key characters in digitized document images involves several issues. First, document quality and digitization processing may affect the image properties of digitized document images. As shown in Figure 1a, the visual quality of Figure 1a is poor, and it is tough to analyze such degraded document images to detect key characters. Moreover, the image quality and spatial resolutions of scanned document images are not the same. Second, marginal noise, artificial noise, and random noise may exist in digitized document images. As shown in Figure 1, there is marginal noise around the paper frames, and a seal appears as artificial noise on the bottom-right part of Figure 1a. There is no doubt that analyzing the layout of a noisy document image is difficult. Third, these documents may share some popular keywords, but other keywords occur only rarely. For example, the keyword “姓名” (Name) usually exists in document images, but “申請人” (Applicant) may appear only in some kinds of documents. Fourth, keywords may appear in various locations and with different fonts. For instance, the keyword “姓名” (Name) has different fonts in Figure 1.
Considering the above issues and the efficiency requirement for keyword detection, the block diagram of the proposed scheme is shown in Figure 2. As shown in Figure 2, the proposed scheme is composed of two parts: key character detection and lexicon analysis.

The first part is designed to effectively detect the key characters in a document image. As mentioned in Section 1, CNN-based object-detection methods can be used to localize key characters. These methods can be classified into two classes: one-stage and two-stage. Two-stage methods first find many region proposals, and then these region proposals are analyzed, located, and recognized in the second step. Frameworks such as R-CNN, Fast R-CNN, and Faster R-CNN are two-stage methods [24,25,26]. On the contrary, one-stage methods [11,12,13,27], such as SSD, RetinaNet, and YOLO, perform region proposal searching, localization, and recognition in a single stage. Although one-stage methods evaluate more proposals than two-stage methods, they still have a shorter computational time.
In addition, key characters with different frequencies of occurrence may appear in a document image, and the number of key characters may vary across document images. As we can see in Figure 1, some key characters, e.g., “名” (Name), occur often, but others, e.g., “業” (Property), seldom appear in document images. This means that class imbalance may arise, so key characters with low appearance frequencies may not be detected well. As we know, the one-stage, CNN-based object-detection method RetinaNet [27], with its novel loss function, can not only detect objects but also reduce the impact of class imbalance on object classification. From the viewpoint of object-detection performance, RetinaNet outperforms SSD, YOLOv2, and YOLOv3 on the COCO dataset [28]. From the viewpoint of computational complexity, an object-detection network usually has higher computational complexity than an image classification network [28]. Among one-stage object-detection networks, RetinaNet is superior to YOLOv3 in terms of FLOPs (floating-point operations) and mAP (mean average precision) for object detection [28]. This means that RetinaNet is a better choice for object detection in terms of computational complexity and object-detection performance. Thus, we develop the key-character-detection method based on RetinaNet [27].
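For reference, RetinaNet's focal loss down-weights well-classified examples so that rare classes are not overwhelmed during training; a minimal numpy sketch of the binary form is given below. The defaults α = 0.25 and γ = 2 are the commonly reported settings, and this binary form is a simplification of the multi-class loss used inside RetinaNet:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p : predicted probability of the positive class; y : 0/1 label.
    The (1 - p_t)**gamma factor shrinks the loss of well-classified
    examples, mitigating the class imbalance RetinaNet targets.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy positive (p = 0.9) is penalized far less than a hard one (p = 0.1),
# so frequent, easily detected characters do not dominate the gradient.
easy = focal_loss(np.array([0.9]), 1)
hard = focal_loss(np.array([0.1]), 1)
```

With γ = 0 and α = 1, the expression reduces to ordinary cross-entropy, which makes the down-weighting effect of γ easy to verify.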
After key character detection, the other part is to analyze these detected key characters to obtain the keywords in a document image. Then, the detected keywords can be exploited for personal information protection and document image classification. In the following, we elaborate on each part of the proposed scheme.
4. Lexicon Analysis
Since a document image should contain some key characters, these characters can be detected within the ROI after key character detection. However, some detected characters may not be helpful for finding personal information. The question then becomes how to analyze the detected key characters to obtain the correct keywords in the ROIs.
In fact, a word in Mandarin Chinese is often composed of two or three characters. When trying to find meaningful keywords in a document image, there are two problems. The first is that two or three arbitrary characters may not form a meaningful word in Mandarin Chinese. For example, the keyword “業主” (Property Owner), composed of “業” (Property) and “主” (Owner), is a meaningful word, but the combination of “址” (Address) and “主” (Owner) is non-meaningful in Mandarin Chinese. In addition, the order of two or three characters also matters for forming a meaningful word. For example, the combination “姓名” (Surname–First Name) is meaningful, but the combination “名姓” (First Name–Surname) is not. This means that a lexicon analysis is necessary in the proposed scheme.
On the other hand, the spatial relation of two or three key characters can be horizontal or vertical. As shown in Figure 1a, keywords may be arranged in the horizontal or vertical direction. For example, the two keywords “業主” (Property Owner) and “住址” (Address) are in the vertical and horizontal directions, respectively. This means that the lexicon analysis algorithm should be performed in both the horizontal and vertical directions. Furthermore, Figure 1 shows that the spatial distance between two or three key characters may differ. For example, the spatial distance between “姓” (Surname) and “名” (First Name) in Figure 1a is smaller than that in Figure 1b. To find two or three characters that form a meaningful word, a searching algorithm is proposed here. Therefore, a lexicon analysis algorithm is necessary to find and analyze the combinations of possible key characters for obtaining correct keywords.
To devise the searching algorithm in the lexicon analysis, we can exploit the observation that, in documents in Taiwan, the distance between two characters is usually a multiple of a character's width. This means that a character's width can be considered prior information for the searching algorithm. To estimate a character's width, the bounding boxes enclosing the detected characters are collected during the training phase, and then their width information is analyzed to estimate the size information as a prior.
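A simple way to realize this width estimation, assuming the training-phase bounding boxes are available as (x, y, w, h) tuples (an illustrative data layout, not necessarily the one used in the implementation), is to take a robust statistic such as the median:

```python
import statistics

def estimate_char_width(boxes):
    """Estimate a prior character width from detected-character bounding boxes.

    boxes: iterable of (x, y, w, h) tuples collected during training.
    The median is used so that a few mis-detected, oversized boxes
    do not distort the estimate.
    """
    widths = [w for (_, _, w, _) in boxes]
    return statistics.median(widths)

# Three normal character boxes plus one oversized outlier (width 90):
boxes = [(10, 5, 32, 34), (50, 5, 30, 33), (90, 5, 31, 35), (130, 5, 90, 34)]
prior_width = estimate_char_width(boxes)
```

The median keeps the prior near the typical character width even when some boxes enclose merged or spurious regions.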
Figure 6 illustrates the lexicon analysis. Assume there are N keywords to be determined in the proposed scheme. Then, the lexicon analysis algorithm in the horizontal direction is described as follows:
- (L1)
Select one possible key character as the first one, c1, and then measure its center, whose coordinate is (x1, y1). For example, “姓” (Surname) is the first key character.
- (L2)
Determine whether another key character is around c1. According to the prior width information of a character, the positions whose horizontal coordinate is around x1 + j · w are checked for whether a key character exists, where j is an integer and w is the estimated width of a key character. If there is a key character around these positions, it is considered as the second one, c2. For example, “名” (First Name) is the second character. Then, the two key characters are considered as the first keyword candidate, k1. For example, “姓名” (Name) is the candidate of the first keyword.
- (L3)
Decide whether the combination of the two key characters matches one of the pre-defined keywords. If the first keyword candidate k1 matches one of the pre-defined keywords, k1 is a detected keyword. For example, “姓名” (Name) is meaningful, and it is also one of the pre-defined keywords.
- (L4)
Repeat Steps (L2) to (L3) to obtain a keyword if the combination of these key characters matches one of the pre-defined keywords. For example, “住址” (Address) is another pre-defined keyword.
- (L5)
Repeat Steps (L1) to (L4) to detect all of the keywords.
Similar to Steps (L1) to (L5), the lexicon analysis algorithm can be performed in the vertical direction to detect some keywords, e.g., “業主” (Property Owner). After lexicon analysis, these keywords can be detected in document images.
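The horizontal search in Steps (L1) to (L5) can be sketched as follows; this simplified version handles only two-character keywords, and the detection layout (character, center-x, center-y) and the tolerance value are illustrative assumptions rather than the exact implementation:

```python
def find_keywords_horizontal(detections, lexicon, char_width, tol=0.5):
    """Greedy horizontal lexicon search over detected key characters.

    detections: list of (char, cx, cy) with bounding-box centers.
    lexicon: set of pre-defined keywords (ordered character strings).
    A character found at roughly cx + j * char_width (j = 1, 2, ...) on
    the same text line is accepted as the next character of a candidate.
    """
    found = []
    for ch1, cx1, cy1 in detections:                       # step (L1)
        for ch2, cx2, cy2 in detections:                   # step (L2)
            if (ch1, cx1, cy1) == (ch2, cx2, cy2):
                continue
            same_line = abs(cy2 - cy1) < tol * char_width
            gap = (cx2 - cx1) / char_width                 # in character widths
            near_multiple = gap > 0 and abs(gap - round(gap)) < tol
            if same_line and near_multiple:
                candidate = ch1 + ch2                      # step (L3)
                if candidate in lexicon and candidate not in found:
                    found.append(candidate)                # steps (L4)-(L5)
    return found

dets = [("姓", 100, 50), ("名", 132, 50), ("住", 100, 120), ("址", 132, 120)]
words = find_keywords_horizontal(dets, {"姓名", "住址"}, char_width=32)
```

Note that character order is preserved (cx2 must lie to the right of cx1), so “名姓” is never accepted even when both characters are detected; the vertical variant simply swaps the roles of the x and y coordinates.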
6. Discussions
To re-train the model of RetinaNet with a pre-trained model, synthetic image generation and data augmentation are exploited to yield a large image dataset. Then, a testing set of 112 images is used for performance evaluation.
As for key character detection, Figure 13 shows the AP rates of the proposed key-character-detection algorithm. As shown in Figure 13, most AP rates are higher than 0.82, except for two key characters, “申” (Apply) and “築” (Building). According to Figure 10a4, Figure 11, and Figure 12, the main reason is that the visual quality of these two key characters is poor, which consequently lowers the AP rates of the proposed key-character-detection algorithm. Furthermore, the mAP of the proposed scheme is 85.1% over all key characters. The experimental results show that the proposed scheme based on RetinaNet can locate and recognize the key characters successfully even when the quality of the input images is not good.
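The AP rates above rest on matching predictions to ground truth via IoU; a minimal sketch of IoU for axis-aligned boxes given as (x1, y1, x2, y2) corners (the corner representation is assumed here for illustration):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction typically counts as a true positive when its IoU with a ground-truth box of the same class exceeds a threshold (often 0.5), and AP is then the area under the resulting precision-recall curve.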
As for keyword detection, Figure 14 illustrates the AP results of the proposed keyword-detection algorithm. As shown in Figure 14, most AP rates are higher than 0.89, except for two keywords, “申請人” (Applicant) and “建築” (Building). According to Figure 10, Figure 11, Figure 12 and Figure 13, the main reason is that the proposed scheme may fail to detect the two poor-quality key characters, “申” (Apply) and “築” (Building), and therefore the two keywords, “申請人” (Applicant) and “建築” (Building). In addition, the mAP of the proposed scheme is 85.84%, as shown in Figure 14. The experimental result shows that the lexicon analysis can effectively find suitable key characters and that the keywords can be obtained by analyzing the combinations of these key characters. Therefore, the experimental results demonstrate that the proposed keyword-detection algorithm can effectively locate and recognize the keywords in document images with different image qualities.
The proposed scheme runs at about 0.2 frames per second on the experimental platform. It is expected that the execution time can be reduced if our program is optimized or a more powerful graphics card is used.
Since the proposed scheme is developed for specific document images, such as building use permits in Taiwan, the key characters and the keywords are determined according to our experience and analysis. This means that the proposed scheme is only suitable for dealing with such specific document images.
7. Conclusions and Future Work
In this paper, a keyword-detection scheme based on deep convolutional neural networks was proposed for personal information protection in document images. The proposed scheme is composed of key character detection and lexicon analysis. The first part, key character detection, is developed based on RetinaNet and transfer learning. To find the key characters, RetinaNet, composed of convolutional layers, a feature pyramid network, and two subnets, is exploited to detect key characters within the region of interest in a document image. After key character detection, lexicon analysis is used to analyze the detected key characters and combine them to obtain the keywords.
To train the model of RetinaNet, synthetic image generation and data augmentation are exploited to yield a large image dataset. To evaluate the proposed scheme, IoU (Intersection over Union) and mAP (mean Average Precision) are utilized as performance measurements. For performance evaluation, many document images of different types were selected for testing. The experimental results show that the proposed scheme can detect the key characters well even under poor visual quality, different fonts, and different sizes. The mAP rates of the proposed scheme are 85.1% and 85.84% for key character detection and keyword detection, respectively. Furthermore, the proposed scheme is superior to the Tesseract OCR (Optical Character Recognition) software for detecting the key characters in document images with Mandarin Chinese words. The experimental results demonstrate that the proposed method can effectively localize and recognize these keywords within document images with Mandarin Chinese words.
Since the proposed scheme is developed for specific document images, such as building use permits in Taiwan, future work should increase the number of keywords to deal with more kinds of document images in Taiwan. Furthermore, detecting and analyzing the keywords in document images can be utilized for document image analysis and document classification.