Article

UTTSR: A Novel Non-Structured Text Table Recognition Model Powered by Deep Learning Technology

1 Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250014, China
2 Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan 250014, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(13), 7556; https://doi.org/10.3390/app13137556
Submission received: 29 May 2023 / Revised: 17 June 2023 / Accepted: 21 June 2023 / Published: 27 June 2023
(This article belongs to the Special Issue Intelligent Analysis and Image Recognition)

Abstract

To prevent documents from being edited or tampered with, many tables are distributed as non-editable, non-structured text such as PDFs or images. Quickly recognizing the contents of such tables remains a challenge due to irregular formats, uneven text quality, and complex and diverse table content. This article proposes the UTTSR table recognition model, which consists of four parts: table region detection, text line detection, text line recognition, and table sequence recognition. For table detection, a Cascade Faster R-CNN with a ResNeXt105 backbone is implemented, using TPS (thin plate spline) and affine transformations to rectify the image and improve accuracy. For text line detection, DBNet is used with Do-Conv in the FPN (feature pyramid network) to speed up training. Text lines are recognized with an attention-based CRNN instead of CTC decoding, which improves recognition performance. Table sequence recognition is based on a transformer combined with post-processing algorithms that fuse the table structure sequence and the cell content. Experimental results show that the UTTSR model outperforms the compared methods and significantly improves on the previous state-of-the-art F1 score on complex tables, reaching 97.8%.

1. Introduction

There is currently high demand from enterprises and organizations, in both academia and industry, for recognizing tables in non-structured text. Recognizing three-line, half-line, and complex tables containing special characters and formulas remains an open problem. Many researchers have applied deep learning to table recognition in unstructured text. For example, Kashinath et al. [1] proposed an end-to-end table structure recognition method for scanned documents, and CascadeTabNet [2] generates table structures with cell-level information to capture the complex relationships of merged cells. However, these methods still lack a good solution for blurry tables and curved table borders.
Currently, fast and accurate recognition of table content is still a challenge because of irregular formats, uneven text quality, and complex and diverse table content. This paper proposes a novel table recognition algorithm, UTTSR, which combines table detection, text line detection and recognition, and table structure sequence recognition to extract content from non-structured text. Our contributions are summarized as follows:
(1)
Table detection utilizes Cascade Faster R-CNN with ResNeXt105 to extract image features. Additionally, TPS and affine transformations are employed to enhance the accuracy of table detection.
(2)
Improved DBNet and attention-based CRNN are utilized for the detection and recognition phase of text lines in order to identify simple formulas and special characters.
(3)
TableMaster is implemented to convert the resulting table structure into an HTML format that can be easily used for future applications.
The remaining part of this article is structured as follows. Section 2 provides an overview of related works on table recognition. Section 3 describes our methodology, while Section 4 presents our experimental results and analysis. Finally, we conclude the article in Section 5.

2. Related Work

2.1. Table Detection

Early methods for table detection have relied on image processing and some text information. For example, Watanabe et al. [3] and Hirayama [4] processed scanned documents to obtain text blocks and horizontal and vertical lines that could be used to locate tables. Ramel et al. [5] found the first horizontal line at the top of the table and searched for other regions by matching 4 “T”-shaped templates in nine cases where frame lines intersected. Watanabe et al. [6] used horizontal and vertical lines, as well as the upper left corner of the cell, as a reference point to determine the table position.
In recent years, numerous deep-learning-based algorithms for table detection have emerged. For instance, Schreiber et al. [7] adopted Faster R-CNN [8] for table detection. Wang et al. [9] proposed a progressive scale expansion network (PSENet), which accurately detects text of any shape and alleviates the problem of relatively poor cell detection. Zhang et al. [10] proposed the VSR network, which combines the visual and semantic information of documents.

2.2. Table Structure Recognition

Table structure recognition involves identifying the layout or hierarchical structure of the table’s content, mainly the row and column position of each cell and the relationships between cells. Early researchers typically designed heuristic algorithms for table structure recognition. For example, Rahgozar et al. [11] presented a row- and column-based table structure recognition method that first identifies text blocks in the image. Khan et al. [12] used a sequence model to extract the table structure and to classify rows and columns. Xue et al. [13] proposed a trainable table graph reconstruction network to extract meaningful information about the table structure. Ye et al. [14] divided table recognition into four sub-tasks: table structure detection, text line detection, text line recognition, and box assignment. Fischer et al. [15] proposed Multi-Type-TD-TSR, providing an end-to-end solution for table detection and table structure recognition.
In recent years, numerous deep-learning-based methods have been proposed for table structure recognition. For instance, the DeepDeSRT system [7] uses an FCN [16] as the basic architecture to perform semantic segmentation of rows and columns. Paliwal et al. [17] proposed TableNet, a novel end-to-end deep learning model that handles table detection and structure recognition within a semantic segmentation framework. DeepTabStR [18] applies deformable convolution to the object detection network, detects rows, columns, and cells simultaneously, and restores the table according to the location characteristics of the cells. PubTabNet [19] provides the HTML code of the table structure and the text content of each cell; in its encoder-dual-decoder (EDD) model, a single convolutional neural network is used in the encoding stage, while two recurrent neural networks are used in the decoding stage. Qiao et al. [20] proposed the LGPMA network, which focuses on cell detection; it fully utilizes local and global visual features to obtain more reliable aligned cell regions through a mask re-scoring strategy and uses soft labels to mitigate the interference of blank cells. Long et al. [21] proposed Cycle-CenterNet for table structure recognition on WTW, a natural-scene table recognition dataset; based on CenterNet [22], the centers of cells and the intersections of four cells are detected simultaneously, and the table structure is restored directly after cell detection is completed.

2.3. Table Content Recognition

Table content recognition involves recognizing the text within a cell, while also performing tasks such as table classification, cell classification, and information extraction based on the table content. Zhong et al. [19] introduced a novel measure called TEDS for table recognition, which effectively balances errors in text content recognition and table structure recognition. Li et al. [23] presented Tablebank, a new dataset for image-based table detection and recognition, constructed using weak supervision from Word and LaTeX documents found online. This dataset aims to empower deep learning methods in table-detection and recognition tasks.
Lu et al. [24] proposed a self-attention-based mechanism for scene text recognition, which learns self-attention while encoding input–output attention. Li et al. [25] developed a graph-based convolutional network GFTE and an open-source financial benchmark dataset FinTab. Meanwhile, Yang et al. [26] devised a two-channel network for formula recognition in table cells, although it must be used with other structure recognition models in real-world scenarios.
Wu et al. [27] developed a text-to-table approach. Ly et al. [28] proposed an end-to-end multitask learning model for image-based table recognition, consisting of a shared encoder, a shared decoder, and three independent decoders. Smock et al. [29] solved the issue of misalignment in the benchmark dataset for table structure recognition. Wang et al. [30] introduced a method that is robust to S-TSR sub-scenarios with different challenges, while Huang et al. [31] utilized visual alignment sequential coordinate modeling to enhance table recognition and to produce better cell boundary boxes.
Kazdar et al. [32] conducted table detection and table structure recognition using deep learning and heuristic methods. Zhou et al. [33] proposed a structure-focused approach that remains effective on noisy images, while Lee et al. [34] proposed a new graph-based framework for table structure recognition designed to handle complex tables. Furthermore, Nassar et al. [35] presented a new table recognition model that can extract table contents directly from PDF sources, although it needs improvement for complex tables. Smock et al. [36] introduced the PubTables-1M dataset, which contains nearly one million tables from scientific articles, supports multiple input forms, and provides detailed header and location information for table structures.

3. Methods

Our method utilizes the Cascade Faster R-CNN to detect tables and employs TPS and affine transformations to rectify images and improve the accuracy of table detection. In addition, we utilize DBNet to detect text lines and an attention-based CRNN to recognize the text within these lines. The table information is encoded by a transformer and is then decoded along two processing paths: one handles text position regression for cell text, while the other extracts features of the table structure. The final table structure recognition result is obtained by logically combining the outputs of both paths, which cover three features: cell text position, cell text content, and table structure. The overall structure is shown in Figure 1.
In Figure 1, there are two main branches. One branch consists of the core TableMaster component, which uses the transformer to encode table information and then splits it into two processing lines for regression of cell text position and extraction of table structure features via a decoder. The other branch first goes through Cascade Faster R-CNN and splits into two processing lines: one for regressing the cell text position and fusing it with the corresponding output line from TableMaster to obtain the final cell text position feature and the other for using the CRNN module to obtain the cell text content feature. Finally, the three outputs from both branches, namely the cell text position feature, cell text content feature, and table structure feature, are logically added together and are subjected to post-processing to obtain the final table structure recognition result.
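To make the overall flow concrete, the following is a minimal sketch of how the four stages could be orchestrated at inference time. The stage functions (detect_tables, correct_geometry, detect_text_lines, recognize_text_line, predict_structure, merge_outputs) are hypothetical placeholders passed in by the caller; they stand for the modules described above and are not the authors' actual API.

```python
def recognize_table_document(image, detect_tables, correct_geometry,
                             detect_text_lines, recognize_text_line,
                             predict_structure, merge_outputs):
    """Illustrative UTTSR-style inference pipeline (assumed interfaces)."""
    # 1. Table region detection (Cascade Faster R-CNN) and geometric correction
    table_regions = detect_tables(image)                    # cropped table images
    tables = [correct_geometry(t) for t in table_regions]   # TPS + affine rectification

    results = []
    for table in tables:
        # 2. Text line detection (Do-Conv DBNet) and recognition (attention-based CRNN)
        line_boxes = detect_text_lines(table)
        line_texts = [recognize_text_line(table, box) for box in line_boxes]

        # 3. Table structure sequence and cell box regression (TableMaster branch)
        structure_tokens, cell_boxes = predict_structure(table)

        # 4. Post-processing: match recognized text to predicted cells by box overlap
        #    and assemble the final HTML table
        results.append(merge_outputs(structure_tokens, cell_boxes, line_boxes, line_texts))
    return results
```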

3.1. Regional Detection Based on Cascade Faster R-CNN

Table recognition involves three steps, namely table detection, text recognition, and table structure recognition. In this article, we use an improved Cascade Faster R-CNN to carry out the table-detection task, as shown in Figure 2.
In Figure 2, the cascaded detection method adds two detectors with higher thresholds, H4 and H5, to the traditional three-stage cascade. The IoU thresholds of the detectors H1, H2, H3, H4, and H5 are 0.5, 0.6, 0.7, 0.8, and 0.9, respectively. Each stage of the Cascade Faster R-CNN takes the output of the previous stage as its input, which further improves detection accuracy, and the additional higher-threshold detectors filter out more accurate detections, producing more precise final results.
To preserve more features, a multi-scale fusion network is incorporated into the feature pyramid network (FPN): after feature extraction, the features from the P2, P3, P4, and P5 stages are fused. The structure of the Cascade Faster R-CNN relies on a sequence of specialized regressors:
f(x, b) = f_T \circ f_{T-1} \circ \cdots \circ f_1(x, b)
where T is the total number of cascade stages, and ∘ denotes composition of the stage regressors. Each regressor f_t is optimized on the sample distribution b^t that reaches its stage, rather than on the initial distribution b^1. The optimization starts from a set of examples (x_i, b_i), and at each successive stage the resampling shifts the distribution toward examples (x_i^t, b_i^t) with higher IoU. This keeps the number of positive samples at each stage roughly constant, even as the IoU threshold increases.
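The cascade composition above can be summarized in a few lines of code. The sketch below is an illustrative simplification under the stated assumptions (one box-regression head per stage, labels assigned with the progressively higher IoU thresholds), not the full detector used in the paper.

```python
import torch.nn as nn

class CascadeBoxRegressor(nn.Module):
    """Sketch of the cascaded regression f(x, b) = f_T ∘ … ∘ f_1(x, b)."""

    def __init__(self, stage_heads, iou_thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
        super().__init__()
        self.stages = nn.ModuleList(stage_heads)   # one box-regression head per stage
        self.iou_thresholds = iou_thresholds       # used when assigning training labels

    def forward(self, features, boxes):
        for stage in self.stages:
            # each stage consumes the refined boxes produced by the previous stage
            boxes = stage(features, boxes)
        return boxes
```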
At each stage t, the Cascade Faster R-CNN uses a classifier h_t and a regression module f_t optimized for the IoU threshold u^t, where u^t > u^{t-1}. The loss function is defined as:
L(x^t, g) = L_{cls}(h_t(x^t), y^t) + \lambda \, [y^t \ge 1] \, L_{loc}(f_t(x^t, b^t), g)
where
b^t = f_{t-1}(x^{t-1}, b^{t-1})
and
y^t = \begin{cases} g_y, & \mathrm{IoU}(x^t, g) \ge u^t \\ 0, & \text{otherwise} \end{cases}
Here, g is the ground-truth object for x^t, and λ = 1 is the coefficient balancing the two loss terms. y^t is the label of x^t, and [y^t ≥ 1] is the Iverson bracket: if y^t ≥ 1, the sample belongs to one of the target classes to be detected; otherwise, it is background.
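A minimal sketch of the per-stage loss follows. The choice of cross-entropy for the classification term and smooth L1 for the localization term is an assumption made for illustration; the localization term is applied only to foreground samples, which realizes the Iverson bracket above.

```python
import torch.nn.functional as F

def cascade_stage_loss(cls_logits, labels, box_preds, box_targets, lam=1.0):
    """Sketch of the per-stage loss L = L_cls + λ·[y ≥ 1]·L_loc defined above."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    foreground = labels >= 1                     # Iverson bracket [y^t >= 1]
    if foreground.any():
        loc_loss = F.smooth_l1_loss(box_preds[foreground], box_targets[foreground])
    else:
        loc_loss = box_preds.sum() * 0.0         # no foreground: zero loss, keep the graph
    return cls_loss + lam * loc_loss
```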
If the text in a table is not arranged horizontally after table detection, the text orientation usually needs to be corrected. This paper employs a TPS transformation with four-corner calibration, as illustrated in Figure 3; removing the warping and perspective distortion of a table yields a regular, rectified table.
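For intuition, the sketch below shows the perspective (affine-like) part of the correction using the four detected table corners; the additional TPS warp for curved tables is omitted. It is an illustrative implementation with OpenCV, not the authors' exact code; the corner ordering is an assumption.

```python
import cv2
import numpy as np

def rectify_table(image, corners):
    """Four-corner perspective rectification sketch.

    `corners`: detected table corners ordered top-left, top-right,
    bottom-right, bottom-left (assumed ordering).
    """
    corners = np.asarray(corners, dtype=np.float32)
    w = int(max(np.linalg.norm(corners[0] - corners[1]),
                np.linalg.norm(corners[3] - corners[2])))
    h = int(max(np.linalg.norm(corners[0] - corners[3]),
                np.linalg.norm(corners[1] - corners[2])))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(corners, dst)   # homography from corners to a rectangle
    return cv2.warpPerspective(image, M, (w, h))    # rectified table image
```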

3.2. DBNet Text Detection Based on Do-Conv

DBNet (Figure 4) is a text-detection method based on image segmentation. It utilizes a ResNet-based structure for feature extraction, which helps to reduce time costs and to improve prediction speed. Furthermore, the Do-Conv method is employed to accelerate the training process and to achieve faster results.
The backbone of DBNet utilizes ResNeSt50 to extract multi-scale features from the input image. After applying FPN, the size of the output feature map is reduced to a quarter of the original image. The head module is then responsible for computing the probability map, threshold map, and binary map for the text region.
The standard binarization function is problematic, as it is discontinuous, non-differentiable, and cannot be optimized for semantic segmentation training. To address this issue, a differentiable binarization approach was proposed to approximate the step function in standard binarization. The differentiable binarization technique is defined as follows:
\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}
where B̂ is the approximate binary map output, P is the predicted probability map, T is the predicted threshold map, and k is the amplification (gain) factor.
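The differentiable binarization step is a one-liner in practice. The sketch below assumes k = 50, the value used in the original DBNet paper; this article does not state the value it uses.

```python
import torch

def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binary map B = 1 / (1 + exp(-k·(P − T))), per the equation above."""
    return torch.sigmoid(k * (prob_map - thresh_map))
```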
During the training process, the probability map, threshold map, and binary map are outputted. To calculate the loss function, these three maps and their corresponding annotations are combined to create three partial loss functions. The total loss function is defined as follows:
L = L_s + \alpha L_b + \beta L_t
where L is the total loss, L_s is the probability-map loss, L_b is the binary-map loss, L_t is the threshold-map loss, and α and β are weight coefficients. Both L_s and L_b are computed with binary cross-entropy (BCE) loss, and a hard-example mining strategy is used during loss computation to balance positive and negative samples. The loss for the probability map and the binary map is defined as:
L_s = L_b = -\sum_{i \in S_l} \left[ y_i \log x_i + (1 - y_i) \log(1 - x_i) \right]
where S_l is the sampled set in which the ratio of positive to negative samples is 1:3.
L_t is computed as the sum of L1 distances between the predicted threshold map and its ground-truth label inside the dilated polygon G_d. The loss function for the threshold map is defined as follows:
L_t = \sum_{i \in R_d} \left| y_i^* - x_i^* \right|
where R_d is the set of pixel indices inside the dilated polygon G_d, and y_i^* is the label of the threshold map.
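Putting the three terms together, a minimal sketch of the combined loss is shown below. The weights α = 1 and β = 10 follow the original DBNet paper and are assumptions here, and hard-example mining (1:3 sampling) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def db_loss(prob_map, binary_map, thresh_map,
            gt_text, gt_thresh, thresh_mask, alpha=1.0, beta=10.0):
    """Sketch of L = L_s + α·L_b + β·L_t for DBNet-style training."""
    l_s = F.binary_cross_entropy(prob_map, gt_text)      # probability-map BCE
    l_b = F.binary_cross_entropy(binary_map, gt_text)    # approximate binary-map BCE
    mask = thresh_mask.float()                            # 1 inside the dilated polygon G_d
    l_t = (torch.abs(thresh_map - gt_thresh) * mask).sum() / (mask.sum() + 1e-6)
    return l_s + alpha * l_b + beta * l_t
```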

3.3. Attention-Based CRNN Text Recognition

In this paper, CRNN is used as the text recognition network in combination with an attention structure. The attention mechanism computes weights from the query and the keys, which is central to achieving better performance. The entire process is illustrated in Figure 5.
In Figure 5, the attention mechanism describes a method where weights are calculated for each value based on its similarity or correlation with the given query and keys. These weights are then used to compute a weighted sum of values, resulting in an output. Thus, the entire process maps the input query and a series of key-value pairs to an output through the use of weight coefficients.
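The weighted-sum computation described above corresponds to standard scaled dot-product attention; the sketch below is a generic illustration rather than the exact module used in the model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Weights from query–key similarity, output as the weighted sum of values.

    Shapes: (batch, seq_len, dim) for all three inputs.
    """
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)     # attention weight per key
    return torch.matmul(weights, value)     # weighted sum of the values
```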
The CTC decoder used in CRNN works on frame sequences, which can lose the overall information of a formula. To improve recognition of special characters and formulas within tables, this paper adopts an observer- and parser-based approach. The observer is a fully convolutional encoder that maps the input image to high-level features. The parser is a recurrent neural decoder that converts these high-level features into the output sequence. For each prediction, the parser’s built-in attention mechanism scans the entire input image and selects the most relevant region to describe a segmented symbol or an implicit spatial operator. This approach helps to retain the contextual information of each predicted token and to preserve the overall structure of the formula, as illustrated in Figure 6.
As demonstrated in Figure 6, when the attention mechanism is applied to the test image, it concentrates on the “P” on the left and on the area between “P” and “3” on the right. With the aid of this contextual information, the model generates the outputs “P” and “_”.

3.4. Sequence Structure Recognition Based on Transformer

After completing text detection and text recognition, the next step is to perform table structure recognition. The TableMaster model for sequence recognition consists of two parts: encoding and decoding, as illustrated in Figure 7.
Each residual block contains multiple convolutional layers and a normalization layer, which effectively reduces gradient vanishing and overfitting. After each residual block, a multi-aspect global context attention module (Multi-Aspect GCAttention, MAGCA) is applied to enhance the features without increasing the number of parameters. MAGCA is defined as:
y = x + \delta(\mathrm{MAGC}(x)), \quad \mathrm{MAGC}(x) = \mathrm{Concat}(gc_1, gc_2, \ldots, gc_h), \quad gc_i = \sum_{j=1}^{L} \alpha_j x_j, \quad \alpha = \mathrm{softmax}\!\left( \frac{w_k x_1}{\sqrt{d_h}}, \frac{w_k x_2}{\sqrt{d_h}}, \ldots, \frac{w_k x_L}{\sqrt{d_h}} \right)
where h is the number of attention heads, L is the number of spatial positions, and d_h is the per-head feature dimension.
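A minimal sketch of MAGCA following the equation above is shown below. The feature map is split into h head groups; each head computes a global-context vector as a softmax-weighted sum over spatial positions, and the concatenated context is transformed (δ) and added back to the input. The layer sizes and the 1×1 convolution chosen for δ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiAspectGCAttention(nn.Module):
    """Sketch of Multi-Aspect Global Context Attention (MAGCA)."""

    def __init__(self, channels, heads=8):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.head_dim = channels // heads
        self.key = nn.Conv2d(channels, heads, kernel_size=1)       # w_k: one logit map per head
        self.delta = nn.Conv2d(channels, channels, kernel_size=1)  # δ transform

    def forward(self, x):
        b, c, h, w = x.shape
        L = h * w
        scores = self.key(x).view(b, self.heads, L) / (self.head_dim ** 0.5)
        alpha = scores.softmax(dim=-1)                              # α over spatial positions
        feats = x.view(b, self.heads, self.head_dim, L)
        # gc_i = Σ_j α_j · x_j per head, then concatenate the heads along channels
        gc = torch.einsum('bhl,bhdl->bhd', alpha, feats).reshape(b, c, 1, 1)
        return x + self.delta(gc)                                   # y = x + δ(MAGC(x)), broadcast over H×W
```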
The TableMaster is utilized for encoding and decoding as shown in Figure 8. The encoding phase converts a picture into a sequence to extract the feature representation of the input sequence. The decoding part adds a branch based on the transformer for cell position prediction and table structure sequence prediction.
The decoder in the TableMaster model consists of N = 5 identical layers stacked on top of each other, each consisting of three sublayers. The first sublayer is a masked multi-head self-attention layer; the masking and position offsets ensure that the prediction for a position depends only on the known outputs at earlier positions. The second sublayer performs multi-head attention over the output of the encoder stack, and the third sublayer is a simple fully connected feedforward network. These sublayers are depicted in Figure 9.
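For reference, a generic sketch of one such decoder layer is shown below. The dimensions (d_model, number of heads, feed-forward width) are illustrative assumptions; the actual TableMaster decoder stacks N = 5 of these layers and splits into the structure and box-regression branches described above.

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of a decoder layer with the three sublayers described above."""

    def __init__(self, d_model=512, heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory, tgt_mask=None):
        # 1. masked self-attention: each position attends only to known (earlier) outputs
        a, _ = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask)
        tgt = self.norm1(tgt + a)
        # 2. attention over the encoder output (image feature sequence)
        a, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + a)
        # 3. position-wise feed-forward network
        return self.norm3(tgt + self.ffn(tgt))
```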

4. Experiment

4.1. Dataset

In our experiments, the ICDAR 2015, Marmot [37], SciTSR [38], and PubTabNet [19] datasets were used. ICDAR 2015 (IC15) is a commonly used dataset for text detection. It contains 1500 images in total, of which 1000 are used for training and the rest for testing; each text region is annotated by the four vertices of a quadrilateral. The Marmot dataset contains 2000 pages in PDF format, most of which come from research papers. It consists of Chinese and English pages and stores the tree structure of every document layout. SciTSR is a comprehensive dataset consisting of 15,000 tables in PDF format together with their table images, the corresponding structure labels, and the bounding box of each cell; it is split into 12,000 training and 3000 test samples. A complex-table subset, SciTSR-COMP, is also provided. The PubTabNet dataset contains over 568,000 table images (PNG) with the corresponding HTML annotation of each table; the table structure and characters are provided, but cell bounding boxes are lacking.

4.2. Experimental Settings

4.2.1. Experimental Environment

In the Cascade Faster R-CNN training, 8 Tesla V100 GPUs are used with a batch size of 4 per GPU; the initial learning rate is 0.001, and the number of epochs is 400. In the table structure training, 8 Tesla V100 GPUs are used with a batch size of 6 per GPU; the input image size is 480 × 480, the maximum sequence length is 500, and the initial learning rate is 0.001. In the text line recognition training, 8 Tesla V100 GPUs are used with a batch size of 64 per GPU; the input size is 256 × 48, and the maximum sequence length is 100. Synchronized BN and the Ranger optimizer are used, with the same hyperparameter settings as in the table structure training.
The experiments use Python 3.7, PyTorch 1.8.1, CUDA 11.1, and cuDNN 8.0.5. Model training and inference for this paper were performed on an NVIDIA RTX 3090 GPU and an Intel i9-9900K@5 GHz CPU.

4.2.2. Evaluation Index

In our experiments, we use precision, recall, F1 score, and TEDS (tree edit distance similarity) to evaluate the effectiveness of the proposed approach. TEDS measures the similarity between the predicted and ground-truth HTML tag tree structures. The formulas are as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}
where T a and T b represent tables in tree-structure HTML format. EditDist denotes the tree-edit distance, and | T | represents the number of nodes in T.
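The sketch below computes these metrics directly from the definitions. The tree-edit-distance function is passed in by the caller (e.g., an APTED-based implementation); its availability and interface are assumptions, not part of this article.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from the TP/FP/FN counts defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def teds(tree_a, tree_b, edit_dist):
    """TEDS = 1 − EditDist(Ta, Tb) / max(|Ta|, |Tb|).

    `edit_dist` is an externally supplied tree-edit-distance function;
    |T| counts the nodes of the HTML tag tree (children attribute assumed).
    """
    def node_count(t):
        return 1 + sum(node_count(c) for c in getattr(t, "children", []))
    return 1.0 - edit_dist(tree_a, tree_b) / max(node_count(tree_a), node_count(tree_b))
```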

4.3. Comparative Experiment

We evaluated the proposed model on the ICDAR 2015, Marmot, SciTSR, and PubTabNet table recognition datasets and compared it with existing table recognition models such as SPLERGE and TableNet. Table 1 presents the experimental results of the different models on these four datasets.
Table 2 shows the evaluation results of Faster RCNN, Cascade Faster R-CNN, and Cascade Faster RCNN+FPN on the SCITSR and Marmot datasets for table detection. The accuracy rate of Faster R-CNN is only 62.7%. The other two models use ResNeXt101 and ResNeXt105 as backbones, with ResNeXt105 being more accurate due to having more parameters than ResNeXt101.
The detection accuracy of the improved Cascade Faster RCNN+FPN model is 81.40%, which is a significant improvement compared to Faster R-CNN and Cascade Faster RCNN. This demonstrates that the improved FPN+Cascade-RCNN model is highly effective and robust, with better overall performance when compared to other networks.
For table structure recognition, we conducted experiments on datasets using the traditional algorithm + PDF, the layout parser method, and TableMaster sequence recognition, as shown in Table 3. The F1 score of TableMaster was found to be higher than that of the layout parser algorithm. Additionally, its accuracy was slightly better compared to the layout parser and the combined image with PDF information method, as presented in Table 3.
For text line detection, we conducted experiments using DBNet, DBNet++, and the Do-Conv-based DBNet on the ICDAR 2015 dataset. The performance of these models for text line detection is presented in Table 4.
The Do-Conv-based DBNet outperformed DBNet and DBNet++ in terms of precision and F1 score. Specifically, on the ICDAR 2015 dataset, the Do-Conv-based DBNet achieved a precision 2 percentage points higher than the original DBNet and 2.8 percentage points higher than DBNet++. In terms of recall, it was 0.5 percentage points lower than the original DBNet and 1.5 percentage points lower than DBNet++. In terms of F1 score, it was 0.6 percentage points higher than the original DBNet and 0.1 percentage points higher than DBNet++. Overall, the Do-Conv-based DBNet demonstrated better performance than the other DBNet variants.

4.4. Ablation Experiment

To further validate the effectiveness of each component of UTTSR, we conducted ablation experiments on the PubTabNet dataset, as shown in Table 5. The performance of the proposed model gradually improves as the individual methods are added and fused, reaching a high F1 score of 97.85% on table processing. This indicates that the UTTSR model achieves better overall performance on table recognition tasks.

4.5. Implementation Details

There are various types of tables, with the most common being the three-line table and the half-line table. As the complexity of table structures continues to increase, processing and analyzing table information also become more challenging. The dataset sample is shown in Figure 10.
Figure 11 shows an example of sequence prediction and box regression; the model correctly predicts both the structure sequence and the cell bounding-box coordinates.
The position annotations in the dataset are at the cell level, so the text images cropped according to these annotations contain both single-line and multi-line text images, as shown in Figure 12.

5. Conclusions

This paper focuses on table recognition with complex structures and presents the UTTSR network. The proposed network is designed to identify complex table structures and cell contents from unstructured text by dividing the model into four parts: table region detection, text line detection, text line recognition, and table sequence recognition. Compared with other methods, our approach performs considerably better and can solve the problem of table structure recognition in most scenarios. Various table recognition methods have been developed previously, including those based on image segmentation and end-to-end deep learning, which have achieved varying degrees of success. However, these methods still face limitations, such as an inability to detect complex table structures and to handle the recovery of spanning cells. To overcome these challenges, the UTTSR network draws inspiration from the transformer model and integrates multiple modalities, including the visual features of images, the positional relationships of text, and the semantic features of text. By doing so, it significantly enhances the accuracy and robustness of complex table structure recognition.
In the future, as the volume of data in various industries continues to increase, the demand for data-driven analysis of tabular information will also grow. Therefore, we need to further explore optimization methods for data analysis to solve the identification problems of various table structures and data types that arise in different industries and fields. At the same time, we also need to combine multiple technical means such as document technology, information extraction, and complex layout recognition methods to further enhance the processing and analysis capabilities of tabular information, providing a guarantee to better achieve intelligence and digitization.

Author Contributions

Conceptualization, M.L. and L.Z.; methodology, L.Z.; software, L.Z.; validation, M.L., L.Z. and D.H.; formal analysis, M.Z.; investigation, M.L.; resources, D.H.; data curation, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, L.Z.; visualization, M.L.; supervision, M.Z.; project administration, D.H.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Shandong Provincial Natural Science Foundation (ZR2022QG016) and by the Pilot Project for Integrated Innovation of Science, Education, and Industry of Qilu University of Technology (Shandong Academy of Sciences) (2022JBZ01-01).

Data Availability Statement

Datasets can be accessed upon request to the corresponding author.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their insightful comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kashinath, T.; Jain, T.; Agrawal, Y.; Anand, T.; Singh, S. End-to-end table structure recognition and extraction in heterogeneous documents. Appl. Soft Comput. 2022, 123, 108942. [Google Scholar] [CrossRef]
  2. Prasad, D.; Gadpal, A.; Kapadni, K.; Visave, M.; Sultanpure, K. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 572–573. [Google Scholar]
  3. Watanabe, T.; Luo, Q.; Sugie, N. Structure recognition methods for various types of documents. Mach. Vis. Appl. 1993, 6, 163–176. [Google Scholar] [CrossRef]
  4. Hirayama, Y. A method for table structure analysis using DP matching. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 2, pp. 583–586. [Google Scholar]
  5. Ramel, J.Y.; Crucianu, M.; Vincent, N.; Faure, C. Detection, extraction and representation of tables. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, UK, 6 August 2003; pp. 374–378. [Google Scholar]
  6. Watanabe, T.; Luo, Q.; Sugie, N. Toward a practical document understanding of table-form documents: Its framework and knowledge representation. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR’93), Tsukuba, Japan, 20–22 October 1993; pp. 510–515. [Google Scholar]
  7. Schreiber, S.; Agne, S.; Wolf, I.; Dengel, A.; Ahmed, S. Deepdesrt: Deep learning for detection and structure recognition of tables in document images. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1162–1167. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Cambridge, MA, USA, 7–12 December 2015; Volume 28. [Google Scholar]
  9. Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345. [Google Scholar]
  10. Zhang, P.; Li, C.; Qiao, L.; Cheng, Z.; Pu, S.; Niu, Y.; Wu, F. VSR: A unified framework for document layout analysis combining vision, semantics and relations. In Document Analysis and Recognition–ICDAR 2021, Proceedings of the 16th International Conference, Lausanne, Switzerland, 5–10 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; Part I, pp. 115–130. [Google Scholar]
  11. Rahgozar, M.A.; Fan, Z.; Rainero, E.V. Tabular document recognition. In Document Recognition Proceedings of the 1994 International Symposium on Electronic Imaging: Science and Technology, San Jose, CA, USA, 6–10 February 1994; SPIE: Bellingham, WA, USA, 1994; Volume 2181, pp. 87–96. [Google Scholar]
  12. Khan, S.A.; Khalid, S.M.D.; Shahzad, M.A.; Shafait, F. Table structure extraction with bi-directional gated recurrent unit networks. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1366–1371. [Google Scholar]
  13. Xue, W.; Yu, B.; Wang, W.; Tao, D.; Li, Q. Tgrnet: A table graph reconstruction network for table structure recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1295–1304. [Google Scholar]
  14. Ye, J.; Qi, X.; He, Y.; Chen, Y.; Gu, D.; Gao, P.; Xiao, R. PingAn-VCGroup’s solution for ICDAR 2021 competition on scientific literature parsing task B: Table recognition to HTML. arXiv 2021, arXiv:2105.01848. [Google Scholar]
  15. Fischer, P.; Smajic, A.; Abrami, G.; Mehler, A. Multi-Type-TD-TSR–Extracting Tables from Document Images Using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations. In KI 2021: Advances in Artificial Intelligence, Proceedings of the 44th German Conference on AI, Virtual Event, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 95–108. [Google Scholar]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  17. Paliwal, S.S.; Vishwanath, D.; Rahul, R.; Sharma, M.; Vig, L. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 128–133. [Google Scholar]
  18. Siddiqui, S.A.; Fateh, I.A.; Rizvi, S.T.R.; Dengel, A.; Ahmed, S. Deeptabstr: Deep learning based table structure recognition. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1403–1409. [Google Scholar]
  19. Zhong, X.; ShafieiBavani, E.; Jimeno Yepes, A. Image-based table recognition: Data, model, and evaluation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; Part XXI, pp. 564–580. [Google Scholar]
  20. Qiao, L.; Li, Z.; Cheng, Z.; Zhang, P.; Pu, S.; Niu, Y.; Ren, W.; Tan, W.; Wu, F. Lgpma: Complicated table structure recognition with local and global pyramid mask alignment. In Document Analysis and Recognition–ICDAR 2021, Proceedings of the 16th International Conference, Lausanne, Switzerland, 5–10 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; Part I, pp. 99–114. [Google Scholar]
  21. Long, R.; Wang, W.; Xue, N.; Gao, F.; Yang, Z.; Wang, Y.; Xia, G.S. Parsing table structures in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 944–952. [Google Scholar]
  22. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  23. Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M.; Li, Z. Tablebank: A benchmark dataset for table detection and recognition. arXiv 2019, arXiv:1903.01949. [Google Scholar]
  24. Lu, N.; Yu, W.; Qi, X.; Chen, Y.; Gong, P.; Xiao, R.; Bai, X. Master: Multi-aspect non-local network for scene text recognition. Pattern Recognit. 2021, 117, 107980. [Google Scholar] [CrossRef]
  25. Li, Y.; Huang, Z.; Yan, J.; Zhou, Y.; Ye, F.; Liu, X. GFTE: Graph-based financial table extraction. In Proceedings of the Pattern Recognition, ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Springer: Berlin/Heidelberg, Germany, 2021. Part II. pp. 644–658. [Google Scholar]
  26. Yang, Q.; Cao, Y.; Li, H.; Luo, P. Numerical Formula Recognition from Tables. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1986–1996. [Google Scholar]
  27. Wu, X.; Zhang, J.; Li, H. Text-to-table: A new way of information extraction. arXiv 2021, arXiv:2109.02707. [Google Scholar]
  28. Ly, N.T.; Takasu, A. An End-to-End Multi-Task Learning Model for Image-based Table Recognition. arXiv 2023, arXiv:2303.08648. [Google Scholar]
  29. Smock, B.; Pesala, R.; Abraham, R. Aligning benchmark datasets for table structure recognition. arXiv 2023, arXiv:2303.00716. [Google Scholar]
  30. Wang, H.; Xue, Y.; Zhang, J.; Jin, L. Scene table structure recognition with segmentation collaboration and alignment. Pattern Recognit. Lett. 2023, 165, 146–153. [Google Scholar] [CrossRef]
  31. Huang, Y.; Lu, N.; Chen, D.; Li, Y.; Xie, Z.; Zhu, S.; Gao, L.; Peng, W. Improving Table Structure Recognition with Visual-Alignment Sequential Coordinate Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 11134–11143. [Google Scholar]
  32. Kazdar, T.; Jmal, M.; Souidene, W.; Attia, R. Table Recognition in Scanned Documents. In Computational Collective Intelligence, Proceedings of the 14th International Conference, ICCCI 2022, Hammamet, Tunisia, 28–30 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 744–754. [Google Scholar]
  33. Zhou, M.; Ramnath, R. A Structure-Focused Deep Learning Approach for Table Recognition from Document Images. In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; pp. 593–601. [Google Scholar]
  34. Lee, E.; Park, J.; Koo, H.I.; Cho, N.I. Deep-learning and graph-based approach to table structure recognition. Multimed. Tools Appl. 2022, 81, 5827–5848. [Google Scholar] [CrossRef]
  35. Nassar, A.; Livathinos, N.; Lysak, M.; Staar, P. Tableformer: Table structure understanding with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4614–4623. [Google Scholar]
  36. Smock, B.; Pesala, R.; Abraham, R. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4634–4642. [Google Scholar]
  37. Fang, J.; Tao, X.; Tang, Z.; Qiu, R.; Liu, Y. Dataset, ground-truth and performance metrics for table detection evaluation. In Proceedings of the 2012 10th IAPR International Workshop on Document Analysis Systems, Gold Coast, Australia, 27–29 March 2012; pp. 445–449. [Google Scholar]
  38. Chi, Z.; Huang, H.; Xu, H.D.; Yu, H.; Yin, W.; Mao, X.L. Complicated table structure recognition. arXiv 2019, arXiv:1908.04729. [Google Scholar]
Figure 1. Overall structure of the proposed table recognition model.
Figure 2. Improved Cascade R-CNN with five detection heads.
Figure 3. TPS transformation and affine transformation.
Figure 4. DBNet based on Do-Conv.
Figure 5. Attention mechanism flowchart.
Figure 6. Attention visualization on a test image.
Figure 7. Architecture of the TableMaster model for sequence recognition.
Figure 8. TableMaster architecture.
Figure 9. Transformer layer internal structure.
Figure 10. (a,b) Marmot; (c,d) SciTSR; (e,f) PubTabNet dataset.
Figure 11. Example of table structure prediction. Predicted bounding boxes are marked in green.
Figure 12. Examples of text line images cropped from the training data of the PubTabNet dataset: (a) single-line text image; (b) multi-line text image.
Table 1. Performance comparison of UTTSR, SPLERGE, and TableNet on multiple datasets.

Dataset    | Method       | Precision (%) | Recall (%) | F1 (%) | TEDS (%)
ICDAR2015  | SPLERGE      | 99.2          | 98.8       | 98.99  | -
ICDAR2015  | UTTSR (ours) | 99.5          | 99.4       | 99.45  | -
ICDAR2015  | TableNet     | 97.4          | 96.2       | 96.80  | -
Marmot     | SPLERGE      | 98.8          | 99.0       | 98.90  | -
Marmot     | UTTSR (ours) | 99.2          | 98.7       | 98.94  | -
Marmot     | TableNet     | 93.1          | 90.1       | 91.58  | -
PubTabNet  | SPLERGE      | -             | -          | -      | 97.1
PubTabNet  | UTTSR (ours) | -             | -          | -      | 98.5
PubTabNet  | TableNet     | -             | -          | -      | 96.7
SciTSR     | SPLERGE      | 85.4          | 82.3       | 83.82  | -
SciTSR     | UTTSR (ours) | 95.2          | 94.8       | 95.00  | -
SciTSR     | TableNet     | 92.2          | 89.8       | 90.98  | -
Table 2. Three network model comparisons of the table-detection method.

Method                  | Backbone   | Precision (%)
Faster RCNN             | VGG16      | 62.70
Cascade Faster RCNN     | ResNet101  | 79.20
Cascade Faster RCNN     | ResNeXt105 | 80.10
Cascade Faster RCNN+FPN | ResNet101  | 80.50
Cascade Faster RCNN+FPN | ResNeXt105 | 81.40
Table 3. Table structure detection method comparison.

Method              | Precision (%) | Recall (%) | F1 (%)
Image Algorithm+PDF | 90.00         | 30.36      | 72.0
Layout Parser       | 95.25         | 84.16      | 89.0
TableMaster         | 96.95         | 93.07      | 95.0
Table 4. Test results of the text line detection methods on the ICDAR 2015 dataset.

Method          | Precision (%) | Recall (%) | F1 (%)
DBNet           | 88.60         | 76.50      | 82.10
DBNet++         | 87.80         | 77.50      | 82.60
DBNet (Do-Conv) | 90.60         | 76.00      | 82.70
Table 5. The ablation experiments. TD: table detection; TSR: table structure recognition; TLD: text line detection; TLR: text line recognition.

TD (Cascade Faster R-CNN) | TSR (Transformer) | TLD (DBNet (Do-Conv)) | TLR (CRNN) | F1 (%)
✓                         | ×                 | ×                     | ×          | 82.45
✓                         | ✓                 | ×                     | ×          | 86.42
✓                         | ✓                 | ✓                     | ×          | 92.31
✓                         | ✓                 | ✓                     | ✓          | 97.85
