# A Proposal of Printed Table Digitization Algorithm with Image Processing

^{1}

^{2}

^{3}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Related Works in the Literature

## 3. Adopted Technologies

#### 3.1. OpenCV

- Recognize every intersection of the horizontal line and the vertical line to find the location of the cells in the table.
- Divide the table into a collection of rectangles using the intersections.
- Arrange the rectangles from left to right and top to bottom by recognizing the boundary of the horizontal and vertical lines.
- Recognized rectangles sometimes are too large or too small. The rectangle will have multiple cells if it is too large, and it will have none if it is too small. In order to prevent both of these situations, we restrict the rectangle’s size range.

#### 3.2. Tesseract

## 4. Target Table Layout

#### 4.1. Table Layout

- The table basically has a grid structure of multiple rows and multiple columns.
- Two adjacent horizontal lines make one row, and two adjacent vertical lines make one column in the table.
- All the vertical lines connect the top horizontal line and the bottom horizontal line.
- Some horizontal lines including the top one and the bottom one connect the leftmost vertical line and the rightmost vertical line.
- Other horizontal lines connect the intermediate vertical line and the rightmost vertical line.
- A cell represents the rectangle area surrounded by two adjacent horizontal lines and two adjacent vertical lines.
- The number of cells in every row is equal.
- The number of cells in one column may be different by the column.
- A cell has one or multiple cells at the right neighbor and has one cell at the top/ bottom neighbors.

#### 4.2. Example Table

#### 4.3. Topology Information

## 5. Proposed Algorithm

#### 5.1. Symbols in Algorithm

- I: the scanned color image of the paper.
- ${I}_{bin}$: reversed binary image of I.
- ${k}_{ver}$: the definition of a vertical kernel for detecting all vertical lines in an image.
- ${k}_{hor}$: the definition of a horizontal kernel for detecting all horizontal lines in an image.
- ${I}_{ver}$: the binary image with only vertical lines of I.
- ${L}_{ver}$: the vertical lines of ${I}_{ver}$.
- ${I}_{hor}$: the binary image with only horizontal lines of I.
- ${L}_{hor}$: the horizontal line of $Ihor$.
- ${I}_{vh}$: combine ${L}_{ver}$ and ${L}_{hor}$.
- ${L}_{cnts}$: the contours of ${I}_{vh}$.
- ${l}_{i}$: each contour of ${L}_{cnts}$.
- ${r}_{i}$: the rectangles of ${l}_{i}$.
- R: the array of ${r}_{i}$.
- ${L}_{cnts1}$: the sorting result of ${L}_{cnts}$.
- ${R}_{1}$: the R after sorting.
- ${R}_{2}$: the array of rectangular which only contains a single cell.
- ${c}_{i}$: each Cell class.
- C: the array of ${c}_{i}$
- ${x}_{i}$: the horizontal coordinate of ${c}_{i}$.
- ${y}_{i}$: the vertical coordinates of ${c}_{i}$.
- ${w}_{i}$: the width of ${c}_{i}$.
- ${h}_{i}$: the height of ${c}_{i}$.
- $nam{e}_{i}$: the name of ${c}_{i}$ that uses pytesseract to recognize.
- $row$: the number of rows in I.
- $Root$: the root node of the cardinal tree.
- $ro{w}_{ci}$: the ${c}_{i}$ of each column, for example, when $row=1$, we have $row{1}_{ci}$, when $row=2$, we have $row{2}_{ci}$
- $ro{w}_{ci+1}$: the ${c}_{i}$ of each column, for example, $row{1}_{ci+1}$ is the next class of $row{1}_{ci}$.
- $Tree$: the structure of I’s table.
- E: the output Excel table.

#### 5.2. Procedure in Algorithm

- (1)
- Make the color image I of each table by scanning the paper.
- (2)
- Transform the color image I into the black-and-white binary image ${I}_{bin}$ by applying the thresholds function in OpenCV to I.
- (3)
- Reference [9], obtain the vertical kernel ${k}_{ver}$ and the horizontal kernel ${k}_{hor}$ by applying the getStructuringElement function in OpenCV to ${I}_{bin}$.
- (4)
- (5)
- Extract the vertical lines ${L}_{ver}$ by applying the dilate function in OpenCV to ${I}_{ver}$.
- (6)
- (7)
- Extract the horizontal lines ${L}_{hor}$ by applying the dilate function in OpenCV to ${I}_{hor}$.
- (8)
- Make the binary image ${I}_{vh}$ that represents the table frame by combining the vertical lines ${L}_{ver}$ and the horizontal lines ${L}_{hor}$ into one image, applying the addWeighted function in OpenCV to ${L}_{ver}$ and ${L}_{hor}$. Figure 7 shows the table frame image.
- (9)
- Obtain the contour information of each cell in the table and store them in the array ${L}_{cnts}$ by applying the findContours function in OpenCV to ${I}_{vh}$.
- (10)
- Find the rectangular ${r}_{i}$ for each contour ${l}_{i}$ of ${L}_{cnts}$ by applying the boundingRect function in OpenCV to every contour of ${L}_{cnts}$, and make the array R of ${r}_{i}$.
- (11)
- Sort ${r}_{i}$ in R from the leftmost one to the rightmost one according to the coordinates of ${r}_{i}$ by applying the zip method and the sort method in Python to R, and make the array ${L}_{cnts1}$ and the array ${R}_{1}$.
- (12)
- Remove all excessively large or small rectangles in ${R}_{1}$, to avoid rectangles that contain multiple cells or have no cells. Produce the ${R}_{2}$ with every rectangle containing a cell.
- (13)
- Create Cell. Create class ${c}_{i}$ for each rectangle in ${R}_{2}$, collect ${x}_{i}$, ${y}_{i}$, and ${w}_{i}$ of each rectangle in ${R}_{2}$, and put them into ${c}_{i}$ as member variables.
- (14)
- Obtain the character recognition result $nam{e}_{i}$ by applying the imagetostring function in pytesseract to each cell.
- (15)
- Sort the ${c}_{i}$, in which the ${x}_{i}$ and ${y}_{i}$ of each ${c}_{i}$ are arranged from left to right and from top to bottom. Arrange ${c}_{i}$ with similar ${x}_{i}$ into a column in the order of ${y}_{i}$ from small to large to output $ro{w}_{ci}$
- (16)
- Create the multi-tree class $Tree$ to describe the structure of the table with the root node $Root$. The child nodes of this root node are all ${c}_{i}$ with a row equal to 1.
- (17)
- Confirm the child node of $row{2}_{ci}$, if $row{1}_{ci}$’s ${h}_{i}$ is approximately equal to $row{2}_{ci}$’s ${h}_{i}$, the child node of $row{1}_{ci}$ is $row{2}_{ci}$. Otherwise, let $row{2}_{ci}$’s ${h}_{i}$ plus $row{2}_{ci+1}$’s ${h}_{i+1}$, and make the sum of $row{2}_{ci}$’s ${h}_{i}$ and $row{2}_{ci+1}$’s ${h}_{i+1}$ compare with $row{1}_{ci}$’s ${h}_{i}$. If the two are approximately equal, the child node of $row{1}_{ci}$ is $row{2}_{ci}$ and $row{2}_{ci+1}$. If the two are not approximately equal, it can be done again in the above way.
- (18)
- Follow the previous step to get the child node of $row{2}_{ci}$, the child node of $row{2}_{ci}$, up to the last column, and obtain the tree $Tree$ composed of ${c}_{i}$, which represents the structure of the table cell structure. Figure 8 shows the tree.
- (19)
- Preorder the $Tree$ and put each ${c}_{i}$’s $nam{e}_{i}$ in the leaf node’s $nam{e}_{i}$ in the order from the root node to the leaf node. Each leaf node’s $nam{e}_{i}$ contains one row of the table, and then output each leaf node’s $nam{e}_{i}$ in order to get the list of table information.
- (20)
- Convert the list to the Excel table E using openpyxl package and output it. Figure 9 shows this Excel file.
- (21)
- Merge the cells corresponding to the same node in the tree in step (18) into one cell in the Excel table E using the openpyxl package, and output it. Figure 10 shows the final output Excel file.

## 6. Evaluations

#### 6.1. Accuracy Index

#### 6.2. Character Recognition Accuracy Results

#### 6.2.1. Table Image at 2

#### 6.2.2. Reasons for Table Image at 5

#### 6.2.3. Reasons for Table Image at 10

#### 6.2.4. Reasons for Table Image at 16

#### 6.2.5. Reasons for Table Image at 18

#### 6.3. Conclusion of This Section

- When a cell size is small, it often recognizes the border as a character.
- It is difficult to correctly recognize vertical texts, subscripts, and Greek letters.
- It is more difficult to recognize a scanned image than a screenshot because it is usually blurry and has more noise.

#### 6.4. Comparison with Microsoft Software

#### 6.5. Comparison with Nanonets Software

## 7. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

- What is Digital Transformation (DX)? Available online: https://www.netapp.com/devops-solutions/what-is-digital-transformation/ (accessed on 2 July 2022).
- Ohta, M.; Yamada, R.; Kanazawa, T.; Takasu, A. A cell-detection-based table-structure recognition method. In Proceedings of the ACM Symposium on Document Engineering 2019, Berlin, Germany, 23–26 September 2019. [Google Scholar]
- Phan, H.H.; Dai Duong, N. An Integrated Approach for Table Detection and Structure Recognition. J. Inf. Technol. Commun.
**2021**, 1, 41–50. [Google Scholar] [CrossRef] - Sahoo, R.; Kathale, C.; Kubal, M.; Malik, S. Auto-Table-Extract: A System To Identify And Extract Tables From PDF To Excel. Int. J. Sci. Technol. Res.
**2020**, 9, 217. [Google Scholar] - Amitha, S.; Mithun, M.; Chandana, P.C.; Borkakoty, M.; Adithya, U. Conversion of Image To Excel Using Ocr Technique. Int. Res. J. Mod. Eng. Technol. Sci.
**2020**, 4, 3743–3747. [Google Scholar] - Nagy, G.; Seth, S. Table headers: An entrance to the data mine. In Proceedings of the 2016 IEEE 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016. [Google Scholar]
- Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M.; Li, Z. Tablebank: A benchmark dataset for table detection and recognition. arXiv
**2019**, arXiv:1903.01949. [Google Scholar] - Pulli, K.; Baksheev, A.; Kornyakov, K.; Eruhimov, V. Realtime Computer Vision with OpenCV. Queue
**2012**, 10, 40–56. [Google Scholar] [CrossRef] - A Table Detection, Cell Recognition and Text Extraction Algorithm to Convert Tables in Images to Excel Files. Available online: https://towardsdatascience.com/ (accessed on 10 July 2022).
- How to Convert a Table of Paper Data to Excel Data? We Can Scan Tables by Use Office Application in Smartphone. Available online: https://dekiru.net/article/21950/ (accessed on 10 September 2022).
- Scan and Edit a Document. Available online: https://support.microsoft.com/en-us/office/scan-and-edit-a-document-7a07a4bd-aca5-4ec5-ba73-4589ac8b9eed (accessed on 10 September 2022).
- Table Detection, Table Extraction & Information Extraction Using DL. Available online: https://nanonets.com/blog/table-extraction-deep-learning/ (accessed on 10 September 2022).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Shi, C.; Funabiki, N.; Huo, Y.; Mentari, M.; Suga, K.; Toshida, T.
A Proposal of Printed Table Digitization Algorithm with Image Processing. *Algorithms* **2022**, *15*, 471.
https://doi.org/10.3390/a15120471

**AMA Style**

Shi C, Funabiki N, Huo Y, Mentari M, Suga K, Toshida T.
A Proposal of Printed Table Digitization Algorithm with Image Processing. *Algorithms*. 2022; 15(12):471.
https://doi.org/10.3390/a15120471

**Chicago/Turabian Style**

Shi, Chenrui, Nobuo Funabiki, Yuanzhi Huo, Mustika Mentari, Kohei Suga, and Takashi Toshida.
2022. "A Proposal of Printed Table Digitization Algorithm with Image Processing" *Algorithms* 15, no. 12: 471.
https://doi.org/10.3390/a15120471