TableStructureFormer: an improved masked-attention mask transformer model with long-distance feature aggregation and deep detail supervision for table structure recognition


Bibliographic Details
Main Authors: Chenglong Yu, Weibin Li, Zixuan Zhu, Wei Li, Jianchao Du, Shiwei Zhang
Format: Article
Language: English
Published: Springer 2025-06-01
Series: Complex & Intelligent Systems
Online Access: https://doi.org/10.1007/s40747-025-01975-w
Summary: Table structure recognition, the most important subtask in image-based table recognition, identifies the logical relationships between adjacent cells and represents the structured essence of the table. In reality, however, tables appear in many different styles, such as those containing many blank cells, large row–column spans, missing separator lines, or even no separator lines at all. These different styles pose great challenges for table structure recognition. In response, we carefully analyzed the structural characteristics of tables and concluded that rows and columns are their inherent building blocks. However, row–column annotations are rarely found in existing publicly available datasets, and cells with row–column spans are often ignored. We therefore propose an innovative table structure annotation scheme in which the annotated objects include rows, columns, and cells with row–column spans. Furthermore, we release a challenging dataset named Row–Column Segmentation for Table Structure Recognition (RCSTSR), which contains more than 12,000 table images of different styles, each annotated with the corresponding masks. On the basis of this dataset, we then construct an effective semantic-segmentation-based solution for table structure recognition. It consists of two main parts: an improved masked-attention mask transformer model, named TableStructureFormer, and corresponding postprocessing. The former predicts the masks of objects in the table image, and the latter generates the table structure from the predicted masks.
Considering that long-distance feature maps are more useful than local feature maps for row–column segmentation, we propose a dual-path adaptive weighted attention module that aggregates multilevel long-distance feature maps and adaptively selects more informative inputs by introducing enhanced strip pooling and learnable weighting parameters, thereby improving segmentation performance. In addition, to address the difficulty of perfectly segmenting the details of rows and columns, we propose a deep detail supervision module that guides the segmentation model to learn detailed feature maps of the objects, thereby further refining their masks. The experimental results show that for row and column segmentation, the mean intersection over union (mIoU) values of TableStructureFormer are 92.15% and 92.46%, respectively, significantly higher than those of existing segmentation models. For table structure recognition, it also accurately generates table structures from the segmentation results. The tree edit distance similarity for table structure (TEDS-Struct), precision, and recall are 91.42%, 90.96%, and 87.73%, respectively, indicating excellent understanding of table logic. At position-matching thresholds of 0.50 and 0.75, the average precision (AP) is 91.50% and 86.32%, respectively, indicating excellent spatial perception. In summary, the proposed TableStructureFormer accurately and robustly performs row–column segmentation and table structure recognition, even for tables of widely varying styles.
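The strip pooling mentioned in the summary refers to averaging features along entire rows and entire columns, so that evidence for a row or column separator is propagated across the whole image, even where the separator line is partially missing. The paper's enhanced variant and its learnable weights are not specified in this record; the sketch below is a minimal NumPy illustration of plain strip pooling on a single-channel feature map, with the learnable weighting parameters simplified to fixed scalars `w_h` and `w_v` (hypothetical names).

```python
import numpy as np

def strip_pool(feat, w_h=0.5, w_v=0.5):
    """Aggregate long-range row/column context from a 2-D feature map.

    feat : (H, W) array, one channel of a feature map.
    w_h, w_v : fixed scalar weights standing in for the learnable
               weighting parameters described in the abstract (assumed).
    """
    # Horizontal strip: average each row over the full width.
    row_ctx = feat.mean(axis=1, keepdims=True)   # shape (H, 1)
    # Vertical strip: average each column over the full height.
    col_ctx = feat.mean(axis=0, keepdims=True)   # shape (1, W)
    # Broadcast both strips back to (H, W) and fuse them.
    return w_h * row_ctx + w_v * col_ctx

# Partial evidence for a horizontal separator in row 1 spreads its
# response across the entire row, including where the line is absent.
fmap = np.zeros((4, 6))
fmap[1, :3] = 1.0           # separator visible only in the left half
ctx = strip_pool(fmap)      # row 1 is now activated across all columns
```

In a full model this pooling would be applied per channel inside an attention module over multilevel backbone features; the sketch only shows why whole-row/whole-column averaging helps with missing separator lines.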
ISSN: 2199-4536
2198-6053