End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques

Bibliographic Details
Main Authors: Jimin Park, Wanhyuk Seo, Tae Sup Yun
Format: Article
Language: English
Published: Elsevier 2025-10-01
Series: Developments in the Built Environment
Online Access: http://www.sciencedirect.com/science/article/pii/S2666165925001334
Description
Summary: Geotechnical investigation reports are generated for infrastructure projects prior to construction, typically in inconsistent, unstructured formats in which engineering properties are scattered across figures, tables, and other subcomponents. Extracting specific information therefore requires manual review, which is time-consuming and prone to human error. This study proposes an automated framework that converts unstructured geotechnical reports into structured digital databases by combining artificial intelligence, text mining techniques, and rule-based algorithms. The framework begins with page classification using a hybrid approach that combines a convolutional neural network with a text mining algorithm, followed by page layout analysis to identify components such as titles, text, tables, and figures. Based on the layout, systematic rule-based data extraction generates structured databases, which makes the data more flexible to use and supports further applications in practice. The proposed framework extracts data from the test set within seconds and without errors. It can be extended to other unstructured engineering documents, enhancing data-driven processes in construction projects.
ISSN: 2666-1659
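
The summary above describes a hybrid page classifier followed by rule-based extraction. The Python sketch below is only an illustration of that general idea, not the authors' implementation: the class labels, keyword lists, the equal weighting of image and text scores, and the "Depth ... m N-value ..." row format are all assumptions made for the example, and the CNN probabilities are taken as precomputed inputs rather than produced by a real model.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists per page class (not taken from the paper).
KEYWORDS = {
    "boring_log": ("boring log", "spt", "n-value", "depth"),
    "lab_test": ("unit weight", "water content", "liquid limit"),
}


@dataclass
class Page:
    text: str        # text recovered from the page
    cnn_probs: dict  # class -> probability from an image classifier (assumed precomputed)


def keyword_scores(text):
    """Return, per class, the fraction of its keywords found in the page text."""
    lowered = text.lower()
    return {
        label: sum(kw in lowered for kw in kws) / len(kws)
        for label, kws in KEYWORDS.items()
    }


def classify_page(page, weight=0.5):
    """Blend image and text evidence; `weight` is an illustrative assumption."""
    text_scores = keyword_scores(page.text)
    combined = {
        label: weight * page.cnn_probs.get(label, 0.0)
        + (1 - weight) * text_scores[label]
        for label in KEYWORDS
    }
    return max(combined, key=combined.get)


# Rule for rows such as "Depth 3.0 m  N-value 12" (an assumed layout).
ROW = re.compile(r"depth\s+(?P<depth>\d+(?:\.\d+)?)\s*m\D+?(?P<n>\d+)", re.IGNORECASE)


def extract_spt_rows(text):
    """Turn matching rows into structured records."""
    return [
        {"depth_m": float(m["depth"]), "n_value": int(m["n"])}
        for m in ROW.finditer(text)
    ]


page = Page(
    text="Boring Log BH-1\nDepth 3.0 m  N-value 12\nDepth 4.5 m  N-value 18",
    cnn_probs={"boring_log": 0.45, "lab_test": 0.55},
)
label = classify_page(page)
rows = extract_spt_rows(page.text) if label == "boring_log" else []
```

In this example the image score alone slightly favors the wrong class, and the keyword evidence tips the decision, which is the kind of complementarity a hybrid classifier is meant to exploit.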