End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques

Geotechnical investigation reports have been generated for infrastructure projects prior to construction, typically in inconsistent and unstructured formats containing engineering properties in figures, tables, and other subcomponents. However, extracting specific information requires manual review,...

Bibliographic Details
Main Authors: Jimin Park, Wanhyuk Seo, Tae Sup Yun
Format: Article
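Language: English
Published: Elsevier 2025-10-01
Series: Developments in the Built Environment
Subjects: Data extraction; Geotechnical investigation report; Convolutional neural network; Text mining technique; Data management; Digitization
Online Access: http://www.sciencedirect.com/science/article/pii/S2666165925001334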
author Jimin Park
Wanhyuk Seo
Tae Sup Yun
collection DOAJ
description Geotechnical investigation reports are generated for infrastructure projects prior to construction, typically in inconsistent and unstructured formats containing engineering properties in figures, tables, and other subcomponents. However, extracting specific information requires manual review, which is time-consuming and prone to human error. This study proposes an automated framework that converts unstructured geotechnical reports into structured digital databases, leveraging artificial intelligence, text mining techniques, and rule-based algorithms. The framework begins with page classification using a hybrid approach combining a convolutional neural network and a text mining algorithm, followed by page layout analysis to identify components such as title, text, table, and figure. Based on the layout, systematic rule-based data extraction generates structured databases, which enhances data flexibility and supports further applications in practice. The proposed framework efficiently extracts data from the test set within seconds without errors. It can be extended to other unstructured engineering documents, enhancing data-driven processes in construction projects.
format Article
id doaj-art-3fdf5180a7c6495b9ddf2804801b2d8d
institution Kabale University
issn 2666-1659
language English
publishDate 2025-10-01
publisher Elsevier
series Developments in the Built Environment
spelling doaj-art-3fdf5180a7c6495b9ddf2804801b2d8d
Record timestamp: 2025-08-20T03:40:24Z
Language: eng
Publisher: Elsevier
Journal: Developments in the Built Environment (ISSN 2666-1659)
Published: 2025-10-01, volume 23, article 100733
DOI: 10.1016/j.dibe.2025.100733
Authors: Jimin Park, Wanhyuk Seo, Tae Sup Yun (corresponding author)
Affiliation (all authors): School of Civil and Environmental Engineering, Yonsei University, Seoul, 03722, Republic of Korea
title End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques
topic Data extraction
Geotechnical investigation report
Convolutional neural network
Text mining technique
Data management
Digitization
url http://www.sciencedirect.com/science/article/pii/S2666165925001334
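
The description field outlines a pipeline of hybrid page classification followed by rule-based extraction. The sketch below is only an illustration of that general idea, not the authors' implementation: the keyword lists, the blending weight, the row pattern, and every name (`PAGE_KEYWORDS`, `classify_page`, `extract_spt_rows`, `SptRecord`) are hypothetical, and the image-model probabilities are passed in as plain numbers standing in for a convolutional neural network's output. Page layout analysis is omitted.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword lists per page type (assumption, not from the article).
PAGE_KEYWORDS = {
    "boring_log": ("boring log", "spt", "n-value", "depth"),
    "lab_test": ("water content", "unit weight", "liquid limit"),
}


def keyword_scores(text):
    """Return, per page type, the fraction of its keywords found in the text."""
    lowered = text.lower()
    return {
        label: sum(keyword in lowered for keyword in keywords) / len(keywords)
        for label, keywords in PAGE_KEYWORDS.items()
    }


def classify_page(image_probs, text, weight=0.5):
    """Blend image-model probabilities with keyword scores and pick the best label.

    image_probs stands in for the output of an image classifier such as a CNN.
    """
    text_scores = keyword_scores(text)
    blended = {
        label: weight * image_probs.get(label, 0.0) + (1 - weight) * text_scores[label]
        for label in PAGE_KEYWORDS
    }
    return max(blended, key=blended.get)


# Hypothetical row format: "<depth> m <blows>/<penetration>", one row per line.
ROW_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*m\s+(\d+)\s*/\s*(\d+)\s*$", re.MULTILINE
)


@dataclass
class SptRecord:
    depth_m: float
    blows: int
    penetration_cm: int


def extract_spt_rows(text):
    """Rule-based extraction: turn matching text rows into structured records."""
    return [
        SptRecord(float(depth), int(blows), int(penetration))
        for depth, blows, penetration in ROW_PATTERN.findall(text)
    ]
```

Blending the two scores lets the text signal override an uncertain image prediction, which is one plausible reason to combine the two sources; the actual combination rule used in the article is not stated in this record.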