End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques
Geotechnical investigation reports are generated for infrastructure projects prior to construction, typically in inconsistent and unstructured formats, with engineering properties contained in figures, tables, and other subcomponents. However, extracting specific information requires manual review,...
| Main Authors: | Jimin Park, Wanhyuk Seo, Tae Sup Yun |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-10-01 |
| Series: | Developments in the Built Environment |
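| Subjects: | Data extraction; Geotechnical investigation report; Convolutional neural network; Text mining technique; Data management; Digitization |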
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666165925001334 |
| _version_ | 1849393509321146368 |
|---|---|
| author | Jimin Park; Wanhyuk Seo; Tae Sup Yun |
| author_facet | Jimin Park; Wanhyuk Seo; Tae Sup Yun |
| author_sort | Jimin Park |
| collection | DOAJ |
| description | Geotechnical investigation reports are generated for infrastructure projects prior to construction, typically in inconsistent and unstructured formats, with engineering properties contained in figures, tables, and other subcomponents. Extracting specific information from them requires manual review, which is time-consuming and prone to human error. This study proposes an automated framework that converts unstructured geotechnical reports into structured digital databases by combining artificial intelligence, text mining techniques, and rule-based algorithms. The framework begins with page classification using a hybrid approach that combines a convolutional neural network with a text mining algorithm, followed by page layout analysis to identify components such as title, text, table, and figure. Based on the layout, systematic rule-based data extraction generates structured databases, which improves data flexibility and supports further applications in practice. The proposed framework extracts data from the test set within seconds and without errors. It can be extended to other unstructured engineering documents, enhancing data-driven processes in construction projects. |
| format | Article |
| id | doaj-art-3fdf5180a7c6495b9ddf2804801b2d8d |
| institution | Kabale University |
| issn | 2666-1659 |
| language | English |
| publishDate | 2025-10-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Developments in the Built Environment |
| spelling | doaj-art-3fdf5180a7c6495b9ddf2804801b2d8d; 2025-08-20T03:40:24Z; eng; Elsevier; Developments in the Built Environment; 2666-1659; 2025-10-01; 23; 100733; 10.1016/j.dibe.2025.100733; End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques; Jimin Park (0), Wanhyuk Seo (1), Tae Sup Yun (2); (0) School of Civil and Environmental Engineering, Yonsei University, Seoul, 03722, Republic of Korea; (1) School of Civil and Environmental Engineering, Yonsei University, Seoul, 03722, Republic of Korea; (2) Corresponding author.; School of Civil and Environmental Engineering, Yonsei University, Seoul, 03722, Republic of Korea; Geotechnical investigation reports have been generated for infrastructure projects prior to construction, typically in inconsistent and unstructured formats containing engineering properties in figures, tables, and other subcomponents. However, extracting specific information requires manual review, which is time-consuming and prone to human error. This study proposed an automated framework that converts unstructured geotechnical reports into structured digital databases, leveraging artificial intelligence, text mining techniques, and rule-based algorithms. The framework begins with page classification using a hybrid approach combining a convolutional neural network and a text mining algorithm, followed by page layout analysis to determine components such as title, text, table, and figure. Based on the layout, systematic rule-based data extraction generates structured databases, which enhances data flexibility and further applications in practice. The proposed framework efficiently extracts data from the test set within seconds without errors. It can be extended to other unstructured engineering documents, enhancing data-driven processes in construction projects.; http://www.sciencedirect.com/science/article/pii/S2666165925001334; Data extraction; Geotechnical investigation report; Convolutional neural network; Text mining technique; Data management; Digitization |
| title | End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques |
| topic | Data extraction; Geotechnical investigation report; Convolutional neural network; Text mining technique; Data management; Digitization |
| url | http://www.sciencedirect.com/science/article/pii/S2666165925001334 |