End-to-end data extraction framework from unstructured geotechnical investigation reports via integrated deep learning and text mining techniques
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-10-01 |
| Series: | Developments in the Built Environment |
| Subjects: | |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2666165925001334 |
| Summary: | Geotechnical investigation reports are generated for infrastructure projects prior to construction, typically in inconsistent, unstructured formats in which engineering properties are scattered across figures, tables, and other subcomponents. Extracting specific information from them requires manual review, which is time-consuming and prone to human error. This study proposes an automated framework that converts unstructured geotechnical reports into structured digital databases by combining artificial intelligence, text mining techniques, and rule-based algorithms. The framework begins with page classification using a hybrid approach that combines a convolutional neural network with a text mining algorithm, followed by page layout analysis to identify components such as titles, text, tables, and figures. Based on the layout, systematic rule-based data extraction generates structured databases, which improves data flexibility and supports further applications in practice. The proposed framework extracts data from the test set within seconds and without errors. It can be extended to other unstructured engineering documents, enhancing data-driven processes in construction projects. |
| ISSN: | 2666-1659 |
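
The summary describes a hybrid page classifier followed by rule-based extraction. The record gives no implementation details, so the following is only a minimal sketch of that general idea, not the authors' method. Everything in it is an assumption made for illustration: the keyword list, the `Page` type, the `cnn_score` field (standing in for the output of an image classifier that is not implemented here), the weighted-average fusion, and the regular expression for depth and SPT N-value lines.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list; the vocabulary used in the paper is not given in this record.
BORING_LOG_KEYWORDS = ("boring log", "spt", "n-value", "depth")


@dataclass
class Page:
    text: str         # OCR or embedded text of the page
    cnn_score: float  # assumed probability from an image classifier, in [0, 1]


def keyword_score(text: str) -> float:
    """Return the fraction of the keywords that occur in the page text."""
    lowered = text.lower()
    hits = sum(1 for kw in BORING_LOG_KEYWORDS if kw in lowered)
    return hits / len(BORING_LOG_KEYWORDS)


def is_target_page(page: Page, weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Hybrid decision: weighted average of the image score and the keyword score."""
    combined = weight * page.cnn_score + (1 - weight) * keyword_score(page.text)
    return combined >= threshold


# Illustrative rule for lines shaped like "<depth> m ... N = <count>".
SPT_RULE = re.compile(r"(?P<depth>\d+(?:\.\d+)?)\s*m\b.*?N\s*=\s*(?P<n>\d+)")


def extract_spt_records(page: Page) -> list[dict]:
    """Apply the rule line by line and collect structured records."""
    records = []
    for line in page.text.splitlines():
        match = SPT_RULE.search(line)
        if match:
            records.append(
                {"depth_m": float(match["depth"]), "n_value": int(match["n"])}
            )
    return records


log_page = Page(
    text="Boring Log BH-1\nDepth 1.5 m SPT N = 12\nDepth 3.0 m SPT N = 18",
    cnn_score=0.9,
)
cover_page = Page(text="Project overview and site description", cnn_score=0.2)

if is_target_page(log_page):
    print(extract_spt_records(log_page))
```

In this sketch only pages that pass the combined score are handed to the extraction rules, which mirrors the classify-then-extract ordering in the summary. The layout analysis step between the two is omitted.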