Semantic Schema Extraction in NoSQL Databases using BERT Embeddings

Bibliographic Details
Main Authors: Saad Belefqih, Ahmed Zellou, Mouna Berquedich
Format: Article
Language: English
Published: Ubiquity Press 2024-12-01
Series: Data Science Journal
Online Access: https://account.datascience.codata.org/index.php/up-j-dsj/article/view/1688
Description
Summary: NoSQL databases, valued for flexibility and scalability, pose analytics challenges due to their schema-less nature. Automatic schema extraction is crucial, with existing techniques limited in handling nested structures. Leveraging Natural Language Processing (NLP) advancements, this paper introduces a novel BERT Embeddings-Based approach for extracting schemas from NoSQL databases. The method analyzes semantic relationships within triplets from JSON documents through four stages: triplet extraction, preprocessing, BERT Embedding generation, and similarity analysis. Evaluation on real datasets demonstrates over 83% accuracy in extracting valid nested schema components. The study reveals interdisciplinary intersections, using NLP to unveil structures in scenarios lacking explicit schemas, showcasing significant potential for autonomous schema extraction from raw, unstructured data formats.
ISSN: 1683-1470
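
The four stages named in the Summary (triplet extraction, preprocessing, embedding generation, similarity analysis) can be illustrated with a rough sketch. This is not the authors' implementation. The record does not define the triplet format, so (parent, key, value type) is assumed here. To keep the example runnable without downloading a model, a toy character-trigram vector (`toy_embed`) stands in for BERT embeddings. All function names, the sample documents, and the 0.8 threshold are hypothetical.

```python
import json
import math
import re
from collections import Counter


def type_name(value):
    """Map a JSON value to a coarse type label."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "null"


def extract_triplets(doc, parent="root"):
    """Stage 1: collect (parent, key, value_type) triplets (assumed format)."""
    triplets = []
    if isinstance(doc, dict):
        for key, value in doc.items():
            triplets.append((parent, key, type_name(value)))
            triplets.extend(extract_triplets(value, key))
    elif isinstance(doc, list):
        for item in doc:
            triplets.extend(extract_triplets(item, parent))
    return triplets


def preprocess(triplet):
    """Stage 2: split camelCase and snake_case, lowercase, join as one phrase."""
    words = []
    for part in triplet:
        part = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", part)
        words.extend(re.split(r"[\W_]+", part.lower()))
    return " ".join(w for w in words if w)


def toy_embed(text):
    """Stage 3 stand-in: character-trigram counts instead of a BERT vector."""
    padded = f" {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_similar(triplets, threshold=0.8):
    """Stage 4: greedily group triplets whose vectors are close enough."""
    groups = []  # each entry: (representative vector, member triplets)
    for triplet in dict.fromkeys(triplets):  # de-duplicate, keep order
        vec = toy_embed(preprocess(triplet))
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(triplet)
                break
        else:
            groups.append((vec, [triplet]))
    return [members for _, members in groups]


# Hypothetical sample documents with inconsistent field naming.
docs = [
    json.loads('{"userName": "ada", "address": {"city": "Paris"}}'),
    json.loads('{"user_name": "bob", "address": {"city": "Rabat", "zip": "10000"}}'),
]
triplets = [t for d in docs for t in extract_triplets(d)]
groups = group_similar(triplets)
for members in groups:
    print(members)
```

In this toy run, `userName` and `user_name` fall into the same group because preprocessing normalizes both to the same phrase. A real BERT encoder would replace `toy_embed` and return dense vectors, so `cosine` would need to operate on those instead of trigram counts.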