Fine-Hybrid: Integration of BM25 And Finetuned SBERT to Enhance Search Relevance

Bibliographic Details
Main Authors: Wan Ahmad Gazali Kodri, Muhammad Haris, Rifqi Fitriadi
Format: Article
Language: English
Published: Center for Research and Community Service, Institut Informatika Indonesia Surabaya 2025-07-01
Series: Teknika
Online Access: https://ejournal.ikado.ac.id/index.php/teknika/article/view/1229
collection DOAJ
description Legal information retrieval, particularly for tax law documents, faces significant challenges due to specialized terminology, complex hierarchical structures, and formal language patterns that existing search approaches inadequately address. Current methods either rely on lexical matching or use general semantic models, creating a critical gap in effectively retrieving relevant tax law information. This research develops a novel hybrid search system to enhance search result relevance for the General Provisions and Tax Procedures (KUP) dataset by integrating a lexical-based search method (BM25) with semantic search using Sentence-BERT (SBERT) that has been fine-tuned using a taxation corpus. Our methodology encompasses several innovative components: development of synthetic data using a two-stage LLM prompting approach for SBERT fine-tuning, implementation of a comprehensive query normalization system with taxation-specific terminology mapping, and integration of lexical and semantic results through Reciprocal Rank Fusion (RRF). We evaluate system performance with inputs from tax domain experts, demonstrating that the Fine-hybrid model consistently outperforms individual search methods, achieving a Precision@N of 66.021% and Average Recall of 76.51%. Our approach addresses the specific challenges of tax document retrieval while providing a generalizable framework applicable to other specialized domains with similar characteristics. This research contributes both theoretical advancements in hybrid search methodologies for legal documents and practical solutions for improving tax information accessibility, with implications for enhancing administrative efficiency and taxpayer compliance.
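The fusion step named in the description, Reciprocal Rank Fusion (RRF), can be illustrated with a minimal sketch. This is not the authors' implementation. The function name, the document identifiers, and the smoothing constant k = 60 (a commonly used default for RRF) are assumptions made for illustration, since the record does not state them. Each document receives the sum of 1 / (k + rank) over every ranking in which it appears, so documents ranked well by both the lexical (BM25) and the semantic (SBERT) retriever rise to the top.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed value; not given in the record).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers.
bm25_ranking = ["art_9", "art_13", "art_7"]
sbert_ranking = ["art_13", "art_21", "art_9"]

fused = reciprocal_rank_fusion([bm25_ranking, sbert_ranking])
print(fused)
```

Because RRF uses only rank positions, the BM25 scores and SBERT cosine similarities never need to be normalized onto a common scale before they are combined.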
id doaj-art-62fec3c2eb6c4d0989a930f10ffd5e1d
institution Kabale University
issn 2549-8037
2549-8045
doi 10.34148/teknika.v14i2.1229
volume 14
issue 2
topic Hybrid Search
BM25
SBERT
RRF
Generative AI