An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain

Large language models (LLMs), built around natural language processing, perform information retrieval through intelligent question answering (Q&A). They have a wide range of application scenarios and are commonly regarded as a form of generative AI. However, in generation tasks, and especially in vertical domains, base LLMs with insufficient overall performance often produce inaccurate results because of poor generalization, giving rise to the so-called “hallucination” phenomenon. To address these problems, this study develops an enhanced retrieval scheme for LLMs, named BM-RAGAM (BM25 retrieval-augmented generation attention mechanism). The scheme constructs a vectorized knowledge base and applies a hybrid retrieval strategy that combines keyword-based search matching with attention-mechanism-enhanced semantic association, taking ocean-front and eddy knowledge in oceanography as the example domain. It performs exact word-level matching with the BM25 algorithm and text generation through semantically enhanced association using RAG, and it was used to build a vector database of textual knowledge on ocean fronts and eddies. The outputs of the proposed scheme were compared with those of the base LLM, Qwen2-72B, and an ablation experiment was conducted. The results show that the proposed scheme greatly reduces hallucinations during text generation and makes the outputs more interpretable.

Bibliographic Details
Main Authors: Qi Chen, Weifeng Zhou, Jian Cheng, Ji Yang
Author Affiliations: School of Information Engineering, Zhejiang Ocean University, Zhoushan 316022, China (Qi Chen, Jian Cheng, Ji Yang); East China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Shanghai 200090, China (Weifeng Zhou)
Format: Article
Language: English
Published: MDPI AG, 2024-12-01
Series: Applied Sciences, Vol. 14, No. 24, Article 11529
ISSN: 2076-3417
DOI: 10.3390/app142411529
Subjects: large language model; information retrieval; BM25; retrieval-augmented generation
Online Access: https://www.mdpi.com/2076-3417/14/24/11529
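
Note on the retrieval strategy: the abstract describes a hybrid scheme that couples probabilistic keyword matching (BM25) with semantic association over a vectorized knowledge base before the retrieved passages are handed to the LLM. The Python sketch below is only a minimal illustration of such a fusion step, not the authors' implementation: it scores a toy corpus with Okapi BM25, min-max-normalizes those scores together with precomputed dense-vector similarities, and blends them with a weight alpha. The toy corpus, the dense_sims placeholder, the fusion weight, and all function names are assumptions introduced here; the paper does not publish these details, and the attention-mechanism component of BM-RAGAM is not modeled.

Example (Python):

import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document in `docs` for one tokenized query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # a term absent from the corpus contributes nothing
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            denom = tf[t] + k1 * (1.0 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1.0) / denom
        scores.append(s)
    return scores


def minmax(xs):
    """Scale a score list to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_terms, docs, dense_sims, alpha=0.5, top_k=3):
    """Fuse normalized BM25 scores with dense-vector similarities.

    dense_sims holds the cosine similarity between the query embedding and each
    document embedding (e.g. from a vectorized knowledge base); how those
    embeddings are produced is left abstract in this sketch.
    """
    sparse = minmax(bm25_scores(query_terms, docs))
    dense = minmax(dense_sims)
    fused = [alpha * s + (1.0 - alpha) * d for s, d in zip(sparse, dense)]
    ranked = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
    return ranked[:top_k]  # indices of the passages to place in the RAG prompt


if __name__ == "__main__":
    # Toy corpus of pre-tokenized passages about ocean fronts and eddies.
    docs = [
        "an ocean front is a boundary between two distinct water masses".split(),
        "mesoscale eddies transport heat and salt across the ocean".split(),
        "bm25 ranks documents by term frequency and inverse document frequency".split(),
    ]
    query = "what is an ocean front".split()
    # Placeholder similarities; a real system would compute them from embeddings.
    dense_sims = [0.82, 0.41, 0.05]
    print(hybrid_rank(query, docs, dense_sims, alpha=0.5, top_k=2))

In a full RAG pipeline, the passages returned by hybrid_rank would be concatenated into the prompt given to the base model (Qwen2-72B in the paper's experiments), which is where the reported reduction in hallucinations comes from.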