FaN-REMs: Fair and Normalized Retrieval Evaluation Metrics for Learning Retrieval Systems
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10794782/ |
| Summary: | Retrieval evaluation metrics are vital for resilient artificial intelligence (AI) retrieval systems and for AI subfields such as case-based reasoning (CBR). Despite extensive research in CBR over the decades, the field still lacks a specialized retrieval evaluation metric (REM). This study aims to critically investigate and devise retrieval evaluation metrics that are generic, fair, normalized, and non-deceptive, making them suitable for domains including learning retrieval systems such as CBR. It focuses on enhancing CBR retrievals by addressing key flaws in widely used metrics, including the normalized discounted cumulative gain at k ($nDCG\text{@}k$) metric in IR. The proposed method entails devising a fair and normalized (FaN) relevancy metric for a case retrieved against a test query, blending system-generated and oracle-assessed relevancies. This underpins three rank-based metrics: relevancy at k ($R\text{@}k$), average relevancy at k ($AR\text{@}k$), and mean average relevancy at k ($MAR\text{@}k$). These metrics are designed for both single- and multiple-query evaluations and offer a comprehensive retrieval analysis. Despite the inherent challenges of evaluating evaluation metrics, FaN-REMs demonstrated robust performance across plausible domain values for the FaN relevancy function. These metrics effectively assess retrievals across different implementations, similarity measures, and applications, with inherent normalization allowing comparisons across heterogeneous systems. These metrics were instrumental in developing a CBR-based clinical decision support system (CDSS) for the SupportPrim study in Norway, demonstrating the practical application and relevance of this research in real-world AI systems. FaN-REMs show promise as benchmark metrics for comparing various retrieval and CBR systems. Suitable for both set-based and rank-based evaluations, rank-based FaN-REMs demonstrate superior discriminatory capability. The experimental results affirm the viability of FaN-REMs in real-world CBR system development and maintenance. |
| ISSN: | 2169-3536 |
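The abstract names the rank-based metrics $R\text{@}k$, $AR\text{@}k$, and $MAR\text{@}k$ but gives no formulas. As a minimal sketch only, assuming (by analogy with the familiar P@k / AP@k / MAP@k family) that R@k is the relevancy of the item at rank k, AR@k averages relevancies over the top-k ranks, and MAR@k means AR@k over all test queries, the structure might look like this; the function names and the assumption that FaN relevancies are normalized scores in [0, 1] are illustrative, not taken from the paper:

```python
def r_at_k(relevancies, k):
    """R@k (assumed): relevancy of the item at rank k (1-indexed).

    `relevancies` is a ranked list of FaN relevancy scores, assumed
    here to be normalized to [0, 1] for a single query.
    """
    return relevancies[k - 1]


def ar_at_k(relevancies, k):
    """AR@k (assumed): mean relevancy over the top-k ranked items."""
    return sum(relevancies[:k]) / k


def mar_at_k(per_query_relevancies, k):
    """MAR@k (assumed): mean of AR@k across multiple test queries."""
    return sum(ar_at_k(r, k) for r in per_query_relevancies) / len(per_query_relevancies)


# Single-query evaluation: ranked relevancies for one test query.
scores = [1.0, 0.8, 0.5]
print(ar_at_k(scores, 3))  # averages the top-3 relevancies

# Multi-query evaluation: one ranked list per test query.
print(mar_at_k([[1.0, 0.8], [0.6, 0.4]], 2))
```

Because the assumed relevancies are already normalized, these aggregates stay in [0, 1], which is consistent with the abstract's claim that normalization enables comparisons across heterogeneous systems.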