It is divided into two main sections: constructing the RAG with LlamaIndex and assessing its performance using a combination of LlamaIndex and Langtest. Our primary focus is the evaluation step, which gauges the accuracy and reliability of the RAG application across a variety of queries and data sources.
To measure the effectiveness of the retriever component, we introduce LangtestRetrieverEvaluator, which relies on two key metrics: Hit Rate and Mean Reciprocal Rank (MRR). Hit Rate measures the fraction of queries for which the correct answer appears among the top-k retrieved documents, i.e., how often the retriever surfaces the right context at all. MRR, on the other hand, averages the reciprocal rank of the highest-placed relevant document across all queries, giving a more nuanced view of how well the retriever ranks relevant results. We use LlamaIndex's generate_question_context_pairs module to create relevant question and context pairs, which serve as the foundation for both the retrieval and response evaluation phases of the RAG system.
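A minimal sketch of this step might look as follows (assuming a LlamaIndex version that exposes the llama_index.core namespace; the "data" directory, chunk size, and choice of LLM are placeholders, not part of the original setup):

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.llms.openai import OpenAI

# Load the source documents and split them into nodes
documents = SimpleDirectoryReader("data").load_data()
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)

# Generate (question, context) pairs that act as ground truth for retrieval evaluation
qa_dataset = generate_question_context_pairs(
    nodes, llm=OpenAI(model="gpt-3.5-turbo"), num_questions_per_chunk=2
)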
The evaluation process employs a dual approach by assessing the system with both standard and perturbed queries generated through Langtest. This methodology ensures a thorough understanding of the retriever’s robustness and adaptability under various conditions, reflecting its performance in real-world scenarios.
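The evaluation snippet below assumes that a retriever and the qa_dataset from the previous step are already available; one minimal way to obtain such a retriever, again assuming the llama_index.core namespace, is:

from llama_index.core import VectorStoreIndex

# Index the same nodes and expose the index as a top-k retriever
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=2)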
from langtest.evaluation import LangtestRetrieverEvaluator

# Wrap the LlamaIndex retriever with Langtest's evaluator, tracking MRR and Hit Rate
retriever_evaluator = LangtestRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Apply text perturbations to the test queries to probe robustness
retriever_evaluator.setPerturbations("add_typo", "dyslexia_word_swap", "add_ocr_typo")

# Evaluate on the generated question-context pairs (async call)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

retriever_evaluator.display_results()
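Comparing the reported Hit Rate and MRR on the original queries with the scores obtained on their perturbed variants shows how much retrieval quality degrades when queries contain typos, word swaps, or OCR-style noise, which is the robustness signal this evaluation is designed to surface.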