This tutorial is divided into two main sections: constructing the RAG pipeline with LlamaIndex and assessing its performance using a combination of LlamaIndex and Langtest. Our primary focus is on evaluation as the pivotal step for gauging the accuracy and reliability of the RAG application across a variety of queries and data sources.
To measure the effectiveness of the retriever component, we introduce LangtestRetrieverEvaluator, which relies on two key metrics: Hit Rate and Mean Reciprocal Rank (MRR). Hit Rate measures the percentage of queries for which the correct answer appears within the top-k retrieved documents, essentially capturing the retriever's precision. MRR, on the other hand, averages the reciprocal rank of the highest-placed relevant document across all queries, providing a more nuanced view of the retriever's accuracy. We use LlamaIndex's generate_question_context_pairs module to create relevant question-context pairs, which serve as the foundation for both the retrieval and response evaluation phases of the RAG system.
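To make the two metrics concrete, here is a minimal sketch of how Hit Rate and MRR can be computed by hand from ranked retrieval results. This is purely illustrative and not the library's implementation; the document IDs and results are made up for the example.

```python
# Each entry pairs a query's expected document ID with the IDs the
# retriever returned, in rank order (top-k = 3 here).
results = [
    ("doc_a", ["doc_a", "doc_x", "doc_y"]),  # relevant doc at rank 1
    ("doc_b", ["doc_x", "doc_b", "doc_y"]),  # relevant doc at rank 2
    ("doc_c", ["doc_x", "doc_y", "doc_z"]),  # relevant doc not retrieved
]

def hit_rate(results):
    """Fraction of queries whose expected doc appears in the top-k list."""
    hits = sum(1 for expected, retrieved in results if expected in retrieved)
    return hits / len(results)

def mean_reciprocal_rank(results):
    """Average of 1/rank of the expected doc (0 when it is absent)."""
    total = 0.0
    for expected, retrieved in results:
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(results)

print(hit_rate(results))              # 2 of 3 queries hit -> 0.666...
print(mean_reciprocal_rank(results))  # (1 + 1/2 + 0) / 3 = 0.5
```

A retriever that always places the relevant document first would score 1.0 on both metrics; MRR penalizes relevant documents that are retrieved but ranked low.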
The evaluation process employs a dual approach by assessing the system with both standard and perturbed queries generated through Langtest. This methodology ensures a thorough understanding of the retriever’s robustness and adaptability under various conditions, reflecting its performance in real-world scenarios.
from langtest.evaluation import LangtestRetrieverEvaluator

# `retriever` is the LlamaIndex retriever built earlier, and `qa_dataset`
# is the output of generate_question_context_pairs.
retriever_evaluator = LangtestRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# Apply Langtest perturbations to the queries before evaluating.
retriever_evaluator.setPerturbations("add_typo", "dyslexia_word_swap", "add_ocr_typo")

# Evaluate on the generated dataset (run inside an async context).
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)