The following table provides an overview of the available benchmark dataset notebooks. Users can select any of the benchmark datasets listed below to test their LLMs; a minimal usage sketch is shown after the table.
Tutorial Description | Hub | Task | Open In Colab |
---|---|---|---|
OpenBookQA: Evaluate your model’s ability to answer questions that require complex reasoning and inference based on general knowledge, similar to an “open-book” exam. | OpenAI | Question-Answering | |
QuAC: Evaluate your model’s ability to answer questions given a conversational context, focusing on dialogue-based question answering. | OpenAI | Question-Answering | |
MMLU: Evaluate your model’s language understanding across different domains; the benchmark covers 57 subjects spanning STEM, the humanities, the social sciences, and more. | OpenAI | Question-Answering | |
TruthfulQA: Evaluate your model’s ability to answer questions accurately and truthfully, avoiding common misconceptions and false beliefs. | OpenAI | Question-Answering | |
NarrativeQA: Evaluate your model’s ability to comprehend and answer questions about long and complex narratives, such as stories or articles. | OpenAI | Question-Answering | |
HellaSwag: Evaluate your model’s ability to choose the most plausible completion of a sentence, testing commonsense inference. | OpenAI | Question-Answering | |
BBQ: Evaluate how your model responds to questions in the presence of social biases against protected classes across various social dimensions. | OpenAI | Question-Answering | |
NQ-open: Evaluate your model’s ability to answer open-domain questions from the Natural Questions dataset without a supporting passage. | OpenAI | Question-Answering | |
BoolQ: Evaluate the ability of your model to answer boolean questions (yes/no) based on a given passage or context. | OpenAI | Question-Answering | |
XSum: Evaluate your model’s ability to generate concise and informative summaries for long articles with the XSum dataset. | OpenAI | Summarization | |
LogiQA: Evaluate your model’s accuracy on Machine Reading Comprehension with Logical Reasoning questions. | OpenAI | Question-Answering | |
ASDiv: Evaluate your model’s ability to answer questions based on math word problems. | OpenAI | Question-Answering | |
BigBench: Evaluate your model’s performance on BigBench datasets by Google. | OpenAI | Question-Answering | |
MultiLexSum: Evaluate your model’s ability to generate concise and informative summaries for legal case contexts from the Multi-LexSum dataset. | OpenAI | Summarization | |
Legal-QA: Evaluate your model’s performance on legal question-answering datasets. | OpenAI | Legal-Tests | |
CommonsenseQA: Evaluate your model’s performance on the CommonsenseQA dataset. | OpenAI | Question-Answering | |
SIQA: Evaluate your model’s accuracy in understanding and reasoning about everyday social situations. | OpenAI | Question-Answering | |
PIQA: Evaluate your model’s performance on the PIQA dataset, which tests its ability to reason about everyday physical situations. | OpenAI | Question-Answering | |
FiQA: Evaluate your model’s performance on the FiQA dataset, a comprehensive and specialized resource designed for finance-related question-answering tasks. | OpenAI | Question-Answering | |
LiveQA: Evaluate your model’s performance on the medical question-answering task from TREC 2017 LiveQA. | OpenAI | Question-Answering | |
MedicationQA: Evaluate your model’s performance on the MedicationQA dataset, which bridges the gap between consumers’ medication questions and trusted answers. | OpenAI | Question-Answering | |
HealthSearchQA: Evaluate your model’s performance on HealthSearchQA, a consumer health question-answering dataset introduced in the “Large Language Models Encode Clinical Knowledge” paper. | OpenAI | Question-Answering | |
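
Each notebook follows the same pattern: pick one of the benchmarks above as the data source, an OpenAI model as the hub, and a task type, then generate, run, and report the tests. The sketch below shows how such a run is typically wired up, assuming the notebooks are built on the langtest `Harness` API; the model name, `data_source`, and split values are illustrative assumptions, so check the corresponding notebook for the exact arguments.

```python
import os
from langtest import Harness  # assumption: the notebooks use the langtest library

# An OpenAI API key is required when using the "openai" hub.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# Illustrative setup: BoolQ (yes/no questions) on the question-answering task.
# Swap `data_source` for any benchmark in the table above (e.g. "MMLU", "TruthfulQA", "XSum").
harness = Harness(
    task="question-answering",  # use "summarization" for the XSum / MultiLexSum rows
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},  # assumed model name
    data={"data_source": "BoolQ", "split": "test-tiny"},  # assumed dataset split
)

# Generate test cases, run them against the model, and print the pass/fail report.
harness.generate().run().report()
```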