The following table provides an overview of the available benchmark dataset notebooks. Users can select any of the benchmark datasets listed below to test their LLMs; a minimal usage sketch is shown after the table.
Tutorial Description | Hub | Task | Open In Colab |
---|---|---|---|
OpenBookQA: Evaluate your model’s ability to answer questions that require complex reasoning and inference based on general knowledge, similar to an “open-book” exam. | OpenAI | Question-Answering | |
QuAC: Evaluate your model’s ability to answer questions given a conversational context, focusing on dialogue-based question answering. | OpenAI | Question-Answering | |
MMLU: Evaluate your model’s language understanding across different domains; the benchmark covers 57 subjects spanning STEM, the humanities, the social sciences, and more. | OpenAI | Question-Answering | |
TruthfulQA: Evaluate your model’s ability to answer questions accurately and truthfully, avoiding common misconceptions and false beliefs. | OpenAI | Question-Answering | |
NarrativeQA: Evaluate your model’s ability to comprehend and answer questions about long and complex narratives, such as stories or articles. | OpenAI | Question-Answering | |
HellaSwag: Evaluate your model’s ability to choose the most plausible completion of a sentence, testing commonsense inference. | OpenAI | Question-Answering | |
BBQ: Evaluate how your model responds to questions in the presence of social biases against protected classes across various social dimensions. | OpenAI | Question-Answering | |
NQ-open: Evaluate your model’s ability to answer open-domain questions from the Natural Questions dataset without a supporting passage. | OpenAI | Question-Answering | |
BoolQ: Evaluate the ability of your model to answer boolean questions (yes/no) based on a given passage or context. | OpenAI | Question-Answering | |
XSum: Evaluate your model’s ability to generate concise and informative summaries for long articles with the XSum dataset. | OpenAI | Summarization | |
LogiQA: Evaluate your model’s accuracy on Machine Reading Comprehension with Logical Reasoning questions. | OpenAI | Question-Answering | |
ASDiv: Evaluate your model’s ability to answer questions based on math word problems. | OpenAI | Question-Answering | |
BigBench: Evaluate your model’s performance on BigBench datasets by Google. | OpenAI | Question-Answering | |
MultiLexSum: Evaluate your model’s ability to generate concise and informative summaries for legal case contexts from the Multi-LexSum dataset. | OpenAI | Summarization | |
Legal-QA: Evaluate your model’s performance on legal question-answering datasets. | OpenAI | Legal-Tests | |
CommonsenseQA: Evaluate your model’s performance on the CommonsenseQA dataset. | OpenAI | Question-Answering | |
SIQA: Evaluate your model’s accuracy in understanding and reasoning about everyday social situations. | OpenAI | Question-Answering | |
PIQA: Evaluate your model’s performance on the PIQA dataset, which tests its ability to reason about everyday physical situations. | OpenAI | Question-Answering | |
FiQA: Evaluate your model’s performance on the FiQA dataset, a comprehensive and specialized resource designed for finance-related question-answering tasks. | OpenAI | Question-Answering | |
LiveQA: Evaluate your model’s performance on the medical question-answering task from TREC 2017 LiveQA. | OpenAI | Question-Answering | |
MedicationQA: Evaluate your model’s performance on the MedicationQA dataset, which bridges the gap between consumers’ medication questions and trusted answers. | OpenAI | Question-Answering | |
HealthSearchQA: Evaluate your model’s performance on HealthSearchQA, a consumer health question-answering dataset introduced in the “Large Language Models Encode Clinical Knowledge” paper. | OpenAI | Question-Answering | |
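
Each notebook follows the same pattern: pick one of the benchmarks above as the data source, an OpenAI model as the hub, and a task type, then generate, run, and report the tests. The sketch below shows how such a run is typically wired up, assuming the notebooks are built on the langtest `Harness` API; the model name, `data_source`, and split values are illustrative assumptions, so check the corresponding notebook for the exact arguments.

```python
import os
from langtest import Harness  # assumption: the notebooks use the langtest library

# An OpenAI API key is required when using the "openai" hub.
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# Illustrative setup: BoolQ (yes/no questions) on the question-answering task.
# Swap `data_source` for any benchmark in the table above (e.g. "MMLU", "TruthfulQA", "XSum").
harness = Harness(
    task="question-answering",  # use "summarization" for the XSum / MultiLexSum rows
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},  # assumed model name
    data={"data_source": "BoolQ", "split": "test-tiny"},  # assumed dataset split
)

# Generate test cases, run them against the model, and print the pass/fail report.
harness.generate().run().report()
```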