The following table provides an overview of the available benchmark dataset notebooks. Users can select from the benchmark datasets listed below to test their LLMs; a short evaluation sketch follows the table.
| Tutorial Description | Hub | Task |
|---|---|---|
| OpenBookQA: Evaluate your model’s ability to answer questions that require complex reasoning and inference based on general knowledge, similar to an “open-book” exam. | OpenAI | Question-Answering |
| QuAC: Evaluate your model’s ability to answer questions given a conversational context, focusing on dialogue-based question-answering. | OpenAI | Question-Answering |
| MMLU: Evaluate language understanding models’ performance across different domains. It covers 57 subjects spanning STEM, the humanities, the social sciences, and more. | OpenAI | Question-Answering |
| TruthfulQA: Evaluate the model’s capability to answer questions accurately and truthfully based on the provided information. | OpenAI | Question-Answering |
| NarrativeQA: Evaluate your model’s ability to comprehend and answer questions about long and complex narratives, such as stories or articles. | OpenAI | Question-Answering |
| HellaSwag: Evaluate your model’s ability to complete sentences. | OpenAI | Question-Answering |
| BBQ: Evaluate how your model responds to questions in the presence of social biases against protected classes across various social dimensions. | OpenAI | Question-Answering |
| NQ-open: Evaluate the ability of your model to answer open-ended questions based on a given passage or context. | OpenAI | Question-Answering |
| BoolQ: Evaluate the ability of your model to answer boolean questions (yes/no) based on a given passage or context. | OpenAI | Question-Answering |
| XSum: Evaluate your model’s ability to generate concise and informative summaries of long articles with the XSum dataset. | OpenAI | Summarization |
| LogiQA: Evaluate your model’s accuracy on machine reading comprehension questions that require logical reasoning. | OpenAI | Question-Answering |
| ASDiv: Evaluate your model’s ability to answer questions based on math word problems. | OpenAI | Question-Answering |
| BigBench: Evaluate your model’s performance on the BigBench datasets by Google. | OpenAI | Question-Answering |
| MultiLexSum: Evaluate your model’s ability to generate concise and informative summaries for legal case contexts from the Multi-LexSum dataset. | OpenAI | Summarization |
| Legal-QA: Evaluate your model’s performance on legal question-answering datasets. | OpenAI | Legal-Tests |
| CommonsenseQA: Evaluate your model’s performance on the CommonsenseQA dataset. | OpenAI | Question-Answering |
| SIQA: Evaluate your model’s performance by assessing its accuracy in understanding social situations. | OpenAI | Question-Answering |
| PIQA: Evaluate your model’s performance on the PIQA dataset, which tests its ability to reason about everyday physical situations. | OpenAI | Question-Answering |
| FiQA: Evaluate your model’s performance on the FiQA dataset, a comprehensive and specialized resource designed for finance-related question-answering tasks. | OpenAI | Question-Answering |
| LiveQA: Evaluate your model’s performance on the Medical Question Answering Task at TREC 2017 LiveQA. | OpenAI | Question-Answering |
| MedicationQA: Evaluate your model’s performance on the MedicationQA dataset, which aims to bridge the gap between consumers’ medication questions and trusted answers. | OpenAI | Question-Answering |
| HealthSearchQA: Evaluate your model’s performance on HealthSearchQA, a consumer health question-answering dataset introduced in the “Large Language Models Encode Clinical Knowledge” paper. | OpenAI | Question-Answering |
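
As a rough illustration of what running one of these benchmarks involves, below is a minimal sketch that scores a model on a small OpenBookQA sample. It assumes the Hugging Face `datasets` copy of the benchmark (`allenai/openbookqa`) and the `openai` Python client; `gpt-4o-mini` is only a placeholder model name, and the simple letter-matching accuracy is an assumption of this sketch, so the actual notebooks may wire the evaluation up differently.

```python
# Minimal sketch: scoring an OpenAI model on a small OpenBookQA sample.
# Assumes the Hugging Face copy of the benchmark ("allenai/openbookqa") and the
# `openai` Python client; the linked notebooks may use a different harness.
from datasets import load_dataset
from openai import OpenAI  # expects OPENAI_API_KEY in the environment

client = OpenAI()
openbookqa = load_dataset("allenai/openbookqa", "main", split="validation")

def answer(question: str, labels: list[str], options: list[str]) -> str:
    """Ask the model for the letter of the best answer choice."""
    choices = "\n".join(f"{label}. {text}" for label, text in zip(labels, options))
    prompt = (
        f"Question: {question}\n{choices}\n"
        "Reply with the letter of the correct option only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; substitute the model you want to test
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1].upper()

sample = openbookqa.select(range(20))  # small slice to keep the sketch cheap
correct = sum(
    answer(row["question_stem"], row["choices"]["label"], row["choices"]["text"])
    == row["answerKey"]
    for row in sample
)
print(f"Accuracy on {len(sample)} OpenBookQA questions: {correct / len(sample):.2%}")
```

Each notebook in the table presumably follows the same general pattern for its own dataset and task: load the benchmark, prompt the model through the chosen hub, and aggregate a score.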