Benchmark Dataset Notebooks

The following table provides an overview of the available benchmark dataset notebooks. Users can select from the benchmark datasets listed below to test their LLMs; a minimal evaluation sketch is provided after the table.

| Tutorial Description | Hub | Task | Open In Colab |
|---|---|---|---|
| OpenBookQA: Evaluate your model’s ability to answer questions that require complex reasoning and inference based on general knowledge, similar to an “open-book” exam. | OpenAI | Question-Answering | Open In Colab |
| QuAC: Evaluate your model’s ability to answer questions given a conversational context, focusing on dialogue-based question answering. | OpenAI | Question-Answering | Open In Colab |
| MMLU: Evaluate language understanding models’ performance across different domains. It covers 57 subjects across STEM, the humanities, the social sciences, and more. | OpenAI | Question-Answering | Open In Colab |
| TruthfulQA: Evaluate the model’s capability to answer questions accurately and truthfully based on the provided information. | OpenAI | Question-Answering | Open In Colab |
| NarrativeQA: Evaluate your model’s ability to comprehend and answer questions about long and complex narratives, such as stories or articles. | OpenAI | Question-Answering | Open In Colab |
| HellaSwag: Evaluate your model’s ability to complete sentences. | OpenAI | Question-Answering | Open In Colab |
| BBQ: Evaluate how your model responds to questions in the presence of social biases against protected classes across various social dimensions. | OpenAI | Question-Answering | Open In Colab |
| NQ-open: Evaluate the ability of your model to answer open-ended questions based on a given passage or context. | OpenAI | Question-Answering | Open In Colab |
| BoolQ: Evaluate the ability of your model to answer boolean (yes/no) questions based on a given passage or context. | OpenAI | Question-Answering | Open In Colab |
| XSum: Evaluate your model’s ability to generate concise and informative summaries for long articles with the XSum dataset. | OpenAI | Summarization | Open In Colab |
| LogiQA: Evaluate your model’s accuracy on machine reading comprehension questions that require logical reasoning. | OpenAI | Question-Answering | Open In Colab |
| ASDiv: Evaluate your model’s ability to answer questions based on math word problems. | OpenAI | Question-Answering | Open In Colab |
| BigBench: Evaluate your model’s performance on the BigBench datasets by Google. | OpenAI | Question-Answering | Open In Colab |
| MultiLexSum: Evaluate your model’s ability to generate concise and informative summaries for legal case contexts from the Multi-LexSum dataset. | OpenAI | Summarization | Open In Colab |
| Legal-QA: Evaluate your model’s performance on legal QA datasets. | OpenAI | Legal-Tests | Open In Colab |
| CommonsenseQA: Evaluate your model’s performance on the CommonsenseQA dataset. | OpenAI | Question-Answering | Open In Colab |
| SIQA: Evaluate your model’s performance by assessing its accuracy in understanding social situations. | OpenAI | Question-Answering | Open In Colab |
| PIQA: Evaluate your model’s performance on the PIQA dataset, which tests its ability to reason about everyday physical situations. | OpenAI | Question-Answering | Open In Colab |
| FiQA: Evaluate your model’s performance on the FiQA dataset, a comprehensive and specialized resource designed for finance-related question-answering tasks. | OpenAI | Question-Answering | Open In Colab |
| LiveQA: Evaluate your model’s performance on the medical question answering task from TREC 2017 LiveQA. | OpenAI | Question-Answering | Open In Colab |
| MedicationQA: Evaluate your model’s performance on the MedicationQA dataset, which bridges the gap between consumers’ medication questions and trusted answers. | OpenAI | Question-Answering | Open In Colab |
| HealthSearchQA: Evaluate your model’s performance on the HealthSearchQA dataset, a consumer health question-answering dataset introduced in the “Large Language Models Encode Clinical Knowledge” paper. | OpenAI | Question-Answering | Open In Colab |
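As a rough illustration of how any of these benchmarks can be used to test an LLM, the sketch below loads the BoolQ dataset with the Hugging Face `datasets` library and scores an OpenAI model on a small sample of yes/no questions. The dataset identifier, model name, prompt format, and sample size are illustrative assumptions and are not taken from the notebooks themselves.

```python
# A rough, self-contained sketch (not the notebooks' exact code):
# load the BoolQ benchmark and score an OpenAI model on a small sample.
# Dataset identifier, model name, and prompt format are illustrative assumptions.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
sample = load_dataset("boolq", split="validation").select(range(20))  # small sample

correct = 0
for example in sample:
    prompt = (
        f"Passage: {example['passage']}\n"
        f"Question: {example['question']}\n"
        "Answer with 'yes' or 'no' only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    predicted_yes = response.choices[0].message.content.strip().lower().startswith("yes")
    correct += int(predicted_yes == example["answer"])  # BoolQ's 'answer' field is a boolean

print(f"Accuracy on {len(sample)} BoolQ examples: {correct / len(sample):.2%}")
```

The same loop can be adapted to any of the question-answering datasets in the table by swapping the dataset identifier and adjusting the prompt and answer parsing.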