Other Benchmark Datasets

LangTest supports the additional benchmark datasets listed below for testing your models.

| Dataset | Task | Category | Source | Colab |
|---|---|---|---|---|
| ASDiv | question-answering | robustness, accuracy, fairness | A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers | Open In Colab |
| BBQ | question-answering | robustness, accuracy, fairness | BBQ Dataset: A Hand-Built Bias Benchmark for Question Answering | Open In Colab |
| Bigbench | question-answering | robustness, accuracy, fairness | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Open In Colab |
| BoolQ | question-answering | robustness, accuracy, fairness, bias | BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | Open In Colab |
| LogiQA | question-answering | robustness, accuracy, fairness | LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning | Open In Colab |
| MMLU | question-answering | robustness, accuracy, fairness | MMLU: Measuring Massive Multitask Language Understanding | Open In Colab |
| NarrativeQA | question-answering | robustness, accuracy, fairness | The NarrativeQA Reading Comprehension Challenge | Open In Colab |
| NQ-open | question-answering | robustness, accuracy, fairness | Natural Questions: A Benchmark for Question Answering Research | Open In Colab |
| Quac | question-answering | robustness, accuracy, fairness | Quac: Question Answering in Context | Open In Colab |
| TruthfulQA | question-answering | robustness, accuracy, fairness | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Open In Colab |
| XSum | summarization | robustness, accuracy, fairness, bias | Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization | Open In Colab |
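As a sketch of how one of these benchmarks might be wired up, the configuration below follows the pattern LangTest uses elsewhere in its docs: the dataset name from the table goes in a `data_source` field, alongside the task and model under test. The specific model name, hub value, and the `Harness` call shown in the comments are illustrative assumptions, not a guarantee of the current API; the linked Colab notebooks are authoritative.

```python
# A minimal sketch, assuming a dataset from the table above is selected via a
# `data_source` key in a LangTest Harness-style configuration. The model and
# hub values are hypothetical placeholders, not recommendations.
harness_config = {
    "task": "question-answering",                          # task column from the table
    "model": {"model": "gpt-3.5-turbo", "hub": "openai"},  # illustrative choice
    "data": {"data_source": "BoolQ"},                      # any dataset listed above
}

# With the langtest package installed, this dict would typically be unpacked
# into a Harness (assumed API; verify against the Colab notebooks):
#   from langtest import Harness
#   harness = Harness(**harness_config)
#   harness.generate().run().report()

print(harness_config["data"]["data_source"])
```

Swapping the `data_source` value (and, for XSum, the task to `summarization`) is all that changes between benchmarks.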