Available Benchmarks


LangTest supports many benchmark datasets for testing your models. These are generally aimed at LLMs and target different abilities, such as question answering and summarization. Each benchmark can also be evaluated across test categories such as robustness, accuracy, fairness, and bias.

| Dataset | Task | Category | Source | Colab |
|---------|------|----------|--------|-------|
| ASDiv | question-answering | robustness, accuracy, fairness | A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers | Open In Colab |
| BBQ | question-answering | robustness, accuracy, fairness | BBQ Dataset: A Hand-Built Bias Benchmark for Question Answering | Open In Colab |
| Bigbench | question-answering | robustness, accuracy, fairness | Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models | Open In Colab |
| BoolQ | question-answering | robustness, accuracy, fairness, bias | BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions | Open In Colab |
| CommonsenseQA | question-answering | robustness, accuracy, fairness | CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge | Open In Colab |
| FIQA | question-answering | robustness, accuracy, fairness | FIQA (Financial Opinion Mining and Question Answering) | Open In Colab |
| HellaSwag | question-answering | robustness, accuracy, fairness | HellaSwag: Can a Machine Really Finish Your Sentence? | Open In Colab |
| Consumer-Contracts | question-answering | robustness, accuracy, fairness | Answer yes/no questions on the rights and obligations created by clauses in terms of services agreements. | Open In Colab |
| Contracts | question-answering | robustness, accuracy, fairness | Answer yes/no questions about whether contractual clauses discuss particular issues. | Open In Colab |
| Privacy-Policy | question-answering | robustness, accuracy, fairness | Given a question and a clause from a privacy policy, determine if the clause contains enough information to answer the question. | Open In Colab |
| LogiQA | question-answering | robustness, accuracy, fairness | LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning | Open In Colab |
| MMLU | question-answering | robustness, accuracy, fairness | MMLU: Measuring Massive Multitask Language Understanding | Open In Colab |
| NarrativeQA | question-answering | robustness, accuracy, fairness | The NarrativeQA Reading Comprehension Challenge | Open In Colab |
| NQ-open | question-answering | robustness, accuracy, fairness | Natural Questions: A Benchmark for Question Answering Research | Open In Colab |
| OpenBookQA | question-answering | robustness, accuracy, fairness | Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering | Open In Colab |
| PIQA | question-answering | robustness | PIQA: Reasoning about Physical Commonsense in Natural Language | Open In Colab |
| Quac | question-answering | robustness, accuracy, fairness | Quac: Question Answering in Context | Open In Colab |
| SIQA | question-answering | robustness, accuracy, fairness | SocialIQA: Commonsense Reasoning about Social Interactions | Open In Colab |
| TruthfulQA | question-answering | robustness, accuracy, fairness | TruthfulQA: Measuring How Models Mimic Human Falsehoods | Open In Colab |
| XSum | summarization | robustness, accuracy, fairness, bias | Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization | Open In Colab |
| MultiLexSum | summarization | robustness, accuracy, fairness | Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities | Open In Colab |
| MedMCQA | question-answering | robustness, accuracy, fairness | MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering | Open In Colab |
| MedQA | question-answering | robustness, accuracy, fairness | What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams | Open In Colab |
| PubMedQA | question-answering | robustness, accuracy, fairness | PubMedQA: A Dataset for Biomedical Research Question Answering | Open In Colab |
| LiveQA | question-answering | robustness | Overview of the Medical Question Answering Task at TREC 2017 LiveQA | Open In Colab |
| MedicationQA | question-answering | robustness | Bridging the Gap Between Consumers' Medication Questions and Trusted Answers | Open In Colab |
| HealthSearchQA | question-answering | robustness | Large Language Models Encode Clinical Knowledge | Open In Colab |