Available Tests

 

The tables presented below offer a comprehensive overview of diverse categories and tests, providing valuable insights into the varied testing procedures.

Accuracy Tests

Accuracy testing is vital for evaluating a machine learning model’s performance. It gauges the model’s ability to predict outcomes on an unseen test dataset by comparing predicted and actual outputs. Several tests, including labelwise metrics (precision, recall, F1 score) and overall metrics (micro F1, macro F1, weighted F1), support this assessment.

Test Category Test Name Suppoted Tasks
Accuracy Min F1 Score ner, text-classification
Accuracy Min Macro-F1 Score ner, text-classification
Accuracy Min Micro-F1 Score ner, text-classification
Accuracy Min Precision Score ner, text-classification
Accuracy Min Recall Score ner, text-classification
Accuracy Min Weighted-F1 Score ner, text-classification
Accuracy Min Exact Match Score question-answering, summarization
Accuracy Min BLEU Score question-answering, summarization
Accuracy Min Rouge1 Score question-answering, summarization
Accuracy Min Rouge2 Score question-answering, summarization
Accuracy Min RougeL Score question-answering, summarization
Accuracy Min RougeLsum Score question-answering, summarization

Bias Tests

The primary goal of model bias tests is to assess how well a model aligns its predictions with actual outcomes. Model bias, the systematic skewing of results, can lead to negative consequences such as perpetuating stereotypes or discrimination. In this context, the objective is to explore how replacing documents with different genders, ethnicities, religions, or countries impacts the model’s predictions compared to the original training set.

Test Category Test Name Suppoted Tasks
Bias Replace To Asian First Names ner, text-classification, question-answering, summarization
Bias Replace To Asian Last Names ner, text-classification, question-answering, summarization
Bias Replace To Black First Names ner, text-classification, question-answering, summarization
Bias Replace To Black Last Names ner, text-classification, question-answering, summarization
Bias Replace To Buddhist Names ner, text-classification, question-answering, summarization
Bias Replace To Christian Names ner, text-classification, question-answering, summarization
Bias Replace To Female Pronouns ner, text-classification, question-answering, summarization
Bias Replace To High Income Country ner, text-classification, question-answering, summarization
Bias Replace To Hindu Names ner, text-classification, question-answering, summarization
Bias Replace To Hispanic First Names ner, text-classification, question-answering, summarization
Bias Replace To Hispanic Last Names ner, text-classification, question-answering, summarization
Bias Replace To Interracial Last Names ner, text-classification, question-answering, summarization
Bias Replace To Jain Names ner, text-classification, question-answering, summarization
Bias Replace To Lower Middle Income Country ner, text-classification, question-answering, summarization
Bias Replace To Low Income Country ner, text-classification, question-answering, summarization
Bias Replace To Male Pronouns ner, text-classification, question-answering, summarization
Bias Replace To Muslim Names ner, text-classification, question-answering, summarization
Bias Replace To Native American Last Names ner, text-classification, question-answering, summarization
Bias Replace To Neutral Pronouns ner, text-classification, question-answering, summarization
Bias Replace To Parsi Names ner, text-classification, question-answering, summarization
Bias Replace To Sikh Names ner, text-classification, question-answering, summarization
Bias Replace To Upper Middle Income Country ner, text-classification, question-answering, summarization
Bias Replace To White First Names ner, text-classification, question-answering, summarization
Bias Replace To White Last Names ner, text-classification, question-answering, summarization

Fairness Tests

The core objective of fairness testing is to evaluate a machine learning model’s performance without bias, especially in contexts with implications for specific groups. This testing ensures that the model does not favor or discriminate against any group, aiming for unbiased results across all groups. Various tests, including those focused on attributes like gender, support this evaluation.

Test Category Test Name Suppoted Tasks
Fairness Max Gender F1 Score ner, text-classification
Fairness Min Gender F1 Score ner, text-classification
Fairness Min Gender Rouge1 Score question-answering, summarization
Fairness Min Gender Rouge2 Score question-answering, summarization
Fairness Min Gender RougeL Score question-answering, summarization
Fairness Min Gender RougeLSum Score question-answering, summarization
Fairness Max Gender Rouge1 Score question-answering, summarization
Fairness Max Gender Rouge2 Score question-answering, summarization
Fairness Max Gender RougeL Score question-answering, summarization
Fairness Max Gender RougeLSum Score question-answering, summarization

Representation Tests

The goal of representation testing is to assess whether a dataset accurately represents a specific population or if biases within it could adversely affect the results of any analysis.

Test Category Test Name Suppoted Tasks
Representation Min Country Economic Representation Count ner, text-classification, question-answering, summarization
Representation Min Country Economic Representation Proportion ner, text-classification, question-answering, summarization
Representation Min Ethnicity Representation Count ner, text-classification, question-answering, summarization
Representation Min Ethnicity Representation Proportion ner, text-classification, question-answering, summarization
Representation Min Gender Representation Count ner, text-classification, question-answering, summarization
Representation Min Gender Representation Proportion ner, text-classification, question-answering, summarization
Representation Min Label Representation Count ner, text-classification
Representation Min Label Representation Proportion ner, text-classification
Representation Min Religion Name Representation Count ner, text-classification, question-answering, summarization
Representation Min Religion Name Representation Proportion ner, text-classification, question-answering, summarization

Robustness Tests

The primary goal of model robustness tests is to evaluate the model’s ability to maintain consistent levels of accuracy, precision, and recall when subjected to perturbations in the data it predicts. In tasks like Named Entity Recognition (NER), the focus is on assessing how variations in input data, such as documents with typos or fully uppercased sentences, impact the model’s prediction performance compared to documents similar to those in the original training set.

Test Category Test Name Suppoted Tasks
Robustness Add Context ner, text-classification, question-answering, summarization , translation
Robustness Add Contraction ner, text-classification, question-answering, summarization, translation
Robustness Add Punctuation ner, text-classification, question-answering, summarization , translation
Robustness Add Typo ner, text-classification, question-answering, summarization, translation
Robustness American to British ner, text-classification, question-answering, summarization, translation
Robustness British to American ner, text-classification, question-answering, summarization, translation
Robustness Lowercase ner, text-classification, question-answering, summarization, translation
Robustness Strip Punctuation ner, text-classification, question-answering, summarization, translation
Robustness Swap Entities ner
Robustness Titlecase ner, text-classification, question-answering, summarization, translation
Robustness Uppercase ner, text-classification, question-answering, summarization , translation
Robustness Number to Word ner, text-classification, question-answering, summarization, translation
Robustness Add OCR Typo ner, text-classification, question-answering, summarization. translation
Robustness Dyslexia Word Swap ner, text-classification, question-answering, summarization, translation
Robustness Add Slangs ner, text-classification, question-answering, summarization, translation
Robustness Add Speech to Text Typo ner, text-classification, question-answering, summarization, translation
Robustness Add Abbreviations ner, text-classification, question-answering, summarization
Robustness Multiple Perturbations text-classification, question-answering, summarization, translation
Robustness Adjective Synonym Swap ner, text-classification, question-answering, summarization, translation
Robustness Adjective Antonym Swap ner, text-classification, question-answering, summarization, translation
Robustness Strip All Punctution ner, text-classification, question-answering, summarization, translation
Robustness Randomize Age ner, text-classification, question-answering, summarization, translation

Toxicity Tests

The primary goal of toxicity tests is to assess the ideological toxicity score of a given text, specifically targeting demeaning speech based on political, philosophical, or social beliefs. This includes evaluating instances of hate speech rooted in individual ideologies, such as feminism, left-wing politics, or right-wing politics

Test Category Test Name Suppoted Tasks
Toxicity Offensive text-generation
Toxicity ideology text-generation
Toxicity lgbtqphobia text-generation
Toxicity racism text-generation
Toxicity sexism text-generation
Toxicity xenophobia text-generation

Sensitivity Tests

The primary objective of the sensitivity test is to assess the model’s responsiveness when introducing negation and toxic words, gauging its level of sensitivity in these scenarios.

Test Category Test Name Suppoted Tasks
Sensitivity Negation question-answering
Sensitivity Toxicity question-answering

Sycophancy Tests

The primary goal of addressing sycophancy in language models is to mitigate undesirable behaviors where models tailor their responses to align with a human user’s view, even when that view is not objectively correct.

Test Category Test Name Suppoted Tasks
Sycophancy Sycophancy Math question-answering
Sycophancy Sycophancy NLP question-answering

Stereotype Tests

The primary goal of stereotype tests is to evaluate how well models perform when confronted with common gender stereotypes, occupational stereotypes, or other prevailing biases. In these assessments, models are scrutinized for their propensity to perpetuate or challenge stereotypical associations, shedding light on their capacity to navigate and counteract biases in their predictions.

Test Category Test Name Suppoted Tasks
Stereotype gender-occupational-stereotype fill-mask
Stereotype common-stereotypes fill-mask

StereoSet Tests

The primary goal of StereoSet is to provide a comprehensive dataset and method for assessing bias in Language Models (LLMs). Utilizing pairs of sentences, StereoSet contrasts one sentence that embodies a stereotypic perspective with another that presents an anti-stereotypic view. This approach facilitates a nuanced evaluation of LLMs, shedding light on their sensitivity to and reinforcement or mitigation of stereotypical biases.”

Test Category Test Name Suppoted Tasks
StereoSet intersentence question-answering
StereoSet intrasentence question-answering

Ideology Tests

The Idology Test is a popular tool assessing political beliefs on a two-dimensional grid, surpassing the traditional left-right spectrum. The Political Compass aims for a nuanced understanding, avoiding oversimplification and capturing the full range of political opinions and beliefs.

Test Category Test Name Suppoted Tasks
Ideology Political Compass ideology

The primary goal of the Legal benchmark Test is to assess a model’s capacity to reason about the strength of support provided by a given case summary. This evaluation aims to gauge the model’s proficiency in legal reasoning and comprehension.

Test Category Test Name Suppoted Tasks
Legal legal-support question-answering

Clinical Test

The Clinical Test evaluates the model for potential demographic bias in suggesting treatment plans for two patients with identical diagnoses. This assessment aims to uncover and address any disparities in the model’s recommendations based on demographic factors.

Test Category Test Name Suppoted Tasks
Clinical demographic-bias text-generation

Security Test

The Security Test, featuring the Prompt Injection Attack, is designed to assess prompt injection vulnerabilities in Language Models (LLMs). This test specifically evaluates the model’s resilience against adversarial attacks, gauging its ability to handle sensitive information appropriately and ensuring robust security measures

Test Category Test Name Suppoted Tasks
Security prompt_injection_attack text-generation

Disinformation Test

The Disinformation Test aims to evaluate the model’s capacity to generate disinformation. By presenting the model with disinformation prompts, the experiment assesses whether the model produces content that aligns with the given input, providing insights into its susceptibility to generating misleading or inaccurate information.

Test Category Test Name Suppoted Tasks
Disinformation Narrative Wedging text-generation

Factuality Test

The Factuality Test is designed to evaluate the ability of language models (LLMs) to determine the factuality of statements within summaries. This test is particularly relevant for assessing the accuracy of LLM-generated summaries and understanding potential biases that might affect their judgments.

Test Category Test Name Suppoted Tasks
Factuality Order Bias question-answering
Last updated