The tables presented below offer a comprehensive overview of diverse categories and tests, providing valuable insights into the varied testing procedures.
Accuracy Tests
Accuracy testing is vital for evaluating a machine learning model’s performance. It gauges the model’s ability to predict outcomes on an unseen test dataset by comparing predicted and actual outputs. Several tests, including labelwise metrics (precision, recall, F1 score) and overall metrics (micro F1, macro F1, weighted F1), support this assessment.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Accuracy | Min F1 Score | ner , text-classification |
Accuracy | Min Macro-F1 Score | ner , text-classification |
Accuracy | Min Micro-F1 Score | ner , text-classification |
Accuracy | Min Precision Score | ner , text-classification |
Accuracy | Min Recall Score | ner , text-classification |
Accuracy | Min Weighted-F1 Score | ner , text-classification |
Accuracy | Min Exact Match Score | question-answering , summarization |
Accuracy | Min BLEU Score | question-answering , summarization |
Accuracy | Min Rouge1 Score | question-answering , summarization |
Accuracy | Min Rouge2 Score | question-answering , summarization |
Accuracy | Min RougeL Score | question-answering , summarization |
Accuracy | Min RougeLsum Score | question-answering , summarization |
Bias Tests
The primary goal of model bias tests is to assess how well a model aligns its predictions with actual outcomes. Model bias, the systematic skewing of results, can lead to negative consequences such as perpetuating stereotypes or discrimination. In this context, the objective is to explore how replacing documents with different genders, ethnicities, religions, or countries impacts the model’s predictions compared to the original training set.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Bias | Replace To Asian First Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Asian Last Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Black First Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Black Last Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Buddhist Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Christian Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Female Pronouns | ner , text-classification , question-answering , summarization |
Bias | Replace To High Income Country | ner , text-classification , question-answering , summarization |
Bias | Replace To Hindu Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Hispanic First Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Hispanic Last Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Interracial Last Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Jain Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Lower Middle Income Country | ner , text-classification , question-answering , summarization |
Bias | Replace To Low Income Country | ner , text-classification , question-answering , summarization |
Bias | Replace To Male Pronouns | ner , text-classification , question-answering , summarization |
Bias | Replace To Muslim Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Native American Last Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Neutral Pronouns | ner , text-classification , question-answering , summarization |
Bias | Replace To Parsi Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Sikh Names | ner , text-classification , question-answering , summarization |
Bias | Replace To Upper Middle Income Country | ner , text-classification , question-answering , summarization |
Bias | Replace To White First Names | ner , text-classification , question-answering , summarization |
Bias | Replace To White Last Names | ner , text-classification , question-answering , summarization |
Fairness Tests
The core objective of fairness testing is to evaluate a machine learning model’s performance without bias, especially in contexts with implications for specific groups. This testing ensures that the model does not favor or discriminate against any group, aiming for unbiased results across all groups. Various tests, including those focused on attributes like gender, support this evaluation.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Fairness | Max Gender F1 Score | ner , text-classification |
Fairness | Min Gender F1 Score | ner , text-classification |
Fairness | Min Gender Rouge1 Score | question-answering , summarization |
Fairness | Min Gender Rouge2 Score | question-answering , summarization |
Fairness | Min Gender RougeL Score | question-answering , summarization |
Fairness | Min Gender RougeLSum Score | question-answering , summarization |
Fairness | Max Gender Rouge1 Score | question-answering , summarization |
Fairness | Max Gender Rouge2 Score | question-answering , summarization |
Fairness | Max Gender RougeL Score | question-answering , summarization |
Fairness | Max Gender RougeLSum Score | question-answering , summarization |
Representation Tests
The goal of representation testing is to assess whether a dataset accurately represents a specific population or if biases within it could adversely affect the results of any analysis.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Representation | Min Country Economic Representation Count | ner , text-classification , question-answering , summarization |
Representation | Min Country Economic Representation Proportion | ner , text-classification , question-answering , summarization |
Representation | Min Ethnicity Representation Count | ner , text-classification , question-answering , summarization |
Representation | Min Ethnicity Representation Proportion | ner , text-classification , question-answering , summarization |
Representation | Min Gender Representation Count | ner , text-classification , question-answering , summarization |
Representation | Min Gender Representation Proportion | ner , text-classification , question-answering , summarization |
Representation | Min Label Representation Count | ner , text-classification |
Representation | Min Label Representation Proportion | ner , text-classification |
Representation | Min Religion Name Representation Count | ner , text-classification , question-answering , summarization |
Representation | Min Religion Name Representation Proportion | ner , text-classification , question-answering , summarization |
Robustness Tests
The primary goal of model robustness tests is to evaluate the model’s ability to maintain consistent levels of accuracy, precision, and recall when subjected to perturbations in the data it predicts. In tasks like Named Entity Recognition (NER), the focus is on assessing how variations in input data, such as documents with typos or fully uppercased sentences, impact the model’s prediction performance compared to documents similar to those in the original training set.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Robustness | Add Context | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Contraction | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Punctuation | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Typo | ner , text-classification , question-answering , summarization , translation |
Robustness | American to British | ner , text-classification , question-answering , summarization , translation |
Robustness | British to American | ner , text-classification , question-answering , summarization , translation |
Robustness | Lowercase | ner , text-classification , question-answering , summarization , translation |
Robustness | Strip Punctuation | ner , text-classification , question-answering , summarization , translation |
Robustness | Swap Entities | ner |
Robustness | Titlecase | ner , text-classification , question-answering , summarization , translation |
Robustness | Uppercase | ner , text-classification , question-answering , summarization , translation |
Robustness | Number to Word | ner , text-classification , question-answering , summarization , translation |
Robustness | Add OCR Typo | ner , text-classification , question-answering , summarization . translation |
Robustness | Dyslexia Word Swap | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Slangs | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Speech to Text Typo | ner , text-classification , question-answering , summarization , translation |
Robustness | Add Abbreviations | ner , text-classification , question-answering , summarization |
Robustness | Multiple Perturbations | text-classification , question-answering , summarization , translation |
Robustness | Adjective Synonym Swap | ner , text-classification , question-answering , summarization , translation |
Robustness | Adjective Antonym Swap | ner , text-classification , question-answering , summarization , translation |
Robustness | Strip All Punctution | ner , text-classification , question-answering , summarization , translation |
Robustness | Randomize Age | ner , text-classification , question-answering , summarization , translation |
Toxicity Tests
The primary goal of toxicity tests is to assess the ideological toxicity score of a given text, specifically targeting demeaning speech based on political, philosophical, or social beliefs. This includes evaluating instances of hate speech rooted in individual ideologies, such as feminism, left-wing politics, or right-wing politics
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Toxicity | Offensive | text-generation |
Toxicity | ideology | text-generation |
Toxicity | lgbtqphobia | text-generation |
Toxicity | racism | text-generation |
Toxicity | sexism | text-generation |
Toxicity | xenophobia | text-generation |
Sensitivity Tests
The primary objective of the sensitivity test is to assess the model’s responsiveness when introducing negation and toxic words, gauging its level of sensitivity in these scenarios.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Sensitivity | Negation | question-answering |
Sensitivity | Toxicity | question-answering |
Sycophancy Tests
The primary goal of addressing sycophancy in language models is to mitigate undesirable behaviors where models tailor their responses to align with a human user’s view, even when that view is not objectively correct.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Sycophancy | Sycophancy Math | question-answering |
Sycophancy | Sycophancy NLP | question-answering |
Stereotype Tests
The primary goal of stereotype tests is to evaluate how well models perform when confronted with common gender stereotypes, occupational stereotypes, or other prevailing biases. In these assessments, models are scrutinized for their propensity to perpetuate or challenge stereotypical associations, shedding light on their capacity to navigate and counteract biases in their predictions.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Stereotype | gender-occupational-stereotype | fill-mask |
Stereotype | common-stereotypes | fill-mask |
StereoSet Tests
The primary goal of StereoSet is to provide a comprehensive dataset and method for assessing bias in Language Models (LLMs). Utilizing pairs of sentences, StereoSet contrasts one sentence that embodies a stereotypic perspective with another that presents an anti-stereotypic view. This approach facilitates a nuanced evaluation of LLMs, shedding light on their sensitivity to and reinforcement or mitigation of stereotypical biases.”
Test Category | Test Name | Suppoted Tasks |
---|---|---|
StereoSet | intersentence | question-answering |
StereoSet | intrasentence | question-answering |
Ideology Tests
The Idology Test is a popular tool assessing political beliefs on a two-dimensional grid, surpassing the traditional left-right spectrum. The Political Compass aims for a nuanced understanding, avoiding oversimplification and capturing the full range of political opinions and beliefs.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Ideology | Political Compass | ideology |
Legal Tests
The primary goal of the Legal benchmark Test is to assess a model’s capacity to reason about the strength of support provided by a given case summary. This evaluation aims to gauge the model’s proficiency in legal reasoning and comprehension.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Legal | legal-support | question-answering |
Clinical Test
The Clinical Test evaluates the model for potential demographic bias in suggesting treatment plans for two patients with identical diagnoses. This assessment aims to uncover and address any disparities in the model’s recommendations based on demographic factors.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Clinical | demographic-bias | text-generation |
Security Test
The Security Test, featuring the Prompt Injection Attack, is designed to assess prompt injection vulnerabilities in Language Models (LLMs). This test specifically evaluates the model’s resilience against adversarial attacks, gauging its ability to handle sensitive information appropriately and ensuring robust security measures
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Security | prompt_injection_attack | text-generation |
Disinformation Test
The Disinformation Test aims to evaluate the model’s capacity to generate disinformation. By presenting the model with disinformation prompts, the experiment assesses whether the model produces content that aligns with the given input, providing insights into its susceptibility to generating misleading or inaccurate information.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Disinformation | Narrative Wedging | text-generation |
Factuality Test
The Factuality Test is designed to evaluate the ability of language models (LLMs) to determine the factuality of statements within summaries. This test is particularly relevant for assessing the accuracy of LLM-generated summaries and understanding potential biases that might affect their judgments.
Test Category | Test Name | Suppoted Tasks |
---|---|---|
Factuality | Order Bias | question-answering |