data: dict
The Harness class accepts a data parameter, which can be specified as a dictionary
with the following attributes. A short example combining these keys is sketched after the table below.
{
  "data_source": "",
  "subset": "",
  "feature_column": "",
  "target_column": "",
  "split": "",
  "source": "huggingface"
}
Key | Description |
---|---|
data_source (mandatory) | Represents the name of the dataset being used. |
subset (optional) | Indicates the subset of the dataset being considered. |
feature_column (optional) | Specifies the column that contains the input features. |
target_column (optional) | Represents the column that contains the target labels or categories. |
split (optional) | Denotes which split of the dataset should be used. |
source (optional) | Set to ‘huggingface’ when loading a Hugging Face dataset. |
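For illustration, here is how these keys are typically combined (the values are taken from the NER examples later on this page). A local file needs only data_source, while the remaining keys apply when loading a Hugging Face dataset:
# Local file: only "data_source" is required
data = {"data_source": "test.conll"}
# Hugging Face dataset: the remaining keys describe how to load it
data = {"data_source": "wikiann",
        "subset": "en",
        "feature_column": "tokens",
        "target_column": "ner_tags",
        "split": "test",
        "source": "huggingface"}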
Supported File formats
The following table provides an overview of the compatible data sources for each specific task.
Task | Supported Data Inputs |
---|---|
ner | CoNLL, CSV and HuggingFace Datasets |
text-classification | CSV and HuggingFace Datasets |
question-answering | benchmark datasets, curated datasets, CSV, HuggingFace Datasets |
summarization | benchmark datasets, CSV, HuggingFace Datasets |
fill-mask | curated datasets |
translation | curated datasets |
text-generation | curated datasets |
Note: data_source formats are task and category dependent.
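Concretely, the task argument takes two forms in the examples below: a plain string for default tasks, and a dictionary when a specific category is targeted:
task = "ner"                                                      # default task
task = {"task": "question-answering", "category": "factuality"}  # task with a category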
NER
There are three options for datasets to test NER models: CoNLL, CSV and HuggingFace datasets. Here are some details of what these may look like:
CoNLL Format for NER
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NNP I-NP O
CSV Format for NER
Supported “text” column names | Supported “ner” column names | Supported “pos” column names | Supported “chunk” column names |
---|---|---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’, ‘ner_tag’, ‘ner_tags’, ‘ner’, ‘entity’] | [‘pos_tags’, ‘pos_tag’, ‘pos’, ‘part_of_speech’] | [‘chunk_tags’, ‘chunk_tag’] |
Passing a NER Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub': 'spacy'},
                  data={"data_source": 'test.conll'},  # either of the two file formats (CoNLL or CSV) can be specified here
                  config='config.yml')
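Once the Harness is created, the typical LangTest workflow is to generate test cases, run them against the model and inspect the report; a minimal sketch:
harness.generate()   # generate test cases from the loaded data
harness.run()        # run the generated test cases against the model
harness.report()     # summarize the results per test type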
Passing a Hugging Face Dataset for NER to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="ner",
model={"model": "en_core_web_sm", "hub": "spacy"},
data={"data_source":'wikiann',
"subset":"en",
"feature_column":"tokens",
"target_column":'ner_tags',
"split":"test",
"source": "huggingface"
})
Text Classification
There are two options for datasets to test Text Classification models: CSV datasets, or HuggingFace Datasets loaded by specifying the dataset name, subset, split, feature_column and target_column. Here are some details of what these may look like:
CSV Format for Text Classification
Here’s a sample dataset:
text | label |
---|---|
I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | 1 |
Absolutely nothing is redeeming about this total piece of trash, and the only thing worse than seeing this film is seeing it in English class. This is literally one of the worst films I have ever seen. It totally ignores and contradicts any themes it may present, so the story is just really really dull. Thank god the 80’s are over, and god save whatever man was actually born as “James Bond III”. | 0 |
For CSV files, we support different variations of the column names. They are shown below (a short sample file is sketched after the table):
Supported “text” column names | Supported “label” column names |
---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’] |
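For instance, the sample.csv file used in the example below could look like this (a hypothetical file, rows truncated for brevity):
text,label
"I thoroughly enjoyed Manna from Heaven. The hopes and dreams ...",1
"Absolutely nothing is redeeming about this total piece of trash ...",0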
Passing a CSV Text Classification Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='text-classification',
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub': 'huggingface'},
                  data={"data_source": 'sample.csv'},
                  config='config.yml')
Passing a Hugging Face Dataset for Text Classification to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="text-classification",
model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
data={"data_source":'glue',
"subset":"sst2",
"feature_column":"sentence",
"target_column":'label',
"split":"train",
"source": "huggingface"
})
Question Answering
The Question Answering task contains various test categories. By default, it supports robustness, accuracy, fairness, representation and bias tests on benchmark datasets. However, access to a specific sub-task (category) within the question answering task is data-dependent.
Supported test categories and their corresponding supported data inputs are outlined below:
Note: For bias, we only support data_source: BoolQ and split: bias.
Supported Test Categories | Supported Data |
---|---|
Robustness, Accuracy, Fairness, Representation, Grammar | Benchmark datasets, CSV, HuggingFace Datasets |
Bias | BoolQ (split: bias) |
Factuality | Factual-Summary-Pairs |
Ideology | Curated list |
Legal | Legal-Support |
Sensitivity | NQ-open, OpenBookQA, wikiDataset |
Stereoset | StereoSet |
Sycophancy | synthetic-math-data, synthetic-nlp-data |
For the default Question Answering task, the user is meant to select a benchmark dataset. You can see the benchmarks page for all available benchmarks: Benchmarks. You can access the tutorial notebooks to get a quick start on your preferred dataset here: Dataset Notebooks
Passing a Question Answering Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="question-answering",
model={"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data={"data_source" :"BBQ", "split":"test-tiny"}, config='config.yml')
Ideology
This test evaluates the model’s political orientation. There is one default dataset used for this test.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Ideology Compass Questions | 3 Axis Political Compass Test | Political Compass questions, containing 40 questions for 2 axes. | |
Passing an Ideology Dataset to the Harness
In the ideology test, the data is loaded automatically, since there is only one dataset available for now:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"ideology"},
model={'model': "gpt-3.5-turbo-instruct", "hub": "openai"})
Factuality
The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Factual-Summary-Pairs | LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper | Factual-Summary-Pairs, containing 371 labeled examples. | |
Passing a Factuality Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"factuality"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data = {"data_source": "Factual-Summary-Pairs"})
Legal
The Legal test assesses LLMs’ ability to discern the level of support provided by various case summaries for a given legal claim.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
legal-support | Legal Support Scenario | The legal-support dataset includes 100 labeled examples designed to evaluate models’ performance in discerning the level of support provided by different case summaries for a given legal claim. | |
Passing a Legal Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"legal"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data = {"data_source": "legal-support"})
Sensitivity
The Sensitivity Test comprises two distinct evaluations: one assesses a model’s responsiveness to toxicity when toxic words are introduced into the input text, and the other gauges its sensitivity to negations inserted after verbs such as “is,” “was,” “are,” and “were”. Users should choose a benchmark dataset from the provided list.
Test and Dataset Compatibility
Test Name | Supported Dataset | split | Notebook |
---|---|---|---|
Add Toxic Words | wikiDataset | test, test-tiny | |
Add Negation | NQ-open | test, test-tiny, combined | |
Add Negation | OpenBookQA | test, test-tiny | |
Passing a Sensitivity Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"sensitivity"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data={"data_source" :"NQ-open","split":"test-tiny"})
Stereoset
The StereoSet test is designed to measure stereotypical biases in LLMs across four domains: gender, profession, race, and religion. The dataset consists of pairs of sentences, with one sentence being more stereotypical and the other being anti-stereotypical.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
StereoSet | StereoSet: Measuring stereotypical bias in pretrained language models | The StereoSet dataset contains 4229 samples, given as pairs of sentences where one is more stereotypical and the other is anti-stereotypical. | |
Passing a StereoSet Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "question-answering", "category": "stereoset"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "StereoSet"})
Sycophancy
Sycophancy is an undesirable behavior where models tailor their responses to align with a human user’s view even when that view is not objectively correct. This test uses simple synthetic data to evaluate this behavior in language models.
Test and Dataset Compatibility
Test Name | Supported Dataset | Notebook |
---|---|---|
sycophancy_math | synthetic-math-data | |
sycophancy_nlp | synthetic-nlp-data | |
Passing a Sycophancy Math Dataset to the Harness
In the Harness, we specify the data input in the following way:
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"sycophancy"},
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source": 'synthetic-math-data',})
Summarization
To test Summarization models, the user is meant to select a benchmark dataset from the available ones: Benchmarks. You can access the tutorial notebooks to get a quick start with your preferred dataset here: Dataset Notebooks
Note: For bias, we only support data_source: BoolQ and split: bias.
Passing a Summarization Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source" :"XSum", "split":"test-tiny"},
config='config.yml')
Passing a Hugging Face Dataset for Summarization to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
model={'model': 'gpt-3.5-turbo-instruct', 'hub':'openai'},
data={"data_source":'samsum',
"feature_column":"dialogue",
"target_column":'summary',
"split":"test",
"source": "huggingface"
})
Fill Mask
The Fill Mask task currently supports only the Stereotype test category. Accessing a specific test within the Stereotype category depends on the dataset. The supported test category and its corresponding data inputs are outlined below:
Supported Test Category | Supported Data |
---|---|
Stereotype | Wino-test, Crows-Pairs |
Stereotype
Stereotype tests play a crucial role in assessing the performance of models when it comes to common gender stereotypes and occupational biases.
Test Name | Supported Dataset | Notebook |
---|---|---|
wino-bias | Wino-test | |
crows-pairs | Crows-Pairs |
Passing a Wino Bias Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "fill-mask", "category": "wino-bias"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "Wino-test"}
)
Passing a Crows Pairs Dataset to the Harness
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "fill-mask", "category": "crows-pairs"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "Crows-Pairs"}
)
Text Generation
The Text Generation task contains various test categories. Accessing a specific sub-task (category) within the text generation task depends on the dataset. Supported test categories and their corresponding supported data inputs are outlined below:
Supported Test Category | Supported Data |
---|---|
Clinical | Medical-files, Gastroenterology-files, Oromaxillofacial-files |
Disinformation | Narrative-Wedging |
Security | Prompt-Injection-Attack |
Toxicity | Real Toxicity Prompts |
Clinical
The Clinical test assesses LLMs for demographic bias, which involves unfair treatment based on factors like age, gender, or race, regardless of patients’ medical conditions.
Datasets
The clinical data is curated and is selected with data_source: Clinical together with one of the splits Medical-files, Gastroenterology-files or Oromaxillofacial-files, as shown below.
Passing a Clinical Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}
data = {"data_source": "Clinical", "split":"Medical-files"}
task = {"task": "text-generation", "category": "clinical"},
harness = Harness(task=task, model=model, data=data)
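The other curated splits listed above can be selected the same way, for example:
data = {"data_source": "Clinical", "split": "Gastroenterology-files"}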
Disinformation
This test evaluates the model’s disinformation generation capability. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Narrative-Wedging | Truth, Lies, and Automation: How Language Models Could Change Disinformation | Narrative-Wedging dataset, containing 26 labeled examples. | |
Passing a Disinformation Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"text-generation", "category":"disinformation"},
model={"model": "j2-jumbo-instruct", "hub":"ai21"},
data = {"data_source": "Narrative-Wedging"})
Security
The Security Test assesses LLMs’ capability to identify and mitigate prompt injection vulnerabilities, which involve malicious prompts attempting to extract personal information or launch attacks on databases.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Prompt-Injection-Attack | curated dataset | Prompt-Injection-Attack, containing 17 examples. | |
Passing a Security Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
model={'model': "gpt-3.5-turbo-instruct", "hub": "openai"}
data = {"data_source": "Prompt-Injection-Attack", "split":"test"}
task={"task": "text-generation", "category": "security"}
harness = Harness(task=task, model=model, data=data)
Toxicity
This test checks the toxicity of model completions. The user is meant to select a benchmark dataset from the following list:
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Toxicity | Real Toxicity Prompts | Truncated set from the Real Toxicity Prompts Dataset, containing 80 examples. | |
Passing a Toxicity Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"text-generation", "category":"toxicity"},
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source" :'Toxicity', "split":"test"})
Translation
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Translation | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Translation, containing 4400 examples. | |
Passing a Translation Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="translation",
model={"model":'t5-base', "hub": "huggingface"},
data={"data_source": "Translation"})