data: dict
The Harness class accepts a data parameter, which can be specified as a dictionary
with the following attributes. A short example combining these keys is sketched after the table below.
{
  "data_source": "",
  "subset": "",
  "feature_column": "",
  "target_column": "",
  "split": "",
  "source": "huggingface"
}
Key | Description |
---|---|
data_source (mandatory) | Represents the name of the dataset being used. |
subset (optional) | Indicates the subset of the dataset being considered. |
feature_column (optional) | Specifies the column that contains the input features. |
target_column (optional) | Represents the column that contains the target labels or categories. |
split (optional) | Denotes which split of the dataset should be used. |
source (optional) | Set to ‘huggingface’ when loading a Hugging Face dataset. |
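For illustration, here is how these keys are typically combined (the values are taken from the NER examples later on this page). A local file needs only data_source, while the remaining keys apply when loading a Hugging Face dataset:
# Local file: only "data_source" is required
data = {"data_source": "test.conll"}
# Hugging Face dataset: the remaining keys describe how to load it
data = {"data_source": "wikiann",
        "subset": "en",
        "feature_column": "tokens",
        "target_column": "ner_tags",
        "split": "test",
        "source": "huggingface"}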
Supported File formats
The following table provides an overview of the compatible data sources for each specific task.
Task | Supported Data Inputs |
---|---|
ner | CoNLL, CSV and HuggingFace Datasets |
text-classification | CSV and HuggingFace Datasets |
question-answering | benchmark datasets, curated datasets, CSV, HuggingFace Datasets |
summarization | benchmark datasets, CSV, HuggingFace Datasets |
fill-mask | curated datasets |
translation | curated datasets |
text-generation | curated datasets |
Note: data_source formats are task and category dependent.
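Concretely, the task argument takes two forms in the examples below: a plain string for default tasks, and a dictionary when a specific category is targeted:
task = "ner"                                                      # default task
task = {"task": "question-answering", "category": "factuality"}  # task with a category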
NER
There are three options for datasets to test NER models: CoNLL, CSV and HuggingFace datasets. Here are some details of what these may look like:
CoNLL Format for NER
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NNP I-NP O
CSV Format for NER
Supported “text” column names | Supported “ner” column names | Supported “pos” column names | Supported “chunk” column names |
---|---|---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’, ‘ner_tag’, ‘ner_tags’, ‘ner’, ‘entity’] | [‘pos_tags’, ‘pos_tag’, ‘pos’, ‘part_of_speech’] | [‘chunk_tags’, ‘chunk_tag’] |
Passing a NER Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub': 'spacy'},
                  data={"data_source": 'test.conll'},  # either of the two file formats (CoNLL or CSV) can be specified here
                  config='config.yml')
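Once the Harness is created, the typical LangTest workflow is to generate test cases, run them against the model and inspect the report; a minimal sketch:
harness.generate()   # generate test cases from the loaded data
harness.run()        # run the generated test cases against the model
harness.report()     # summarize the results per test type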
Passing a Hugging Face Dataset for NER to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="ner",
model={"model": "en_core_web_sm", "hub": "spacy"},
data={"data_source":'wikiann',
"subset":"en",
"feature_column":"tokens",
"target_column":'ner_tags',
"split":"test",
"source": "huggingface"
})
Text Classification
There are two options for datasets to test Text Classification models: CSV datasets, or HuggingFace Datasets loaded by specifying the dataset name, subset, split, feature_column and target_column. Here are some details of what these may look like:
CSV Format for Text Classification
Here’s a sample dataset:
text | label |
---|---|
I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | 1 |
Absolutely nothing is redeeming about this total piece of trash, and the only thing worse than seeing this film is seeing it in English class. This is literally one of the worst films I have ever seen. It totally ignores and contradicts any themes it may present, so the story is just really really dull. Thank god the 80’s are over, and god save whatever man was actually born as “James Bond III”. | 0 |
For CSV files, we support different variations of the column names. They are shown below (a short sample file is sketched after the table):
Supported “text” column names | Supported “label” column names |
---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’] |
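For instance, the sample.csv file used in the example below could look like this (a hypothetical file, rows truncated for brevity):
text,label
"I thoroughly enjoyed Manna from Heaven. The hopes and dreams ...",1
"Absolutely nothing is redeeming about this total piece of trash ...",0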
Passing a CSV Text Classification Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='text-classification',
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub': 'huggingface'},
                  data={"data_source": 'sample.csv'},
                  config='config.yml')
Passing a Hugging Face Dataset for Text Classification to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="text-classification",
model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
data={"data_source":'glue',
"subset":"sst2",
"feature_column":"sentence",
"target_column":'label',
"split":"train",
"source": "huggingface"
})
Question Answering
The Question Answering task contains various test categories. By default, it supports robustness, accuracy, fairness, representation and bias tests on benchmark datasets. However, access to a specific sub-task (category) within the question answering task is data-dependent.
Supported test categories and their corresponding supported data inputs are outlined below:
Note: For bias, we only support data_source: BoolQ and split: bias.
Supported Test Categories | Supported Data |
---|---|
Robustness, Accuracy, Fairness, Representation, Grammar | Benchmark datasets, CSV, HuggingFace Datasets |
Bias | BoolQ (split: bias) |
Factuality | Factual-Summary-Pairs |
Ideology | Curated list |
Legal | Legal-Support |
Sensitivity | NQ-open, OpenBookQA, wikiDataset |
Stereoset | StereoSet |
Sycophancy | synthetic-math-data, synthetic-nlp-data |
For the default Question Answering task, the user is meant to select a benchmark dataset. You can see the benchmarks page for all available benchmarks: Benchmarks. You can access the tutorial notebooks to get a quick start on your preferred dataset here: Dataset Notebooks
Passing a Question Answering Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="question-answering",
model={"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data={"data_source" :"BBQ", "split":"test-tiny"}, config='config.yml')
Ideology
This test evaluates the model’s political orientation. There is one default dataset used for this test.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Ideology Compass Questions | 3 Axis Political Compass Test | Political Compass questions, containing 40 questions for 2 axes. | |
Passing an Ideology Dataset to the Harness
In the ideology test, the data is loaded automatically, since there is only one dataset available for now:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"ideology"},
model={'model': "gpt-3.5-turbo-instruct", "hub": "openai"})
Factuality
The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Factual-Summary-Pairs | LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper | Factual-Summary-Pairs, containing 371 labeled examples. | |
Passing a Factuality Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"factuality"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data = {"data_source": "Factual-Summary-Pairs"})
Legal
The Legal test assesses LLMs’ ability to discern the level of support provided by various case summaries for a given legal claim.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
legal-support | Legal Support Scenario | The legal-support dataset includes 100 labeled examples designed to evaluate models’ performance in discerning the level of support provided by different case summaries for a given legal claim. | |
Passing a Legal Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"legal"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data = {"data_source": "legal-support"})
Sensitivity
The Sensitivity Test comprises two distinct evaluations: one assesses a model’s responsiveness to toxicity when toxic words are introduced into the input text, and the other gauges its sensitivity to negations inserted after verbs such as “is,” “was,” “are,” and “were”. Users should choose a benchmark dataset from the provided list.
Test and Dataset Compatibility
Test Name | Supported Dataset | split | Notebook |
---|---|---|---|
Add Toxic Words | wikiDataset | test, test-tiny | |
Add Negation | NQ-open | test, test-tiny, combined | |
Add Negation | OpenBookQA | test, test-tiny | |
Passing a Sensitivity Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"sensitivity"},
model = {"model": "gpt-3.5-turbo-instruct", "hub":"openai"},
data={"data_source" :"NQ-open","split":"test-tiny"})
Stereoset
The StereoSet test is designed to measure stereotypical biases in LLMs across four domains: gender, profession, race, and religion. The dataset consists of pairs of sentences, with one sentence being more stereotypical and the other being anti-stereotypical.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
StereoSet | StereoSet: Measuring stereotypical bias in pretrained language models | The StereoSet dataset contains 4229 samples, given as pairs of sentences where one is more stereotypical and the other is anti-stereotypical. | |
Passing a StereoSet Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "question-answering", "category": "stereoset"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "StereoSet"})
Sycophancy
Sycophancy is an undesirable behavior where models tailor their responses to align with a human user’s view even when that view is not objectively correct. This test uses simple synthetic data to evaluate this behavior in language models.
Test and Dataset Compatibility
Test Name | Supported Dataset | Notebook |
---|---|---|
sycophancy_math | synthetic-math-data | |
sycophancy_nlp | synthetic-nlp-data | |
Passing a Sycophancy Math Dataset to the Harness
In the Harness, we specify the data input in the following way:
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"question-answering", "category":"sycophancy"},
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source": 'synthetic-math-data',})
Summarization
To test Summarization models, the user is meant to select a benchmark dataset from the available ones: Benchmarks. You can access the tutorial notebooks to get a quick start with your preferred dataset here: Dataset Notebooks
Note: For bias, we only support data_source: BoolQ and split: bias.
Passing a Summarization Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source" :"XSum", "split":"test-tiny"},
config='config.yml')
Passing a Hugging Face Dataset for Summarization to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
model={'model': 'gpt-3.5-turbo-instruct', 'hub':'openai'},
data={"data_source":'samsum',
"feature_column":"dialogue",
"target_column":'summary',
"split":"test",
"source": "huggingface"
})
Fill Mask
The Fill Mask task currently supports only the Stereotype test category. Accessing a specific test within the Stereotype category depends on the dataset. The supported test category and its corresponding data inputs are outlined below:
Supported Test Category | Supported Data |
---|---|
Stereotype | Wino-test, Crows-Pairs |
Stereotype
Stereotype tests play a crucial role in assessing the performance of models when it comes to common gender stereotypes and occupational biases.
Test Name | Supported Dataset | Notebook |
---|---|---|
wino-bias | Wino-test | |
crows-pairs | Crows-Pairs |
Passing a Wino Bias Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "fill-mask", "category": "wino-bias"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "Wino-test"}
)
Passing a Crows Pairs Dataset to the Harness
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(
    task={"task": "fill-mask", "category": "crows-pairs"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "Crows-Pairs"}
)
Text Generation
The Text Generation task contains various test categories. Accessing a specific sub-task (category) within the text generation task depends on the dataset. Supported test categories and their corresponding supported data inputs are outlined below:
Supported Test Category | Supported Data |
---|---|
Clinical | Medical-files, Gastroenterology-files, Oromaxillofacial-files |
Disinformation | Narrative-Wedging |
Security | Prompt-Injection-Attack |
Toxicity | Real Toxicity Prompts |
Clinical
The Clinical test assesses LLMs for demographic bias, which involves unfair treatment based on factors like age, gender, or race, regardless of patients’ medical conditions.
Datasets
The clinical data is curated and is selected with data_source: Clinical together with one of the splits Medical-files, Gastroenterology-files or Oromaxillofacial-files, as shown below.
Passing a Clinical Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}
data = {"data_source": "Clinical", "split":"Medical-files"}
task = {"task": "text-generation", "category": "clinical"},
harness = Harness(task=task, model=model, data=data)
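The other curated splits listed above can be selected the same way, for example:
data = {"data_source": "Clinical", "split": "Gastroenterology-files"}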
Disinformation
This test evaluates the model’s disinformation generation capability. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Narrative-Wedging | Truth, Lies, and Automation: How Language Models Could Change Disinformation | Narrative-Wedging dataset, containing 26 labeled examples. | |
Passing a Disinformation Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"text-generation", "category":"disinformation"},
model={"model": "j2-jumbo-instruct", "hub":"ai21"},
data = {"data_source": "Narrative-Wedging"})
Security
The Security Test assesses LLMs’ capability to identify and mitigate prompt injection vulnerabilities, which involve malicious prompts attempting to extract personal information or launch attacks on databases.
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Prompt-Injection-Attack | curated dataset | Prompt-Injection-Attack, containing 17 examples. | |
Passing a Security Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
model={'model': "gpt-3.5-turbo-instruct", "hub": "openai"}
data = {"data_source": "Prompt-Injection-Attack", "split":"test"}
task={"task": "text-generation", "category": "security"}
harness = Harness(task=task, model=model, data=data)
Toxicity
This test checks the toxicity of model completions. The user is meant to select a benchmark dataset from the following list:
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Toxicity | Real Toxicity Prompts | Truncated set from the Real Toxicity Prompts Dataset, containing 80 examples. | |
Passing a Toxicity Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task":"text-generation", "category":"toxicity"},
model={"model": "gpt-3.5-turbo-instruct","hub":"openai"},
data={"data_source" :'Toxicity', "split":"test"})
Translation
Datasets
Dataset | Source | Description | Notebook |
---|---|---|---|
Translation | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Translation, containing 4400 examples. | |
Passing a Translation Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="translation",
model={"model":'t5-base', "hub": "huggingface"},
data={"data_source": "Translation"})