Data

 

data: dict

The Harness class accepts a data parameter, which can be specified as a dictionary with the following attributes.

{
   "data_source": "",
   "subset": "",
   "feature_column": "",
   "target_column": "",
   "split": "",
   "source": "huggingface"
}


| Key | Description |
|-----|-------------|
| data_source (mandatory) | Name of the dataset being used. |
| subset (optional) | Subset of the dataset being considered. |
| feature_column (optional) | Column that contains the input features. |
| target_column (optional) | Column that contains the target labels or categories. |
| split (optional) | Split of the dataset to be used. |
| source (optional) | Set to 'huggingface' when loading a Hugging Face dataset. |
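
For example, the same data dictionary can point either at a local file or at a Hugging Face dataset. A minimal sketch, reusing the text-classification example shown later on this page (the CSV file name is hypothetical):

# Import Harness from the LangTest library
from langtest import Harness

# Local file: only data_source is required ("sample.csv" is a hypothetical file)
data = {"data_source": "sample.csv"}

# Hugging Face dataset: the remaining keys select the subset, columns and split
data = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
    "source": "huggingface",
}

# The dictionary is passed to the Harness through the data parameter
harness = Harness(task="text-classification",
                  model={"model": "mrm8488/distilroberta-finetuned-tweets-hate-speech", "hub": "huggingface"},
                  data=data)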

Supported File formats

The following table provides an overview of the compatible data sources for each specific task.

| Task | Supported Data Inputs |
|------|-----------------------|
| ner | CoNLL, CSV and Hugging Face datasets |
| text-classification | CSV and Hugging Face datasets |
| question-answering | benchmark datasets, curated datasets, CSV and Hugging Face datasets |
| summarization | benchmark datasets, CSV and Hugging Face datasets |
| fill-mask | curated datasets |
| translation | curated datasets |
| text-generation | curated datasets |

Note: data_source formats are task and category dependent.

NER

There are three options for datasets to test NER models: CoNLL, CSV and HuggingFace datasets. Here are some details of what these may look like:

CoNLL Format for NER

LEICESTERSHIRE NNP B-NP B-ORG
TAKE           NNP I-NP O
OVER           IN  B-PP O
AT             NNP B-NP O
TOP            NNP I-NP O
AFTER          NNP I-NP O
INNINGS        NNP I-NP O
VICTORY        NNP I-NP O

CSV Format for NER

| Supported "text" column names | Supported "ner" column names | Supported "pos" column names | Supported "chunk" column names |
|-------------------------------|------------------------------|------------------------------|--------------------------------|
| text, sentences, sentence, sample | label, labels, class, classes, ner_tag, ner_tags, ner, entity | pos_tags, pos_tag, pos, part_of_speech | chunk_tags, chunk_tag |

Passing an NER Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub':'spacy'},
                  data={"data_source":'test.conll'},
                  config='config.yml')  # data_source can be a CoNLL file (as here) or a CSV file
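
A CSV file that follows the column conventions above can be passed in exactly the same way. A minimal sketch, where sample.csv is a hypothetical file with a 'text' column and a 'ner' column:

# Import Harness from the LangTest library
from langtest import Harness

# "sample.csv" is a hypothetical CSV file using the supported column names,
# e.g. a 'text' column for sentences and a 'ner' column for entity labels.
harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub': 'spacy'},
                  data={"data_source": 'sample.csv'},
                  config='config.yml')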

Passing a Hugging Face Dataset for NER to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="ner",
                  model={"model": "en_core_web_sm", "hub": "spacy"},
                  data={"data_source":'wikiann',
                  "subset":"en",
                  "feature_column":"tokens",
                  "target_column":'ner_tags',
                  "split":"test",
                  "source": "huggingface"
                  })

Text Classification

There are two options for datasets to test Text Classification models: CSV files, or Hugging Face datasets specified by name, subset, split, feature_column and target_column. Here are some details of what these may look like:

CSV Format for Text Classification

Here’s a sample dataset:

| text | label |
|------|-------|
| I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | 1 |
| Absolutely nothing is redeeming about this total piece of trash, and the only thing worse than seeing this film is seeing it in English class. This is literally one of the worst films I have ever seen. It totally ignores and contradicts any themes it may present, so the story is just really really dull. Thank god the 80’s are over, and god save whatever man was actually born as “James Bond III”. | 0 |

For CSV files, we support different variations of the column names. They are shown below:

| Supported "text" column names | Supported "label" column names |
|-------------------------------|--------------------------------|
| text, sentences, sentence, sample | label, labels, class, classes |

Passing a CSV Text Classification Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task='text-classification',
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
                  data={"data_source":'sample.csv'},
                  config='config.yml')

Passing a Hugging Face Dataset for Text Classification to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="text-classification",
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
                  data={"data_source":'glue',
                  "subset":"sst2",
                  "feature_column":"sentence",
                  "target_column":'label',
                  "split":"train",
                  "source": "huggingface"
                  })

Question Answering

The Question Answering task contains various test categories. By default, it supports robustness, accuracy, fairness, representation, and bias tests on benchmark datasets. However, access to a specific sub-task (category) within the question answering task depends on the dataset.

Supported test categories and their corresponding supported data inputs are outlined below:

Note: For bias, only data_source: BoolQ with split: bias is supported.

| Supported Test Categories | Supported Data |
|---------------------------|----------------|
| Robustness, Accuracy, Fairness, Representation, Grammar | Benchmark datasets, CSV, Hugging Face datasets |
| Bias | BoolQ (split: bias) |
| Factuality | Factual-Summary-Pairs |
| Ideology | Curated list |
| Legal | Legal-Support |
| Sensitivity | NQ-Open, OpenBookQA, wikiDataset |
| Stereoset | StereoSet |
| Sycophancy | synthetic-math-data, synthetic-nlp-data |

For the default Question Answering task, the user is meant to select a benchmark dataset. You can see the benchmarks page for all available benchmarks: Benchmarks. You can access the tutorial notebooks to get a quick start on your preferred dataset here: Dataset Notebooks

Passing a Question Answering Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="question-answering", 
                  model={"model": "gpt-3.5-turbo-instruct", "hub":"openai"}, 
                  data={"data_source" :"BBQ", "split":"test-tiny"}, config='config.yml')

Ideology

This test evaluates the model’s political orientation. There is one default dataset used for this test.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Ideology Compass Questions | 3 Axis Political Compass Test | Political Compass questions, containing 40 questions for 2 axes. | Open In Colab |

Passing an Ideology Dataset to the Harness

In the ideology test, the data is loaded automatically, since only one dataset is available for now:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"question-answering", "category":"ideology"}, 
            model={'model': "gpt-3.5-turbo-instruct", "hub": "openai"})

Factuality

The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. Users should choose a benchmark dataset from the provided list.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Factual-Summary-Pairs | LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper | Factual-Summary-Pairs, containing 371 labeled examples. | Open In Colab |

Passing a Factuality Test Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "question-answering", "category": "factuality"},
                  model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
                  data={"data_source": "Factual-Summary-Pairs"})

Legal

The Legal test assesses LLMs’ ability to discern the level of support provided by various case summaries for a given legal claim.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| legal-support | Legal Support Scenario | The legal-support dataset includes 100 labeled examples designed to evaluate models’ performance in discerning the level of support provided by different case summaries for a given legal claim. | Open In Colab |

Passing a Legal-Support Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "question-answering", "category": "legal"},
                  model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
                  data={"data_source": "legal-support"})

Sensitivity

The Sensitivity Test comprises two distinct evaluations: one assesses a model’s responsiveness to toxicity when toxic words are introduced into the input text, and the other gauges its sensitivity to negations inserted after verbs such as “is,” “was,” “are,” and “were.” Users should choose a benchmark dataset from the provided list.

Test and Dataset Compatibility

| Test Name | Supported Dataset | Split | Notebook |
|-----------|-------------------|-------|----------|
| Add Toxic Words | wikiDataset | test, test-tiny | Open In Colab |
| Add Negation | NQ-open | test, test-tiny, combined | Open In Colab |
| Add Negation | OpenBookQA | test, test-tiny | Open In Colab |

Passing a Sensitivity Test Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "question-answering", "category": "sensitivity"},
                  model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
                  data={"data_source": "NQ-open", "split": "test-tiny"})

Stereoset

The StereoSet test is designed to measure stereotypical biases in LLMs across four domains: gender, profession, race, and religion. The dataset consists of pairs of sentences, with one sentence being more stereotypical and the other being anti-stereotypical.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| StereoSet | StereoSet: Measuring stereotypical bias in pretrained language models | StereoSet contains 4229 samples. The dataset uses pairs of sentences, where one is more stereotypical and the other is anti-stereotypical. | Open In Colab |

Passing a StereoSet Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(
    task={"task": "question-answering", "category": "stereoset"},
    model={"model": "bert-base-uncased", "hub": "huggingface"},
    data={"data_source": "StereoSet"})

Sycophancy

Sycophancy is an undesirable behavior where models tailor their responses to align with a human user’s view even when that view is not objectively correct. Here, a simple synthetic data intervention is used to reduce this behavior in language models.

Test and Dataset Compatibility

| Test Name | Supported Dataset | Notebook |
|-----------|-------------------|----------|
| sycophancy_math | synthetic-math-data | Open In Colab |
| sycophancy_nlp | synthetic-nlp-data | Open In Colab |

Passing a Sycophancy Math Dataset to the Harness

In the Harness, we specify the data input in the following way:

import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"question-answering", "category":"sycophancy"},
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source": 'synthetic-math-data',})

Summarization

To test Summarization models, the user is meant to select a benchmark dataset from the available ones: Benchmarks. You can access the tutorial notebooks to get a quick start with your preferred dataset here: Dataset Notebooks

Note: For bias we only support data_source:BoolQ and split:bias

Passing a Summarization Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="summarization", 
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source" :"XSum", "split":"test-tiny"},
                  config='config.yml')
   

Passing a Hugging Face Dataset for Summarization to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="summarization", 
                  model={'model': 'gpt-3.5-turbo-instruct', 'hub':'openai'}, 
                  data={"data_source":'samsum',
                  "feature_column":"dialogue",
                  "target_column":'summary',
                  "split":"test",
                  "source": "huggingface"
                  })

Fill Mask

The Fill Mask task currently supports only the Stereotype test category. Accessing a specific test within the Stereotype category depends on the dataset. The supported test category and its corresponding data inputs are outlined below:

| Supported Test Category | Supported Data |
|-------------------------|----------------|
| Stereotype | Wino-test, Crows-Pairs |

Stereotype

Stereotype tests play a crucial role in assessing the performance of models when it comes to common gender stereotypes and occupational biases.

| Test Name | Supported Dataset | Notebook |
|-----------|-------------------|----------|
| wino-bias | Wino-test | Open In Colab |
| crows-pairs | Crows-Pairs | Open In Colab |

Passing a Wino Bias Dataset to the Harness

In the Harness, we specify the data input in the following way:


# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "fill-mask", "category": "wino-bias"},
                  model={"model": "bert-base-uncased", "hub": "huggingface"},
                  data={"data_source": "Wino-test"})

Passing a Crows Pairs Dataset to the Harness


# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "fill-mask", "category": "crows-pairs"},
                  model={"model": "bert-base-uncased", "hub": "huggingface"},
                  data={"data_source": "Crows-Pairs"})

Text Generation

The Text Generation task contains various test categories. Accessing a specific sub-task (category) within the text generation task depends on the dataset. Supported test categories and their corresponding supported data inputs are outlined below:

| Supported Test Category | Supported Data |
|-------------------------|----------------|
| Clinical | Medical-files, Gastroenterology-files, Oromaxillofacial-files |
| Disinformation | Narrative-Wedging |
| Security | Prompt-Injection-Attack |
| Toxicity | Real Toxicity Prompts |

Clinical

The Clinical test assesses LLMs’ capability to detect demographic bias, which involves unfair treatment based on factors like age, gender, or race, regardless of patients’ medical conditions.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Medical-files | curated dataset | Medical-files, containing 49 labeled examples. | Open In Colab |
| Gastroenterology-files | curated dataset | Gastroenterology-files, containing 49 labeled examples. | Open In Colab |
| Oromaxillofacial-files | curated dataset | Oromaxillofacial-files, containing 49 labeled examples. | Open In Colab |

Passing a Clinical Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}

data = {"data_source": "Clinical", "split":"Medical-files"}

task = {"task": "text-generation", "category": "clinical"},

harness = Harness(task=task, model=model, data=data)

Disinformation

This test evaluates the model’s disinformation generation capability. Users should choose a benchmark dataset from the provided list.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Narrative-Wedging | Truth, Lies, and Automation: How Language Models Could Change Disinformation | Narrative-Wedging dataset, containing 26 labeled examples. | Open In Colab |

Passing a Disinformation Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "text-generation", "category": "disinformation"},
                  model={"model": "j2-jumbo-instruct", "hub": "ai21"},
                  data={"data_source": "Narrative-Wedging"})

Security

The Security Test assesses LLMs’ capability to identify and mitigate prompt injection vulnerabilities, which involve malicious prompts attempting to extract personal information or launch attacks on databases.

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Prompt-Injection-Attack | curated dataset | Prompt-Injection-Attack, containing 17 examples. | Open In Colab |

Passing a Security Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}

data = {"data_source": "Prompt-Injection-Attack", "split": "test"}

task = {"task": "text-generation", "category": "security"}

harness = Harness(task=task, model=model, data=data)

Toxicity

This test checks the toxicity of model completions. The user is meant to select a benchmark dataset from the following list:

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Toxicity | Real Toxicity Prompts | Truncated set from the Real Toxicity Prompts Dataset, containing 80 examples. | Open In Colab |

Passing a Toxicity Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"text-generation", "category":"toxicity"}, 
                  model={"model": "gpt-3.5-turbo-instruct","hub":"openai"}, 
                  data={"data_source" :'Toxicity', "split":"test"})

Translation

Datasets

| Dataset | Source | Description | Notebook |
|---------|--------|-------------|----------|
| Translation | Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond | Translation, containing 4400 examples. | Open In Colab |

Passing a Translation Dataset to the Harness

In the Harness, we specify the data input in the following way:


# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="translation",
                  model={"model":'t5-base', "hub": "huggingface"},
                  data={"data_source": "Translation"})