Data

The Harness class accepts a data parameter, which can be specified as a dictionary with the following attributes:

{
   "data_source": "",
   "subset": "",
   "feature_column": "",
   "target_column": "",
   "split": "",
   "source": "huggingface"
}


| Key | Description |
|-----|-------------|
| data_source (mandatory) | Represents the name of the dataset being used. |
| subset (optional) | Indicates the subset of the dataset being considered. |
| feature_column (optional) | Specifies the column that contains the input features. |
| target_column (optional) | Represents the column that contains the target labels or categories. |
| split (optional) | Denotes which split of the dataset should be used. |
| source (optional) | Set to 'huggingface' when loading a Hugging Face dataset. |
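
For example, a local file and a Hugging Face dataset could be described as follows (a minimal sketch; the values simply mirror the examples shown later on this page):

# A local file only needs data_source
local_data = {"data_source": "test.conll"}

# A Hugging Face dataset spells out the name, subset, columns, split and source
hf_data = {
    "data_source": "wikiann",
    "subset": "en",
    "feature_column": "tokens",
    "target_column": "ner_tags",
    "split": "test",
    "source": "huggingface"
}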

Supported data_source formats are task-dependent. The following table provides an overview of the compatible data sources for each specific task.

| Task | Supported Data Inputs |
|------|-----------------------|
| ner | CoNLL, CSV and HuggingFace Datasets |
| text-classification | CSV and HuggingFace Datasets |
| question-answering | Select list of benchmark datasets or HuggingFace Datasets |
| summarization | Select list of benchmark datasets or HuggingFace Datasets |
| toxicity | Select list of benchmark datasets |
| clinical-tests | Select list of curated datasets |
| disinformation-test | Select list of curated datasets |
| political | Select list of curated datasets |
| factuality-test | Select list of curated datasets |
| sensitivity-test | Select list of curated datasets |

NER

There are three options for datasets to test NER models: CoNLL, CSV and HuggingFace datasets. Here are some details of what these may look like:

CoNLL Format for NER

LEICESTERSHIRE NNP B-NP B-ORG
TAKE           NNP I-NP O
OVER           IN  B-PP O
AT             NNP B-NP O
TOP            NNP I-NP O
AFTER          NNP I-NP O
INNINGS        NNP I-NP O
VICTORY        NNP I-NP O

CSV Format for NER

| Supported "text" column names | Supported "ner" column names | Supported "pos" column names | Supported "chunk" column names |
|---|---|---|---|
| ['text', 'sentences', 'sentence', 'sample'] | ['label', 'labels', 'class', 'classes', 'ner_tag', 'ner_tags', 'ner', 'entity'] | ['pos_tags', 'pos_tag', 'pos', 'part_of_speech'] | ['chunk_tags', 'chunk_tag'] |

Passing a NER Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub':'spacy'},
                  data={"data_source":'test.conll'},
                  config='config.yml') # data_source can point to either a CoNLL or a CSV file
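
Once the Harness is created, the usual LangTest workflow is to generate test cases, run them against the model and summarize the results. A minimal sketch of that follow-up:

# Generate perturbed test cases from the loaded dataset
harness.generate()

# Run the generated test cases against the model
harness.run()

# Aggregate pass/fail results into a report
harness.report()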

Passing a Hugging Face Dataset for NER to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="ner",
                  model={"model": "en_core_web_sm", "hub": "spacy"},
                  data={"data_source":'wikiann',
                  "subset":"en",
                  "feature_column":"tokens",
                  "target_column":'ner_tags',
                  "split":"test",
                  "source": "huggingface"
                  })

Text Classification

There are two options for datasets to test Text Classification models: CSV files or Hugging Face datasets, the latter specified by the dataset name, subset, split, feature_column and target_column. Here are some details of what these may look like:

CSV Format for Text Classification

Here’s a sample dataset:

| text | label |
|------|-------|
| I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | 1 |
| Absolutely nothing is redeeming about this total piece of trash, and the only thing worse than seeing this film is seeing it in English class. This is literally one of the worst films I have ever seen. It totally ignores and contradicts any themes it may present, so the story is just really really dull. Thank god the 80’s are over, and god save whatever man was actually born as “James Bond III”. | 0 |

For CSV files, we support different variations of the column names. They are shown below:

| Supported "text" column names | Supported "label" column names |
|---|---|
| ['text', 'sentences', 'sentence', 'sample'] | ['label', 'labels', 'class', 'classes'] |
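
For illustration, a minimal sample.csv with these column names could be prepared as follows (a sketch only; the file name matches the Harness example below and the two reviews are shortened versions of the sample rows above):

import pandas as pd

# Two short movie reviews with binary sentiment labels, using the supported
# "text" and "label" column names
df = pd.DataFrame({
    "text": [
        "I thoroughly enjoyed Manna from Heaven. The ending was a wonderful surprise!",
        "Absolutely nothing is redeeming about this total piece of trash.",
    ],
    "label": [1, 0],
})

# Write the file that is later passed to the Harness as data={"data_source": "sample.csv"}
df.to_csv("sample.csv", index=False)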

Passing a CSV Text Classification Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task='text-classification',
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
                  data={"data_source":'sample.csv'},
                  config='config.yml')

Passing a Hugging Face Dataset for Text Classification to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="text-classification",
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub':'huggingface'},
                  data={"data_source":'glue',
                  "subset":"sst2",
                  "feature_column":"sentence",
                  "target_column":'label',
                  "split":"train",
                  "source": "huggingface"
                  })

Question Answering

To test Question Answering models, the user is meant to select a benchmark dataset. You can see the benchmarks page for all available benchmarks: Benchmarks

You can access the tutorial notebooks to get a quick start on your preferred dataset here: Dataset Notebooks

Passing a Question Answering Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="question-answering", 
                  model={"model": "text-davinci-003", "hub":"openai"}, 
                  data={"data_source" :"BBQ", "split":"test-tiny"}, config='config.yml')

Summarization

To test Summarization models, the user is meant to select a benchmark dataset from the available ones: Benchmarks

You can access the tutorial notebooks to get a quick start with your preferred dataset here: Dataset Notebooks

Passing a Summarization Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="summarization", 
                  model={"model": "text-davinci-003","hub":"openai"}, 
                  data={"data_source" :"XSum", "split":"test-tiny"},
                  config='config.yml')
   

Passing a Hugging Face Dataset for Summarization to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task="summarization", 
                  model={'model': 'text-davinci-003', 'hub':'openai'}, 
                  data={"data_source":'samsum',
                  "feature_column":"dialogue",
                  "target_column":'summary',
                  "split":"test",
                  "source": "huggingface"
                  })

Toxicity

This test checks the toxicity of model completions. The user is meant to select a benchmark dataset from the following list:

Benchmark Datasets

| Dataset | Source | Description |
|---------|--------|-------------|
| toxicity-test-tiny | Real Toxicity Prompts | Truncated set from the Real Toxicity Prompts dataset, containing 80 examples. |

Toxicity Benchmarks: Use Cases and Evaluations

| Dataset | Use Case | Notebook |
|---------|----------|----------|
| Real Toxicity Prompts | Evaluate your model’s accuracy in recognizing and handling toxic language with the Real Toxicity Prompts dataset. It contains real-world prompts from online platforms, ensuring robustness in NLP models to maintain safe environments. | Open In Colab |

Passing a Toxicity Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"text-generation", "category":"toxicity"}, 
                  model={"model": "text-davinci-002","hub":"openai"}, 
                  data={"data_source" :'Toxicity', "split":"test"})

Disinformation Test

This test evaluates the model’s disinformation generation capability. Users should choose a benchmark dataset from the provided list.

Datasets

| Dataset | Source | Description |
|---------|--------|-------------|
| Narrative-Wedging | Truth, Lies, and Automation: How Language Models Could Change Disinformation | Narrative-Wedging dataset, containing 26 labeled examples. |

Disinformation Test Dataset: Use Cases and Evaluations

| Dataset | Use Case | Notebook |
|---------|----------|----------|
| Narrative-Wedging | Assess the model’s capability to generate disinformation targeting specific groups, often based on demographic characteristics such as race and religion. | Open In Colab |

Passing a Disinformation Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "text-generation", "category": "disinformation-test"},
                  model={"model": "j2-jumbo-instruct", "hub": "ai21"},
                  data={"data_source": "Narrative-Wedging"})

Ideology Test

This test evaluates the model’s political orientation. There is one default dataset used for this test.

Datasets

| Dataset | Source | Description |
|---------|--------|-------------|
| Ideology Compass Questions | 3 Axis Political Compass Test | Political Compass questions, containing 40 questions for 2 axes. |

Passing an Ideology Dataset to the Harness

In the ideology test, the data is loaded automatically, since there is only one dataset available for now:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"question-answering", "category":"ideology"}, 
            model={'model': "text-davinci-003", "hub": "openai"})

Factuality Test

The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. Users should choose a benchmark dataset from the provided list.

Datasets

| Dataset | Source | Description |
|---------|--------|-------------|
| Factual-Summary-Pairs | LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper | Factual-Summary-Pairs, containing 371 labeled examples. |

Factuality Test Dataset: Use Cases and Evaluations

| Dataset | Use Case | Notebook |
|---------|----------|----------|
| Factual-Summary-Pairs | The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. | Open In Colab |

Passing a Factuality Test Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "question-answering", "category": "factuality-test"},
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "Factual-Summary-Pairs"})

Sensitivity Test

The Sensitivity Test comprises two distinct evaluations: one assesses a model’s responsiveness to toxicity, particularly when toxic words are introduced into the input text, and the other gauges its sensitivity to negations, especially when negations are inserted after verbs like “is,” “was,” “are,” and “were.” Users should choose a benchmark dataset from the provided list.

Datasets

| Dataset | Source | Description |
|---------|--------|-------------|
| NQ-open | Natural Questions: A Benchmark for Question Answering Research | Training & development set from the NaturalQuestions dataset, containing 3,569 labeled examples |
| NQ-open-test | Natural Questions: A Benchmark for Question Answering Research | Development set from the NaturalQuestions dataset, containing 1,769 labeled examples |
| NQ-open-test-tiny | Natural Questions: A Benchmark for Question Answering Research | Training, development & test set from the NaturalQuestions dataset, containing 50 labeled examples |
| OpenBookQA-test | OpenBookQA Dataset | Testing set from the OpenBookQA dataset, containing 500 multiple-choice elementary-level science questions |
| OpenBookQA-test-tiny | OpenBookQA Dataset | Truncated version of the test set from the OpenBookQA dataset, containing 50 multiple-choice examples. |
| wikiDataset-test | wikiDataset | Testing set from the wikiDataset, containing 1,000 sentences |
| wikiDataset-test-tiny | wikiDataset | Truncated version of the test set from the wikiDataset, containing 50 sentences. |

Test and Dataset Compatibility

| Test Name | Supported Dataset | Notebook |
|-----------|-------------------|----------|
| toxicity | wikiDataset-test, wikiDataset-test-tiny | Open In Colab |
| negation | NQ-open-test, NQ-open, NQ-open-test-tiny, OpenBookQA-test, OpenBookQA-test-tiny | Open In Colab |

Passing a Sensitivity Test Dataset to the Harness

In the Harness, we specify the data input in the following way:

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task": "question-answering", "category": "sensitivity-test"},
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "NQ-open", "split": "test-tiny"})

Sycophancy Test

Sycophancy is an undesirable behavior where models tailor their responses to align with a human user’s view even when that view is not objectively correct. This test uses simple synthetic data to evaluate and help reduce this behavior in language models.

Test and Dataset Compatibility

| Test Name | Supported Dataset | Notebook |
|-----------|-------------------|----------|
| sycophancy_math | sycophancy-math-data | Open In Colab |
| sycophancy_nlp | sycophancy-nlp-data | Open In Colab |

Passing a Sycophancy Math Dataset to the Harness

In the Harness, we specify the data input in the following way:

import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Import Harness from the LangTest library
from langtest import Harness

harness = Harness(task={"task":"question-answering", "category":"sycophancy-test"},
                  model={"model": "text-davinci-003","hub":"openai"}, 
                  data={"data_source": 'synthetic-math-data',})