The Harness class accepts a data parameter, which can be specified as a dictionary with the following attributes.
{
"data_source": "",
"subset": "",
"feature_column": "",
"target_column": "",
"split": "",
"source": "huggingface"
}
Key | Description |
---|---|
data_source (mandatory) | Represents the name of the dataset being used. |
subset (optional) | Indicates the subset of the dataset being considered. |
feature_column (optional) | Specifies the column that contains the input features. |
target_column (optional) | Represents the column that contains the target labels or categories. |
split (optional) | Denotes which split of the dataset should be used. |
source (optional) | Set to ‘huggingface’ when loading a Hugging Face dataset. |
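As a quick orientation before the task-specific sections below, here is a minimal sketch of how this dictionary is passed to the Harness. The model and file name are placeholders; a local file needs only data_source:
# Import Harness from the LangTest library
from langtest import Harness
# Local file: only data_source is required (the file name here is a placeholder)
harness = Harness(task="ner",
                  model={"model": "en_core_web_sm", "hub": "spacy"},
                  data={"data_source": "my_dataset.conll"})
For Hugging Face datasets, the remaining keys are added and source is set to ‘huggingface’, as shown in the task-specific examples below.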
Supported data_source formats are task-dependent. The following table provides an overview of the compatible data sources for each task.
Task | Supported Data Inputs |
---|---|
ner | CoNLL, CSV and HuggingFace Datasets |
text-classification | CSV and HuggingFace Datasets |
question-answering | Select list of benchmark datasets or HuggingFace Datasets |
summarization | Select list of benchmark datasets or HuggingFace Datasets |
toxicity | Select list of benchmark datasets |
clinical-tests | Select list of curated datasets |
disinformation-test | Select list of curated datasets |
political | Select list of curated datasets |
factuality test | Select list of curated datasets |
sensitivity test | Select list of curated datasets |
NER
There are three options for datasets to test NER models: CoNLL, CSV and HuggingFace datasets. Here are some details of what these may look like:
CoNLL Format for NER
Each line contains a token followed by its part-of-speech tag, syntactic chunk tag and NER tag, with blank lines separating sentences:
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NNP I-NP O
CSV Format for NER
Supported “text” column names | Supported “ner” column names | Supported “pos” column names | Supported “chunk” column names |
---|---|---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’, ‘ner_tag’, ‘ner_tags’, ‘ner’, ‘entity’] | [‘pos_tags’, ‘pos_tag’, ‘pos’, ‘part_of_speech’] | [‘chunk_tags’, ‘chunk_tag’] |
Passing a NER Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub': 'spacy'},
                  data={"data_source": 'test.conll'},
                  config='config.yml')  # either a CoNLL or a CSV file can be passed here (see the CSV example below)
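A CSV file with the supported column names can be passed in the same way. This is a sketch; sample.csv is a hypothetical file using the column names listed above:
# Import Harness from the LangTest library
from langtest import Harness
# sample.csv is a hypothetical CSV file with the supported "text" and "ner" column names
harness = Harness(task='ner',
                  model={'model': 'en_core_web_sm', 'hub': 'spacy'},
                  data={"data_source": 'sample.csv'},
                  config='config.yml')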
Passing a Hugging Face Dataset for NER to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="ner",
                  model={"model": "en_core_web_sm", "hub": "spacy"},
                  data={"data_source": "wikiann",
                        "subset": "en",
                        "feature_column": "tokens",
                        "target_column": "ner_tags",
                        "split": "test",
                        "source": "huggingface"})
Text Classification
There are two options for datasets to test Text Classification models: CSV files or HuggingFace Datasets, specified with the name, subset, split, feature_column and target_column needed to load them. Here are some details of what these may look like:
CSV Format for Text Classification
Here’s a sample dataset:
text | label |
---|---|
I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | 1 |
Absolutely nothing is redeeming about this total piece of trash, and the only thing worse than seeing this film is seeing it in English class. This is literally one of the worst films I have ever seen. It totally ignores and contradicts any themes it may present, so the story is just really really dull. Thank god the 80’s are over, and god save whatever man was actually born as “James Bond III”. | 0 |
For CSV files, we support different variations of the column names. They are shown below:
Supported “text” column names | Supported “label” column names |
---|---|
[‘text’, ‘sentences’, ‘sentence’, ‘sample’] | [‘label’, ‘labels’, ‘class’, ‘classes’] |
Passing a CSV Text Classification Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task='text-classification',
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub': 'huggingface'},
                  data={"data_source": 'sample.csv'},
                  config='config.yml')
Passing a Hugging Face Dataset for Text Classification to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="text-classification",
                  model={'model': 'mrm8488/distilroberta-finetuned-tweets-hate-speech', 'hub': 'huggingface'},
                  data={"data_source": "glue",
                        "subset": "sst2",
                        "feature_column": "sentence",
                        "target_column": "label",
                        "split": "train",
                        "source": "huggingface"})
Question Answering
To test Question Answering models, select a benchmark dataset. The benchmarks page lists all available benchmarks: Benchmarks
You can access the tutorial notebooks to get a quick start on your preferred dataset here: Dataset Notebooks
Passing a Question Answering Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="question-answering",
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "BBQ", "split": "test-tiny"},
                  config='config.yml')
Summarization
To test Summarization models, select a benchmark dataset from the available ones: Benchmarks
You can access the tutorial notebooks to get a quick start with your preferred dataset here: Dataset Notebooks
Passing a Summarization Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "XSum", "split": "test-tiny"},
                  config='config.yml')
Passing a Hugging Face Dataset for Summarization to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task="summarization",
                  model={'model': 'text-davinci-003', 'hub': 'openai'},
                  data={"data_source": "samsum",
                        "feature_column": "dialogue",
                        "target_column": "summary",
                        "split": "test",
                        "source": "huggingface"})
Toxicity
This test checks the toxicity of model completions. The user selects a benchmark dataset from the following list:
Benchmark Datasets
Dataset | Source | Description |
---|---|---|
toxicity-test-tiny | Real Toxicity Prompts | Truncated set from the Real Toxicity Prompts Dataset, containing 80 examples. |
Toxicity Benchmarks: Use Cases and Evaluations
Passing a Toxicity Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "text-generation", "category": "toxicity"},
                  model={"model": "text-davinci-002", "hub": "openai"},
                  data={"data_source": "Toxicity", "split": "test"})
Disinformation Test
This test evaluates the model’s disinformation generation capability. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description |
---|---|---|
Narrative-Wedging | Truth, Lies, and Automation How Language Models Could Change Disinformation | Narrative-Wedging dataset, containing 26 labeled examples. |
Disinformation Test Dataset: Use Cases and Evaluations
Dataset | Use Case |
---|---|
Narrative-Wedging | Assess the model’s capability to generate disinformation targeting specific groups, often based on demographic characteristics such as race and religion. |
Passing a Disinformation Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "text-generation", "category": "disinformation-test"},
                  model={"model": "j2-jumbo-instruct", "hub": "ai21"},
                  data={"data_source": "Narrative-Wedging"})
Ideology Test
This test evaluates the model’s political orientation. There is one default dataset used for this test.
Datasets
Dataset | Source | Description |
---|---|---|
Ideology Compass Questions | 3 Axis Political Compass Test | Political Compass questions, containing 40 questions for 2 axes. |
Passing an Ideology Dataset to the Harness
In ideology test, the data is automatically loaded since there is only one dataset available for now:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "question-answering", "category": "ideology"},
                  model={"model": "text-davinci-003", "hub": "openai"})
Factuality Test
The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments. Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description |
---|---|---|
Factual-Summary-Pairs | LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper | Factual-Summary-Pairs, containing 371 labeled examples. |
Factuality Test Dataset: Use Cases and Evaluations
Passing a Factuality Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "question-answering", "category": "factuality-test"},
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "Factual-Summary-Pairs"})
Sensitivity Test
The Sensitivity Test comprises two distinct evaluations: one assesses a model’s responsiveness to toxicity when toxic words are introduced into the input text, and the other gauges its sensitivity to negations inserted after verbs such as “is,” “was,” “are,” and “were.” Users should choose a benchmark dataset from the provided list.
Datasets
Dataset | Source | Description |
---|---|---|
NQ-open | Natural Questions: A Benchmark for Question Answering Research | Training & development set from the NaturalQuestions dataset, containing 3,569 labeled examples |
NQ-open-test | Natural Questions: A Benchmark for Question Answering Research | Development set from the NaturalQuestions dataset, containing 1,769 labeled examples |
NQ-open-test-tiny | Natural Questions: A Benchmark for Question Answering Research | Training, development & test set from the NaturalQuestions dataset, containing 50 labeled examples |
OpenBookQA-test | OpenBookQA Dataset | Testing set from the OpenBookQA dataset, containing 500 multiple-choice elementary-level science questions |
OpenBookQA-test-tiny | OpenBookQA Dataset | Truncated version of the test set from the OpenBookQA dataset, containing 50 multiple-choice examples. |
wikiDataset-test | wikiDataset | Testing set from the wikiDataset, containing 1000 sentences |
wikiDataset-test-tiny | wikiDataset | Truncated version of the test set from the wikiDataset, containing 50 sentences. |
Test and Dataset Compatibility
Test Name | Supported Dataset |
---|---|
toxicity | wikiDataset-test, wikiDataset-test-tiny |
negation | NQ-open-test, NQ-open, NQ-open-test-tiny, OpenBookQA-test, OpenBookQA-test-tiny |
Passing a Sensitivity Test Dataset to the Harness
In the Harness, we specify the data input in the following way:
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "question-answering", "category": "sensitivity-test"},
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "NQ-open", "split": "test-tiny"})
Sycophancy Test
Sycophancy is an undesirable behavior where models tailor their responses to align with a human user’s view even when that view is not objectively correct. A simple synthetic data intervention can be used to reduce this behavior in language models.
Test and Dataset Compatibility
Test Name | Supported Dataset |
---|---|
sycophancy_math | sycophancy-math-data |
sycophancy_nlp | sycophancy-nlp-data |
Passing a Sycophancy Math Dataset to the Harness
In the Harness, we specify the data input in the following way:
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
# Import Harness from the LangTest library
from langtest import Harness
harness = Harness(task={"task": "question-answering", "category": "sycophancy-test"},
                  model={"model": "text-davinci-003", "hub": "openai"},
                  data={"data_source": "synthetic-math-data"})