NER


Named Entity Recognition (NER) is a natural language processing (NLP) technique that involves identifying and classifying named entities within text. Named entities are specific pieces of information such as names of people, places, organizations, dates, times, quantities, monetary values, percentages, and more. NER plays a crucial role in understanding and extracting meaningful information from unstructured text data.

The primary goal of Named Entity Recognition is to locate and label these named entities in a given text. By categorizing them into predefined classes, NER algorithms make it possible to extract structured information from text, turning it into a more usable format for various applications like information retrieval, text mining, question answering, sentiment analysis, and more.

| Supported Test Category | Supported Data                      |
|-------------------------|-------------------------------------|
| Accuracy                | CoNLL, CSV and HuggingFace Datasets |
| Bias                    | CoNLL, CSV and HuggingFace Datasets |
| Fairness                | CoNLL, CSV and HuggingFace Datasets |
| Robustness              | CoNLL, CSV and HuggingFace Datasets |
| Representation          | CoNLL, CSV and HuggingFace Datasets |

To get more information about the supported data, click here.

Task Specification

When specifying the task for Named Entity Recognition (NER), use the following format:

```python
task: str

task = "ner"
```
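
The task string is consumed when a test harness is constructed. Below is a minimal sketch, assuming the langtest library's Harness API and an off-the-shelf Hugging Face NER model; both names are illustrative assumptions rather than requirements:

```python
# Minimal sketch: passing task="ner" into a test harness.
# Assumes the langtest library is installed and that the Hugging Face
# model "dslim/bert-base-NER" is an acceptable stand-in for your model.
from langtest import Harness

harness = Harness(
    task="ner",  # the task string described above
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
    data={"data_source": "path/to/sample.conll"},  # CoNLL, CSV, or HF dataset
)

harness.generate().run().report()  # generate test cases, run them, summarize
```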

Accuracy

Accuracy testing is crucial for evaluating the overall performance of NER models. It measures how well a model predicts outcomes on a test dataset it has not seen before.

How it works:

  • The model processes the original text, producing an actual_result.
  • The expected_result in accuracy is the ground truth, obtained directly from the dataset.
  • During evaluation, the expected_result and actual_result are compared using the tests within the accuracy category, such as precision, recall, F1 score, micro F1 score, macro F1 score, and weighted F1 score (see the sketch after this list).
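
As an illustration of that comparison, here is a small, self-contained sketch (plain Python, not the library's internal implementation) that scores per-label F1 from expected_result and actual_result label sequences and aggregates it into micro and macro averages:

```python
from collections import Counter

def ner_accuracy(expected, actual):
    """Score actual NER labels against expected (ground-truth) labels.

    expected/actual: lists of per-token label sequences, e.g. [["B-LOC", "O"], ...]
    Returns per-label F1 plus micro and macro F1.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp_seq, act_seq in zip(expected, actual):
        for exp, act in zip(exp_seq, act_seq):
            if exp == act and exp != "O":
                tp[exp] += 1          # correct entity label
            elif exp != act:
                if act != "O":
                    fp[act] += 1      # predicted entity that is wrong
                if exp != "O":
                    fn[exp] += 1      # gold entity that was missed

    def f1(t, p, n):
        prec = t / (t + p) if t + p else 0.0
        rec = t / (t + n) if t + n else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    labels = set(tp) | set(fp) | set(fn)
    per_label = {l: f1(tp[l], fp[l], fn[l]) for l in labels}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_label.values()) / len(per_label) if per_label else 0.0
    return per_label, micro, macro

per_label, micro, macro = ner_accuracy(
    [["B-LOC", "O", "B-PER"]], [["B-LOC", "O", "O"]]
)
print(per_label, micro, macro)  # {'B-LOC': 1.0, 'B-PER': 0.0} 0.666... 0.5
```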

Bias

Model bias refers to the phenomenon where the model produces results that are systematically skewed in a particular direction, potentially perpetuating stereotypes or discriminating against specific genders, ethnicities, religions, or countries.

How it works:

| test_type | original | test_case | expected_result | actual_result | pass |
|-----------|----------|-----------|-----------------|---------------|------|
| replace_to_low_income_country | Japan coach Shu Kamo said : ' ' The Syrian own goal proved lucky for us . | Zambia coach Shu Kamo said : ' ' The Syrian own goal proved lucky for us . | Japan: LOC, Shu Kamo: PER, Syrian: MISC | Zambia: LOC, Shu Kamo: PER, Syrian: MISC | True |
| replace_to_high_income_country | Two goals from defensive errors in the last six minutes allowed Japan to come from behind and collect all three points from their opening meeting against Syria . | Two goals from defensive errors in the last six minutes allowed Japan to come from behind and collect all three points from their opening meeting against Sint Maarten . | Japan: LOC, Syria: LOC | Japan: LOC, Sint Maarten: PER | False |
  • Perturbations are introduced to the original text, resulting in a perturbed test_case. This perturbation process uses a dictionary-based approach, in which gender names, ethnicity names, religion names, or country names are randomly swapped (see the sketch after this list). Additionally, users have the flexibility to provide their own custom data or append data to the existing dictionary, allowing for greater control over these tests.
  • It’s important to note that when we add perturbations to the original text, we also track the span of the words that are perturbed. This allows us to determine the indices of those words, simplifying the process of realigning the original text with the perturbed text.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
  • During evaluation, the predicted entities in the expected and actual results are compared to assess the model’s performance.
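
The dictionary-based swap and span tracking described above can be sketched as follows; the dictionary contents and alignment logic here are illustrative stand-ins, not the library's internal code:

```python
import re

# Illustrative country dictionary; in practice users can supply or extend it.
LOW_INCOME_COUNTRIES = {"Japan": "Zambia", "Syria": "Malawi"}

def replace_to_low_income_country(text):
    """Swap country names and record the spans of the perturbed words."""
    spans = []  # (start, end, original, replacement) in the original text
    out, cursor = [], 0
    pattern = r"\b(" + "|".join(LOW_INCOME_COUNTRIES) + r")\b"
    for match in re.finditer(pattern, text):
        replacement = LOW_INCOME_COUNTRIES[match.group(1)]
        spans.append((match.start(), match.end(), match.group(1), replacement))
        out.append(text[cursor:match.start()])
        out.append(replacement)
        cursor = match.end()
    out.append(text[cursor:])
    return "".join(out), spans

original = "Japan coach Shu Kamo said the Syrian own goal proved lucky for us ."
test_case, spans = replace_to_low_income_country(original)
print(test_case)  # Zambia coach Shu Kamo said the Syrian own goal ...
print(spans)      # [(0, 5, 'Japan', 'Zambia')]

# expected_result = model(original); actual_result = model(test_case)
# The tracked spans let us realign entity predictions before comparing them.
```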

Fairness

Fairness testing is a critical aspect of evaluating the performance of a machine learning model, especially when the model has potential implications for specific groups of people. Fairness testing aims to ensure that the model is not biased towards or against any particular group and that it produces unbiased results for all groups.

How it works:

  • We categorize the input data into three groups: “male,” “female,” and “unknown.”
  • The model processes the original sentences from the categorized input data, producing an actual_result for each sample.
  • The expected_result in fairness is the ground truth obtained from the dataset.
  • During evaluation, the predicted entities in the expected and actual results are compared for each group to assess the model’s performance (see the sketch after this list).
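
A minimal sketch of the group-wise comparison follows; the grouping is assumed to have been done already, and the pass threshold is an illustrative placeholder:

```python
from collections import defaultdict

def evaluate_fairness(samples, min_score=0.66):
    """Compare expected vs. actual entity labels per gender group.

    samples: dicts with keys "group" ("male"/"female"/"unknown"),
             "expected" and "actual" per-token label lists.
    Returns a score and pass/fail per group based on exact label matches.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        for exp, act in zip(s["expected"], s["actual"]):
            correct[s["group"]] += exp == act
            total[s["group"]] += 1
    return {g: {"score": correct[g] / total[g],
                "pass": correct[g] / total[g] >= min_score}
            for g in total}

samples = [
    {"group": "male", "expected": ["B-PER", "O"], "actual": ["B-PER", "O"]},
    {"group": "female", "expected": ["B-PER", "O"], "actual": ["O", "O"]},
]
print(evaluate_fairness(samples))
# {'male': {'score': 1.0, 'pass': True}, 'female': {'score': 0.5, 'pass': False}}
```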

Representation

The goal of representation testing is to assess whether a dataset accurately represents a specific population or if biases within it could adversely affect the results of any analysis.

How it works:

  • From the dataset, the original sentences are extracted.
  • Subsequently, a classifier or dictionary is used to determine the representation proportion and representation count for the applied test, counting gender names, ethnicity names, religion names, or country names according to the test criteria (see the sketch after this list). Additionally, users have the flexibility to provide their own custom data or append data to the existing dictionary, allowing for greater control over these tests.
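
A dictionary-based sketch of computing representation count and representation proportion for country names; the dictionary here is a small illustrative stand-in for the user-extensible one described above:

```python
from collections import Counter

# Illustrative dictionary; users can supply or extend their own lists.
COUNTRY_NAMES = {"Japan", "Syria", "Uzbekistan", "Zambia"}

def country_representation(sentences):
    """Return representation count and proportion of country mentions."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            if token in COUNTRY_NAMES:
                counts[token] += 1
    total = sum(counts.values())
    proportions = {c: n / total for c, n in counts.items()} if total else {}
    return counts, proportions

sentences = [
    "Japan came from behind against Syria .",
    "Uzbekistan are in the finals as outsiders .",
]
counts, proportions = country_representation(sentences)
print(counts)       # Counter({'Japan': 1, 'Syria': 1, 'Uzbekistan': 1})
print(proportions)  # each country at 1/3 of the counted mentions
```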

Robustness

Robustness testing aims to evaluate the ability of a model to maintain consistent performance when faced with various perturbations or modifications in the input data.

How it works:

| test_type | original | test_case | expected_result | actual_result | pass |
|-----------|----------|-----------|-----------------|---------------|------|
| add_ocr_typo | " He ended the World Cup on the wrong note , " Coste said . | " He ended t^e woiUd Cup on t^e v/rong notc , " Coste said . | World Cup: MISC, Coste: PER | woiUd Cup: MISC, Coste: PER | True |
| uppercase | Despite winning the Asian Games title two years ago , Uzbekistan are in the finals as outsiders . | DESPITE WINNING THE ASIAN GAMES TITLE TWO YEARS AGO , UZBEKISTAN ARE IN THE FINALS AS OUTSIDERS . | Asian Games: MISC, Uzbekistan: LOC | ASIAN: MISC, UZBEKISTAN: LOC | False |
  • Perturbations, such as lowercase, uppercase, typos, etc., are introduced to the original text, resulting in a perturbed test_case.
  • It’s important to note that when we add perturbations to the original text, we also track the span of the words that are perturbed. This allows us to determine the indices of those words, simplifying the process of realigning the original text with the perturbed text.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
  • During evaluation, the predicted entities in the expected and actual results are compared to assess the model’s performance (see the sketch after this list).
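
A minimal sketch of a robustness check built around the uppercase perturbation; the model calls are stubbed out with the predictions from the table above:

```python
def uppercase(text):
    """Uppercase perturbation: the simplest robustness transformation."""
    return text.upper()

def compare_entities(expected, actual):
    """Pass when the perturbed prediction preserves the expected entities.

    expected/actual: lists of (entity_text, label) pairs; entity text is
    compared case-insensitively so the perturbation itself is not penalized.
    """
    norm = lambda ents: {(text.lower(), label) for text, label in ents}
    return norm(expected) == norm(actual)

original = "Despite winning the Asian Games title , Uzbekistan are outsiders ."
test_case = uppercase(original)

expected_result = [("Asian Games", "MISC"), ("Uzbekistan", "LOC")]  # model(original)
actual_result = [("ASIAN", "MISC"), ("UZBEKISTAN", "LOC")]          # model(test_case)

print(compare_entities(expected_result, actual_result))  # False: "Games" was dropped
```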