Text Classification

 

Text classification is a common natural language processing (NLP) task in which text is assigned one of a set of labels or classes. Many companies run text classification in their production systems. One of the most popular applications is sentiment analysis, which assigns labels such as 🙂 positive, 🙁 negative, or 😐 neutral to text sequences.

| Supported Test Category | Supported Data |
| --- | --- |
| Accuracy | CSV and HuggingFace Datasets |
| Bias | CSV and HuggingFace Datasets |
| Fairness | CSV and HuggingFace Datasets |
| Robustness | CSV and HuggingFace Datasets |
| Representation | CSV and HuggingFace Datasets |
| Grammar | CSV and HuggingFace Datasets |


Task Specification

When specifying the task for Text Classification, use the following format:

task: str

task = "text-classification"

Accuracy

Accuracy testing is crucial for evaluating the overall performance of Text Classification models. It involves measuring how well the models can correctly predict outcomes on a test dataset they have not seen before.

How it works:

  • The model processes the original text and produces an actual_result.
  • The expected_result is the ground-truth label taken from the dataset.
  • During evaluation, expected_result and actual_result are compared using the tests in the accuracy category, such as precision, recall, F1 score, micro F1 score, macro F1 score, and weighted F1 score.
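The comparison described above can be sketched in plain Python. This is a minimal illustration, not the implementation used by the testing tool: the function name `f1_report` and the sample labels are assumptions made for the example.

```python
from collections import Counter


def f1_report(expected, actual):
    """Per-class precision/recall/F1 plus micro, macro, and weighted F1.

    `expected` holds ground-truth labels, `actual` holds model predictions.
    (Hypothetical helper written for this example.)
    """
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and equal in length")

    labels = sorted(set(expected) | set(actual))
    support = Counter(expected)  # number of ground-truth samples per class
    pairs = list(zip(expected, actual))
    per_class = {}
    tp_total = fp_total = fn_total = 0

    for label in labels:
        tp = sum(e == label and a == label for e, a in pairs)
        fp = sum(e != label and a == label for e, a in pairs)
        fn = sum(e == label and a != label for e, a in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn

    # Micro F1 pools the counts over classes; macro F1 averages per-class F1
    # equally; weighted F1 averages per-class F1 weighted by class support.
    micro_denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / micro_denom if micro_denom else 0.0
    macro = sum(m["f1"] for m in per_class.values()) / len(labels)
    weighted = sum(per_class[l]["f1"] * support[l] for l in labels) / len(expected)
    return {"per_class": per_class, "micro_f1": micro,
            "macro_f1": macro, "weighted_f1": weighted}


# Illustrative data only.
expected_result = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
actual_result = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
report = f1_report(expected_result, actual_result)
print(report)
```

For single-label classification, micro F1 equals plain accuracy, which is why it is often reported alongside the macro and weighted averages rather than instead of them.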

Bias

Model bias refers to the phenomenon where the model produces results that are systematically skewed in a particular direction, potentially perpetuating stereotypes or discriminating against specific genders, ethnicities, religions, or countries.

How it works:

| test_type | original | test_case | expected_result | actual_result | pass |
| --- | --- | --- | --- | --- | --- |
| replace_to_inter_racial_lastnames | with spy kids 2 : the island of lost dreams writer/director/producer robert rodriguez has cobbled together a film that feels like a sugar high gone awry . | with spy kids 2 : the island of lost dreams writer/director/producer Ting Hosein has cobbled together a film that feels like a sugar high gone awry . | NEGATIVE | NEGATIVE | True |
| replace_to_jain_names | george , hire a real director and good writers for the next installment , please . | Sudharma , hire a real director and good writers for the next installment , please . | POSITIVE | POSITIVE | True |

  • Perturbations are introduced to the original text, resulting in a perturbed test_case. The perturbation is dictionary-based: names associated with a gender, ethnicity, religion, or country are randomly swapped. Users can also provide their own custom data or append data to the existing dictionary for greater control over these tests.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
  • During evaluation, the predicted labels in the expected and actual results are compared to assess the model’s performance.
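The dictionary-based swap and the label comparison described above can be sketched as follows. Everything here is illustrative: `swap_names`, `run_bias_test`, the keyword-based `toy_model`, and the name lists (including "Rishabh") are assumptions for the example, not the tool's own dictionary or API.

```python
import random
import re


def swap_names(text, source_names, target_names, rng):
    """Replace each whole-word source name with a randomly chosen target name."""
    if not source_names or not target_names:
        return text
    # Longest names first so multi-word names win over shorter overlaps.
    ordered = sorted(source_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda match: rng.choice(target_names), text)


def toy_model(text):
    """Stand-in classifier (hypothetical): keyword match, ignores names."""
    lowered = text.lower()
    return "NEGATIVE" if ("awry" in lowered or "please" in lowered) else "POSITIVE"


def run_bias_test(original, source_names, target_names, model, seed=0):
    """Perturb the text, classify both versions, and compare the labels."""
    rng = random.Random(seed)  # seeded so the random swap is reproducible
    test_case = swap_names(original, source_names, target_names, rng)
    expected_result = model(original)   # prediction on the original text
    actual_result = model(test_case)    # prediction on the perturbed text
    return {
        "original": original,
        "test_case": test_case,
        "expected_result": expected_result,
        "actual_result": actual_result,
        "pass": expected_result == actual_result,
    }


row = run_bias_test(
    "george , hire a real director and good writers for the next installment , please .",
    source_names=["george", "robert rodriguez"],
    target_names=["Sudharma", "Rishabh"],
    model=toy_model,
)
print(row)
```

Note that expected_result here is the model's own prediction on the unperturbed text, not a ground-truth label, so the test measures consistency under the name swap rather than correctness.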

Fairness

Fairness testing is a critical aspect of evaluating a machine learning model, especially when the model's output has implications for specific groups of people. It aims to ensure that the model is not biased towards or against any particular group and that it performs comparably across all groups.

How it works:

  • We categorize the input data into three groups: “male,” “female,” and “unknown.”
  • The model processes the original sentences from the categorized input data, producing an actual_result for each sample.
  • The expected_result is the ground truth taken from the dataset.
  • During evaluation, the predicted labels in the expected and actual results are compared for each group to assess the model’s performance.
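A rough sketch of the per-group comparison follows. How samples are actually assigned to "male", "female", and "unknown" is not stated above, so the pronoun-based `gender_group` rule is purely an assumption for illustration, as are the toy model, the sample sentences, and the choice of accuracy as the per-group score.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists; a real grouping step is likely more involved.
MALE_WORDS = {"he", "him", "his", "mr"}
FEMALE_WORDS = {"she", "her", "hers", "mrs", "ms"}


def gender_group(text):
    """Assign a sample to 'male', 'female', or 'unknown' by pronoun counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    male = sum(token in MALE_WORDS for token in tokens)
    female = sum(token in FEMALE_WORDS for token in tokens)
    if male > female:
        return "male"
    if female > male:
        return "female"
    return "unknown"


def toy_model(text):
    """Stand-in classifier (hypothetical): POSITIVE if 'great' appears."""
    return "POSITIVE" if "great" in text.lower() else "NEGATIVE"


def per_group_accuracy(samples, model):
    """Compare expected_result and actual_result separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text, expected_result in samples:
        group = gender_group(text)
        actual_result = model(text)
        totals[group] += 1
        hits[group] += int(actual_result == expected_result)
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative (text, ground-truth label) pairs.
samples = [
    ("He gave a great performance.", "POSITIVE"),
    ("His jokes fall flat.", "NEGATIVE"),
    ("She is great in a dull film.", "NEGATIVE"),
    ("Her timing is great.", "POSITIVE"),
    ("A great soundtrack.", "POSITIVE"),
]
scores = per_group_accuracy(samples, toy_model)
print(scores)
```

A large gap between the group scores is the signal a fairness test looks for; the same loop works with any per-group metric in place of accuracy.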

Grammar

The Grammar Test assesses NLP models’ proficiency in intelligently identifying and correcting intentional grammatical errors. This ensures refined language understanding and enhances overall processing quality, contributing to the model’s linguistic capabilities.

How it works:

| test_type | original | test_case | expected_result | actual_result | pass |
| --- | --- | --- | --- | --- | --- |
| paraphrase | This is one of those movies that showcases a great actor’s talent and also conveys a great story. It is one of Stewart’s greatest movies. Barring a few historic errors it also does an excellent job of telling the story of the “Spirit of St. Louis”. | This film is one of those movies that showcases a great actor and also conveys a great story. Stewart considers it one of his greatest accomplishments. Despite a few historical mistakes, it still manages to tell the story of the “Spirit of St. Louis”. | POSITIVE | POSITIVE | True |
| paraphrase | I thoroughly enjoyed Manna from Heaven. The hopes and dreams and perspectives of each of the characters is endearing and we, the audience, get to know each and every one of them, warts and all. And the ending was a great, wonderful and uplifting surprise! Thanks for the experience; I’ll be looking forward to more. | Manna from Heaven was a wonderful movie, with all the cute and unexpected things in life. Each character has such a unique and heartfelt personality, and the ending is a surprise that I couldn’t have asked for in the past. | POSITIVE | POSITIVE | True |

  • During the perturbation process, we paraphrase the original text, resulting in a perturbed test_case.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
  • During evaluation, the predicted labels in the expected and actual results are compared to assess the model’s performance.

Representation

The goal of representation testing is to assess whether a dataset accurately represents a specific population or if biases within it could adversely affect the results of any analysis.

How it works:

  • The original sentences are extracted from the dataset.
  • A classifier or dictionary is then used to determine the representation proportion and representation count for the applied test. Depending on the test, this means counting gender names, ethnicity names, religion names, or country names. Users can also provide their own custom data or append data to the existing dictionary for greater control over these tests.
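As a minimal sketch of the dictionary-based variant, the snippet below counts how often each group is mentioned and turns the counts into proportions. The `COUNTRY_TO_REGION` dictionary, the `representation` helper, and the sample sentences are hypothetical and only stand in for the much larger dictionaries a real test would use.

```python
import re
from collections import Counter

# Hypothetical dictionary mapping country names to a group label.
COUNTRY_TO_REGION = {
    "india": "Asia",
    "japan": "Asia",
    "france": "Europe",
    "brazil": "South America",
}


def representation(sentences, term_to_group):
    """Return (representation count, representation proportion) per group."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):
            group = term_to_group.get(token)
            if group is not None:
                counts[group] += 1
    total = sum(counts.values())
    # Proportions are taken over all dictionary matches, not over sentences.
    proportions = {g: c / total for g, c in counts.items()} if total else {}
    return dict(counts), proportions


# Illustrative sentences only.
sentences = [
    "A thriller shot in India and Japan.",
    "The director moved from France to India.",
    "No country is named here.",
]
counts, proportions = representation(sentences, COUNTRY_TO_REGION)
print(counts)
print(proportions)
```

Appending custom entries amounts to updating the dictionary before the call, for example `COUNTRY_TO_REGION["kenya"] = "Africa"`.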

Robustness

Robustness testing aims to evaluate the ability of a model to maintain consistent performance when faced with various perturbations or modifications in the input data.

How it works:

| test_type | original | test_case | expected_result | actual_result | pass |
| --- | --- | --- | --- | --- | --- |
| add_ocr_typo | director rob marshall went out gunning to make a great one . | diredor rob marshall went o^ut gunning t^o makc a grcat o^ne . | POSITIVE | NEGATIVE | False |
| uppercase | an amusing , breezily apolitical documentary about life on the campaign trail . | AN AMUSING , BREEZILY APOLITICAL DOCUMENTARY ABOUT LIFE ON THE CAMPAIGN TRAIL . | POSITIVE | POSITIVE | True |

  • Perturbations, such as lowercase, uppercase, typos, etc., are introduced to the original text, resulting in a perturbed test_case.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
  • During evaluation, the predicted labels in the expected and actual results are compared to assess the model’s performance.
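The perturb-and-compare loop above can be sketched with a few simple perturbations and a per-test pass rate. The perturbation functions, the `swap_typo` rule, and the deliberately case-sensitive toy model are assumptions made for this example; they are not the tool's own perturbations, and the toy model is built to fail on uppercase input so that the pass rate has something to show.

```python
import random


def uppercase(text, rng):
    return text.upper()


def lowercase(text, rng):
    return text.lower()


def swap_typo(text, rng):
    """Swap two adjacent characters at a random position (a simple typo)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


# Hypothetical registry of perturbations, keyed by test_type.
PERTURBATIONS = {"uppercase": uppercase, "lowercase": lowercase, "swap_typo": swap_typo}


def case_sensitive_model(text):
    """Stand-in classifier (hypothetical): matches lowercase 'amusing' only."""
    return "POSITIVE" if "amusing" in text else "NEGATIVE"


def run_robustness(texts, model, seed=0):
    """Build one result row per (test_type, original text) pair."""
    rng = random.Random(seed)
    rows = []
    for test_type, perturb in PERTURBATIONS.items():
        for original in texts:
            test_case = perturb(original, rng)
            expected_result = model(original)
            actual_result = model(test_case)
            rows.append({
                "test_type": test_type,
                "original": original,
                "test_case": test_case,
                "expected_result": expected_result,
                "actual_result": actual_result,
                "pass": expected_result == actual_result,
            })
    return rows


def pass_rate(rows):
    """Fraction of passing rows for each test_type."""
    totals, passed = {}, {}
    for row in rows:
        totals[row["test_type"]] = totals.get(row["test_type"], 0) + 1
        passed[row["test_type"]] = passed.get(row["test_type"], 0) + int(row["pass"])
    return {t: passed[t] / totals[t] for t in totals}


texts = ["an amusing , breezily apolitical documentary about life on the campaign trail ."]
rows = run_robustness(texts, case_sensitive_model)
rates = pass_rate(rows)
print(rates)
```

Reporting a pass rate per test_type, rather than one overall number, makes it easier to see which kind of perturbation the model is sensitive to.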