Robustness testing aims to evaluate the ability of a model to maintain consistent performance when faced with various perturbations or modifications in the input data.

How it works:

A sample result row:

  • category: robustness
  • test_type: uppercase
  • original: The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation - a charity to raise money for Nigerian sport. Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42. Appearing at the Old Bailey earlier, all four…
  • test_case: THE EX-READING DEFENDER DENIED FRAUDULENT TRADING CHARGES RELATING TO THE SODJE SPORTS FOUNDATION - A CHARITY TO RAISE MONEY FOR NIGERIAN SPORT. MR SODJE, 37, IS JOINTLY CHARGED WITH ELDER BROTHERS EFE, 44, BRIGHT, 50 AND STEPHEN, 42. APPEARING AT THE OLD BAILEY EARLIER, ALL FOUR…
  • expected_result: Ex-Reading defender denies fraudulent trading charges linked to Nigerian sports charity. Sam, Efe, Bright, and Stephen Sodje, brothers and co-founders of the charity, pleaded not guilty. The alleged offences occurred between 2008 and 2014. The trial is set for July
  • actual_result: The Sodje Four brothers, including former Reading defender Sam Sodje, have denied charges of fraudulent trading in relation to the Sodje Sports Foundation, a charity to raise money for Nigerian sports. The alleged offences occurred between 2008 and 2014 and the trial is set for July. The brothers were released on bail
  • eval_score: 0.565591
  • pass: True
  • Introducing perturbations to the original_context and original_question, resulting in perturbed_context and perturbed_question.
  • The model processes both the original and perturbed inputs, resulting in expected_result and actual_result respectively.
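The two steps above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `summarize` callable standing in for the model; the `uppercase` perturbation shown here is one of many that can be applied.

```python
# Toy sketch of the robustness flow: perturb the input, then run the
# model on both the original and perturbed versions.

def uppercase_perturbation(text: str) -> str:
    """Apply the 'uppercase' robustness perturbation to the input."""
    return text.upper()

def run_robustness_sample(summarize, original: str):
    test_case = uppercase_perturbation(original)   # perturbed input
    expected_result = summarize(original)          # summary of original
    actual_result = summarize(test_case)           # summary of perturbed
    return test_case, expected_result, actual_result

# Toy stand-in model: returns the first "sentence" as the summary.
toy_model = lambda text: text.split(".")[0]

case, expected, actual = run_robustness_sample(
    toy_model, "The defender denied the charges. A trial is set for July."
)
```

A robust model should produce similar summaries for `original` and `test_case`, which is what the evaluation step below measures.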

Evaluation Criteria

The model processes the original and perturbed inputs, producing an expected_result and an actual_result for each sample.

Input Data: Provide the model with the document to be summarized.

Model Response: Examine the summary generated by the model in response to the input document.

Evaluation using Rouge or BERTScore:

  • If the chosen metric is Rouge, compute Rouge scores for predictions and references.
  • If the chosen metric is BERTScore, compute BERTScore for predictions and references.

Comparison with Threshold:

  • Compare the computed evaluation score with the threshold specified in the configuration. (Default threshold: 0.50)
  • If the score meets or exceeds the threshold, consider the evaluation to have passed.
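The metric-and-threshold step can be sketched with a simplified stand-in metric. The function below computes a ROUGE-1-style unigram F1 between the prediction (actual_result) and the reference (expected_result); an actual run would use a proper ROUGE or BERTScore implementation rather than this toy version.

```python
# Simplified stand-in for the evaluation step: a ROUGE-1-style unigram
# F1 score, then a pass/fail decision against the configured threshold.
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())   # matched unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def passes(prediction: str, reference: str, threshold: float = 0.50) -> bool:
    """True when the evaluation score meets or exceeds the threshold."""
    return rouge1_f1(prediction, reference) >= threshold

ok = passes("the brothers denied the charges",
            "the brothers denied all charges")
```

Here four of five unigrams overlap, giving an F1 of 0.8, so the sample passes the default 0.50 threshold.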


The primary goal of bias tests is to explore how replacing references to different genders, ethnicities, religions, or countries in documents impacts the model’s predictions compared with its behavior on the original data.

How it works:

To evaluate summarization models for bias, we use the bias split of the XSum dataset.

  • No perturbations are added to the inputs, as the dataset already contains those perturbations.
  • The model processes these original inputs and perturbed inputs, producing an expected_result and actual_result respectively.
  • The evaluation process employs the same approach used in robustness.


Accuracy testing is vital for evaluating an NLP model’s performance. It gauges the model’s ability to predict outcomes on an unseen test dataset by comparing its predictions with the ground truth.

How it works:

  • The model processes original_context and original_question, producing an actual_result.
  • The expected_result in accuracy is the ground truth.
  • In the evaluation process, we compare the expected_result and actual_result based on the tests within the accuracy category.
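The comparison step can be sketched as below. For brevity this toy version uses a normalized exact match between expected_result (ground truth) and actual_result; the actual accuracy tests would score with metrics such as ROUGE or BERTScore.

```python
# Toy sketch of the accuracy comparison: expected_results are the
# ground-truth summaries, actual_results are the model outputs.

def exact_match_accuracy(expected_results, actual_results):
    """Fraction of samples where prediction matches ground truth
    after whitespace and case normalization."""
    matches = sum(e.strip().lower() == a.strip().lower()
                  for e, a in zip(expected_results, actual_results))
    return matches / len(expected_results)

score = exact_match_accuracy(
    ["trial set for july", "charges denied"],
    ["Trial set for July", "charges dropped"],
)
```

One of the two samples matches after normalization, giving an accuracy of 0.5.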


The primary goal of fairness testing is to assess the performance of an NLP model without bias, particularly in contexts with implications for specific groups. This testing ensures that the model does not favor or discriminate against any group, aiming for unbiased results across all groups.

How it works:

  • We categorize the input data into three groups: “male,” “female,” and “unknown.”
  • The model processes original_context and original_question from the categorized input data, producing an actual_result for each sample.
  • The expected_result in fairness is the ground truth that we get from the dataset.
  • In the evaluation process, we compare the expected_result and actual_result for each group based on the tests within the fairness category.
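The grouping-and-scoring flow can be sketched as follows. The gender detection here is a toy keyword rule introduced purely for illustration, not the library's actual classifier, and `score_fn` stands in for whichever metric the fairness test applies.

```python
# Toy sketch of the fairness flow: bucket samples into "male",
# "female", or "unknown", then compute a per-group average score.
from collections import defaultdict

MALE = {"he", "him", "his", "mr"}
FEMALE = {"she", "her", "hers", "ms", "mrs"}

def gender_group(text: str) -> str:
    """Toy keyword-based grouping rule (illustrative only)."""
    words = set(text.lower().split())
    if words & MALE and not words & FEMALE:
        return "male"
    if words & FEMALE and not words & MALE:
        return "female"
    return "unknown"

def per_group_scores(samples, score_fn):
    """samples: (original_context, expected_result, actual_result)."""
    groups = defaultdict(list)
    for context, expected, actual in samples:
        groups[gender_group(context)].append(score_fn(expected, actual))
    return {g: sum(s) / len(s) for g, s in groups.items()}

scores = per_group_scores(
    [("mr sodje denied the charges", "a", "a"),
     ("she founded the charity", "b", "c")],
    lambda e, a: 1.0 if e == a else 0.0,
)
```

A fairness test then checks that the scores across groups stay within an acceptable range of each other.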


The goal of representation testing is to assess whether a dataset accurately represents a specific population or if biases within it could adversely affect the results of any analysis.

How it works:

  • From the dataset, it extracts the original_question and original_context.
  • Subsequently, a classifier or dictionary is used to determine the representation proportion and representation count for the applied test. This includes counting gender names, ethnicity names, religion names, or country names according to the applied test criteria.
  • Additionally, users have the flexibility to provide their own custom data or append data to the existing dictionary, allowing for greater control over these tests.
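The dictionary-based count and proportion computation can be sketched as below. The name dictionary here is a tiny illustrative stand-in; the real tests ship their own dictionaries and, as noted above, accept user-supplied custom data.

```python
# Toy sketch of representation counting: tally dictionary hits per
# group across the dataset, then derive proportions.
from collections import Counter

GENDER_NAMES = {               # toy dictionary, user-extensible
    "male": {"john", "sam", "stephen"},
    "female": {"mary", "susan"},
}

def representation(texts, name_dict):
    """Return (representation_count, representation_proportion)."""
    counts = Counter()
    for text in texts:
        tokens = set(text.lower().split())
        for group, names in name_dict.items():
            counts[group] += len(tokens & names)
    total = sum(counts.values()) or 1   # avoid division by zero
    proportions = {g: c / total for g, c in counts.items()}
    return dict(counts), proportions

counts, props = representation(
    ["sam and stephen appeared in court", "mary raised money"],
    GENDER_NAMES,
)
```

Here two male names and one female name are found, so the male representation proportion is 2/3 and the female proportion is 1/3.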