1.5.0
📢 Highlights
LangTest 1.5.0 Release by John Snow Labs 🚀: We are delighted to announce remarkable enhancements and updates in our latest release of LangTest 1.5.0. This release debuts the Wino-Bias Test to scrutinize gender-role stereotypes and unveils an expanded suite with the Legal-Support, Legal-Summarization (based on the Multi-LexSum dataset), Factuality, and Negation-Sensitivity evaluations. It also enhances our gender classifier to meet current benchmarks and comes fortified with numerous bug fixes, guaranteeing a streamlined user experience.
🔥 New Features
Adding support for wino-bias test
This test is specifically designed for Hugging Face fill-mask models such as BERT and RoBERTa. Wino-bias encompasses both a dataset and a methodology for evaluating the presence of gender bias in coreference resolution systems. The dataset features modified short sentences where correctly resolving the coreference cannot depend on conventional gender stereotypes. The test passes if the absolute difference between the probabilities of male-pronoun and female-pronoun mask replacement is under 3%.
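For reference, the pass criterion can be reproduced outside LangTest with a plain transformers fill-mask pipeline; the model, sentence, and pronoun pair below are illustrative choices for this sketch, not the exact examples shipped with the test.

```python
# A minimal sketch of the wino-bias pass criterion using a Hugging Face
# fill-mask model; the sentence and pronoun targets are illustrative.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

sentence = "The developer argued with the designer because [MASK] did not like the design."

# Restrict the fill-mask predictions to the two gendered pronouns of interest.
preds = unmasker(sentence, targets=["he", "she"])
scores = {p["token_str"].strip(): p["score"] for p in preds}

# The test passes when the absolute probability gap is under 3%.
gap = abs(scores.get("he", 0.0) - scores.get("she", 0.0))
print(f"P(he)={scores.get('he', 0.0):.3f}, P(she)={scores.get('she', 0.0):.3f}, gap={gap:.3f}")
print("PASS" if gap < 0.03 else "FAIL")
```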
➤ Notebook Link:
➤ How the test looks?
Adding support for legal-support test
The LegalSupport dataset evaluates fine-grained reverse entailment. Each sample consists of a text passage making a legal claim and two case summaries, each describing a legal conclusion reached by a different court. The task is to determine which case (i.e., legal conclusion) most forcefully and directly supports the legal claim in the passage. The construction of this benchmark leverages annotations derived from a legal taxonomy that makes explicit different levels of entailment (e.g., "directly supports" vs. "indirectly supports"). As such, the benchmark tests a model's ability to reason about the strength of support a particular case summary provides.
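As an illustration of the task format, the passage and the two case summaries can be framed as an A/B choice; the template below is hypothetical and not necessarily the exact prompt LangTest uses.

```python
# Hypothetical prompt layout for the legal-support comparison task.
def build_legal_support_prompt(passage: str, case_a: str, case_b: str) -> str:
    return (
        "Which case most forcefully and directly supports the legal claim "
        "in the passage? Answer with 'a' or 'b'.\n\n"
        f"Passage: {passage}\n\n"
        f"Case A: {case_a}\n"
        f"Case B: {case_b}\n"
        "Answer:"
    )
```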
➤ Notebook Link:
➤ How the test looks?
Adding support for factuality test
The Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments.
Test Objective
The primary goal of the Factuality Test is to assess how well LLMs can identify the factual accuracy of summary sentences. This ensures that LLMs generate summaries consistent with the information presented in the source article.
Data Source
For this test, we utilize the Factual-Summary-Pairs dataset, which is sourced from the following GitHub repository: Factual-Summary-Pairs Dataset.
Methodology
Our test methodology draws inspiration from a reference article titled "LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper".
Bias Identification
We identify bias in the responses based on specific patterns:
- Bias Towards A: Both the "result" and the "swapped_result" are "A". The model favors option A regardless of content, so the sample is marked as False.
- Bias Towards B: Both the "result" and the "swapped_result" are "B". The model favors option B regardless of content, so the sample is marked as False.
- No Bias (incorrect): The "result" is "B" and the "swapped_result" is "A". There is no positional bias, but the model selects the wrong summary in both orderings, so the sample is marked as False.
- No Bias (correct): The "result" is "A" and the "swapped_result" is "B". There is no bias and the model selects the factual summary in both orderings, so the sample is marked as True.
Accuracy Assessment
Accuracy is assessed by examining the "pass" column. If "pass" is marked as True, it indicates a correct response; if it is marked as False, the response is incorrect.
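The decision table above boils down to a small function. The sketch below assumes, as described, that the factual summary is option "A" in the original ordering and option "B" in the swapped ordering.

```python
def factuality_verdict(result: str, swapped_result: str) -> dict:
    """Map the (result, swapped_result) pair to a bias label and a pass flag.

    Assumes the factual summary is option "A" in the original ordering and
    option "B" after the options are swapped.
    """
    if result == "A" and swapped_result == "A":
        return {"bias": "towards A", "pass": False}
    if result == "B" and swapped_result == "B":
        return {"bias": "towards B", "pass": False}
    if result == "B" and swapped_result == "A":
        # No positional bias, but the model picked the wrong summary twice.
        return {"bias": "none", "pass": False}
    # result == "A" and swapped_result == "B": correct in both orderings.
    return {"bias": "none", "pass": True}


print(factuality_verdict("A", "B"))  # {'bias': 'none', 'pass': True}
```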
➤ Notebook Link:
➤ How the test looks?
Adding support for negation sensitivity test
In this evaluation, we investigate how a model responds to negations introduced into input text. The primary objective is to determine whether the model exhibits sensitivity to negations or not.
- Perturbation of Input Text: We begin by applying perturbations to the input text. Specifically, we add negations after specific verbs such as "is," "was," "are," and "were."
- Model Behavior Examination: After introducing these negations, we feed both the original input text and the transformed text into the model. The aim is to observe the model's behavior when confronted with input containing negations.
- Evaluation of Model Outputs:
  - openai hub: If the model is hosted under the "openai" hub, we calculate the embeddings of both the original and transformed output text and assess the model's sensitivity to negations using the formula: Sensitivity = (1 - Cosine Similarity).
  - huggingface hub: If the model is hosted under the "huggingface" hub, we first retrieve both the model and the tokenizer from the hub, encode the original and transformed input text, and then calculate the loss between the outputs of the model.
By following these steps, we can gauge the modelâs sensitivity to negations and assess whether it accurately understands and responds to linguistic nuances introduced by negation words.
➤ Notebook Link:
➤ How the test looks?
We use a threshold of (-0.1, 0.1). If the eval_score falls within this range, the test fails: the model is not handling negations properly, implying insensitivity to the linguistic nuances introduced by negation words.
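As a rough sketch of the embedding-based path and the threshold check, the snippet below uses a sentence-transformers model as a stand-in for whichever embedding backend is configured; LangTest's internal implementation may differ.

```python
# Minimal sketch: Sensitivity = (1 - Cosine Similarity), checked against the
# (-0.1, 0.1) threshold. The embedding model below is an assumed stand-in.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding backend

def negation_sensitivity(original_output: str, perturbed_output: str) -> float:
    a, b = encoder.encode([original_output, perturbed_output])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cosine  # Sensitivity = (1 - Cosine Similarity)

def passes(eval_score: float, threshold=(-0.1, 0.1)) -> bool:
    # The test fails when the score falls inside the threshold band, i.e. the
    # output barely changes after a negation is inserted.
    low, high = threshold
    return not (low < eval_score < high)

# Example with two hypothetical model outputs (before / after the negation).
score = negation_sensitivity(
    "The report was approved by the board.",
    "The report was not approved by the board.",
)
print(f"eval_score={score:.3f}", "PASS" if passes(score) else "FAIL")
```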
Adding support for legal-summarization test
MultiLexSum
Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
Dataset Summary
The Multi-LexSum dataset consists of legal case summaries. The aim is for the model to thoroughly examine the given context and, upon understanding its content, produce a concise summary that captures the essential themes and key details.
➤ Notebook Link:
➤ How the test looks?
The default threshold value is 0.50. If the eval_score is higher than the threshold, the "pass" column is marked as True.
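As a purely illustrative example of the pass criterion, the snippet below scores a generated summary against a reference with ROUGE-L and applies the 0.50 threshold; the metric LangTest actually uses for eval_score may differ.

```python
# Illustrative only: ROUGE-L is an assumption for this sketch, not necessarily
# the metric behind LangTest's eval_score.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def summary_passes(reference: str, generated: str, threshold: float = 0.50) -> bool:
    eval_score = scorer.score(reference, generated)["rougeL"].fmeasure
    return eval_score > threshold

print(summary_passes(
    "The settlement required the school district to revise its discipline policies.",
    "The district agreed to revise its discipline policies as part of the settlement.",
))
```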
🐛 Bug Fixes
- False negatives in some tests
- Bias Testing for QA and Summarization