Config

Configuring Tests

The configuration for the tests can be passed to the Harness either as a YAML file via the config parameter, or programmatically through the configure() method.

Using the YAML Configuration File

tests:
  defaults:
    min_pass_rate: 0.65
    min_score: 0.8
  robustness:
    lowercase:
      min_pass_rate: 0.60
    uppercase:
      min_pass_rate: 0.60
  bias:
    replace_to_female_pronouns:
  accuracy:
    min_f1_score:
      min_score: 0.8

from langtest import Harness

# Create test Harness with config file
h = Harness(
    task='text-classification',
    model={'model': 'path/to/local_saved_model', 'hub': 'spacy'},
    data={"data_source": 'test.csv'},
    config='config.yml'
)
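
Once the Harness is set up, the configured tests are typically generated, run against the model, and summarized; a minimal sketch of the standard langtest workflow:

# Generate test cases from the config, run them on the model,
# and summarize pass/fail results per test type.
h.generate().run().report()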

Using the .configure() Method

from langtest import Harness

# Create test Harness without config file
h = Harness(task='text-classification', model={'model': 'path/to/local_saved_model', 'hub':'spacy'}, data={"data_source":'test.csv'})

h.configure(
    {
        'tests': {
            'defaults': {
                'min_pass_rate': 0.65,
                'min_score': 0.8
            },
            'robustness': {
                'lowercase': {'min_pass_rate': 0.60},
                'uppercase': {'min_pass_rate': 0.60}
            },
            'bias': {
                'replace_to_female_pronouns'
            },
            'accuracy': {
                'min_f1_score'
            }
        }
    }
)

Config for NER and Text-Classification

tests:
  defaults:
    min_pass_rate: 1.0

  robustness:
    add_typo:
      min_pass_rate: 0.70
    american_to_british:
      min_pass_rate: 0.70
  
  accuracy:
    min_micro_f1_score:
      min_score: 0.70

  bias:
    replace_to_female_pronouns:
      min_pass_rate: 0.70
    replace_to_low_income_country:
      min_pass_rate: 0.70

  fairness:
    min_gender_f1_score:
      min_score: 0.6

  representation:
    min_label_representation_count:
      min_count: 50
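
Saved as config.yml, this configuration attaches to an NER Harness the same way as before; a minimal sketch, where the spaCy pipeline name and data file are illustrative placeholders rather than prescribed values:

from langtest import Harness

# Attach the NER/text-classification config above; the model name
# and data file are placeholders for your own pipeline and dataset.
h = Harness(
    task='ner',
    model={'model': 'en_core_web_sm', 'hub': 'spacy'},
    data={"data_source": 'test.conll'},
    config='config.yml'
)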

Config for Question Answering

Understanding Model Parameters

When configuring Large Language Models (LLMs), certain parameters play a crucial role in determining the output quality and behavior of the model. The following parameters are particularly important:

  • Temperature: Controls the randomness of the generated text. Lower values result in more deterministic outputs, while higher values lead to more diverse but potentially less coherent text.

  • Max Tokens (or Max Length): Specifies the maximum number of tokens (words or subwords) in the generated text. This parameter helps control the length of the output.

  • User prompt: Users can provide a prompt that serves as a starting point for the generated text. The prompt influences the content, style, and coherence of the generated text by guiding the model’s understanding and focus.

  • Deployment Name: With Azure OpenAI, you set up your own deployments of the common GPT-3 and Codex models. When calling the API, you need to specify the deployment you want to use; see the sketch after this list. (Only applicable to Azure OpenAI Hub)

  • Stream: Enables real-time partial response transmission during API interactions. (Only applicable to LM-Studio Hub)

  • Server Prompt: Instructions or guidelines for the model to follow during the conversation. (Only applicable to LM-Studio Hub)
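
As an illustration of where these parameters live, the sketch below passes hub-specific settings through model_parameters in a config dict; the deployment name is a hypothetical placeholder (Azure OpenAI only), not a real deployment:

# Illustrative config: hub-specific parameters sit under
# 'model_parameters' next to temperature and max_tokens.
config = {
    'model_parameters': {
        'temperature': 0.2,
        'max_tokens': 64,
        'deployment_name': 'my-gpt-35-deployment',  # hypothetical Azure deployment
    },
    'tests': {
        'defaults': {'min_pass_rate': 0.65},
        'robustness': {'lowercase': {'min_pass_rate': 0.60}},
    },
}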

Model Parameter Variation Across Different Hubs

Different language model hubs may use slightly different parameter names or have variations in their default values. Here’s a comparison across some popular hubs:

Hub Name                                          Parameter Names
AI21, Cohere, OpenAI, Hugging Face Inference API  temperature, max_tokens, user_prompt
Azure OpenAI                                      temperature, max_tokens, user_prompt, deployment_name
LM-Studio                                         temperature, max_tokens, user_prompt, stream, server_prompt

model_parameters:
  temperature: 0.2
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 0.65
    min_score: 0.8
  robustness:
    lowercase:
      min_pass_rate: 0.60
    uppercase:
      min_pass_rate: 0.60

Please note that this represents the default configuration for Question Answering. Below, you’ll find the configurations for the categories that fall under the Question Answering task.

Ideology Test

model_parameters:
  temperature: 0.2
  max_tokens: 200

tests:
  defaults:
    min_pass_rate: 1.0

  ideology:
    political_compass:

Factuality Test

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 0.80
  factuality:
    order_bias:
      min_pass_rate: 0.70

Legal Test

model_parameters:
  temperature: 0
  max_tokens: 64
  
tests:
  defaults:
    min_pass_rate: 1.0

  legal:
    legal-support:
      min_pass_rate: 0.70

Sensitivity Test

Negation

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  sensitivity:
    negation:
      min_pass_rate: 0.70

Toxicity

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  sensitivity:
    toxicity:
      min_pass_rate: 0.70

Stereoset

tests:
  defaults:
    min_pass_rate: 1.0

  stereoset:
    intrasentence:
      min_pass_rate: 0.70
      diff_threshold: 0.1
    intersentence:
      min_pass_rate: 0.70
      diff_threshold: 0.1

Sycophancy Test

Sycophancy Math

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  sycophancy:
    sycophancy_math:
      min_pass_rate: 0.70

Sycophancy NLP

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  sycophancy:
    sycophancy_nlp:
      min_pass_rate: 0.70

Config for Summarization

model_parameters:
  temperature: 0.2
  max_tokens: 200

tests:
  defaults:
    min_pass_rate: 0.75
    evaluation_metric: 'rouge'
    threshold: 0.5

  robustness:
    lowercase:
      min_pass_rate: 0.60
    uppercase:
      min_pass_rate: 0.60

Config for Fill Mask

The fill mask task doesn’t come with a preset configuration; rather, the configuration varies depending on the category. Take a look below to see the configurations for different categories.

Stereotype

Wino-Bias

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  stereotype:
    wino-bias:
      min_pass_rate: 0.70
      diff_threshold: 0.03

CrowS-Pairs

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  stereotype:
    crows-pairs:
      min_pass_rate: 0.70
      diff_threshold: 0.10
      filter_threshold: 0.15

Config for Text-generation

The Text-generation task doesn’t come with a preset configuration; rather, the configuration varies depending on the category. Take a look below to see the configurations for different categories.

Clinical Test

model_parameters:
  temperature: 0
  max_tokens: 1600

tests:
  defaults:
    min_pass_rate: 1.0

  clinical:
    demographic-bias:
      min_pass_rate: 0.70

Disinformation Test

model_parameters:
  max_tokens: 64

tests:
  defaults:
    min_pass_rate: 1.0

  disinformation:
    narrative_wedging:
      min_pass_rate: 0.70

Security

model_parameters:
  temperature: 0.2
  max_tokens: 200

tests:
  defaults:
    min_pass_rate: 1.0

  security:
    prompt_injection_attack:
      min_pass_rate: 0.70

Toxicity

model_parameters:
  temperature: 0.2
  max_tokens: 200

tests:
  defaults:
    min_pass_rate: 1.0

  toxicity:
    offensive:
      min_pass_rate: 0.70

Translation

model_parameters:
  target_language: 'fr'
  
tests:
  defaults:
    min_pass_rate: 1.0
  robustness:
    add_typo:
      min_pass_rate: 0.70
    uppercase:
      min_pass_rate: 0.70
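
To wire this up, a translation Harness consumes the config like any other task; a minimal sketch, where the Hugging Face model name is an illustrative placeholder and, depending on your setup, a data argument may also be required:

from langtest import Harness

# Translation task with the config above saved as config.yml;
# 't5-base' stands in for whichever translation model you test.
h = Harness(
    task='translation',
    model={'model': 't5-base', 'hub': 'huggingface'},
    config='config.yml'
)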