langtest.langtest.Harness#

class Harness(task: str | dict, model: list | dict | None = None, data: list | dict | None = None, config: str | dict | None = None, benchmarking: dict | None = None)#

Bases: object

Harness is a testing class for NLP models.

The Harness class evaluates the performance of a given NLP model: the supplied test data is used to test the model, and a report is generated with the test results.
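
A minimal end-to-end usage sketch (the model name, hub, and the data dictionary key shown here are illustrative assumptions, not the only supported values):

>>> from langtest import Harness
>>> # Evaluate a spaCy NER pipeline against a CoNLL test file (paths are placeholders)
>>> harness = Harness(
...     task="ner",
...     model={"model": "en_core_web_sm", "hub": "spacy"},
...     data={"data_source": "path/to/test.conll"},
... )
>>> harness.generate()   # build the test cases
>>> harness.run()        # run the model on the generated test cases
>>> harness.report()     # summarize pass rates per test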

__init__(task: str | dict, model: list | dict | None = None, data: list | dict | None = None, config: str | dict | None = None, benchmarking: dict | None = None)#

Initialize the Harness object.

Parameters:
  • task (str | dict) – Task for which the model is to be evaluated.

  • model (list | dict, optional) – Specifies the model to be evaluated. If provided as a list, each element should be a dictionary with ‘model’ and ‘hub’ keys. If provided as a dictionary, it must contain ‘model’ and ‘hub’ keys when specifying a path.

  • data (list | dict, optional) – The data to be used for evaluation.

  • config (str | dict, optional) – Configuration for the tests to be performed.

Raises:

ValueError – Invalid arguments.

Methods

__init__(task[, model, data, config, ...])

Initialize the Harness object.

augment(training_data, save_data_path[, ...])

Augments the data provided in training_data and saves the result to save_data_path.

available_tests([test_type])

Returns a dictionary of available tests categorized by test type.

configure(config)

Configure the Harness with a given configuration.

edit_testcases(output_path, **kwargs)

Testcases are exported to a csv file to be edited.

generate([seed])

Generate the testcases to be used when evaluating the model.

generated_results()

Generates an overall report with every test case and label-wise metrics.

get_leaderboard([indices, columns, ...])

Get the rank of the model on the leaderboard.

import_edited_testcases(input_path, **kwargs)

Testcases are imported from a csv file.

load(save_dir, task[, model, ...])

Loads a previously saved Harness from a given configuration and dataset.

load_checkpoints(task, model, ...)

Load checkpoints and other necessary data to recreate a Harness object.

model_response([category])

Retrieves the model response for a specific category.

pass_custom_data(file_path[, test_name, ...])

Load custom data from a JSON file and store it in a class variable.

report([format, save_dir, mlflow_tracking])

Generate a report of the test results.

run([checkpoint, batch_size, ...])

Run the tests on the model using the generated test cases.

save(save_dir[, include_generated_results])

Save the configuration, generated testcases and the DataFactory to be reused later.

testcases()

Testcases after .generate() is called.

upload_file_to_hub(repo_type, file_path, token)

Uploads a file or a Dataset to the Hugging Face Model Hub.

upload_folder_to_hub(repo_type, folder_path, ...)

Uploads a folder containing a model or dataset to the Hugging Face Model Hub or Dataset Hub.

Attributes

DEFAULTS_CONFIG

DEFAULTS_DATASET

SUPPORTED_HUBS

SUPPORTED_HUBS_HF_DATASET_CLASSIFICATION

SUPPORTED_HUBS_HF_DATASET_LLM

SUPPORTED_HUBS_HF_DATASET_NER

SUPPORTED_TASKS

augment(training_data: dict, save_data_path: str, custom_proportions: List | Dict | None = None, export_mode: str = 'add', templates: str | List[str] | None = None, append_original: bool = False, generate_templates: bool = False, show_templates: bool = False) Harness#

Augments the data provided in training_data and saves the result to save_data_path.

Parameters:
  • training_data (dict) – A dictionary containing the input data for augmentation.

  • save_data_path (str) – Path to save the augmented data.

  • custom_proportions (Union[Dict, List], optional) – Custom proportions of the augmentations to apply, given either as a dictionary mapping test names to proportions or as a list of test names. Defaults to None.

  • export_mode (str, optional) – Determines how the samples are modified or exported: ‘inplace’ modifies the list of samples in place; ‘add’ adds new samples to the input data; ‘transformed’ exports only the transformed data, excluding untransformed samples. Defaults to ‘add’.

  • templates (Optional[Union[str, List[str]]]) – Template(s) to use when generating augmented samples, given as a single template string or a list of templates. Defaults to None.

  • append_original (bool, optional) – If set to True, appends the original data to the augmented data. Defaults to False.

  • generate_templates (bool, optional) – If set to True, generates sample templates from the given ones.

  • show_templates (bool, optional) – If set to True, displays the used templates.

Returns:

The instance of the class calling this method.

Return type:

Harness

Raises:

ValueError – If the pass_rate or minimum_pass_rate columns have an unexpected data type.

Note

This method uses an instance of AugmentRobustness to perform the augmentation.
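
A hedged call sketch (the data-source key, file paths, and the test names used for custom_proportions are placeholders):

>>> harness.augment(
...     training_data={"data_source": "train.conll"},            # data to perturb
...     save_data_path="augmented_train.conll",                  # where to write the result
...     custom_proportions={"uppercase": 0.3, "add_typo": 0.2},  # assumed test names
...     export_mode="add",                                       # keep originals, add new samples
... )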

static available_tests(test_type: str | None = None) Dict[str, List[str]]#

Returns a dictionary of available tests categorized by test type.

Parameters:

test_type (str, optional) – The specific test type to retrieve. Defaults to None.

Returns:

A dictionary containing the available tests for the specified test type; if no test type is given, all available tests are returned.

Return type:

dict

Raises:

ValueError – If an invalid test type is provided.
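
For example (the “robustness” test type is one illustrative value):

>>> from langtest import Harness
>>> Harness.available_tests()                        # all tests, grouped by test type
>>> Harness.available_tests(test_type="robustness")  # tests for a single type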

configure(config: str | dict) dict#

Configure the Harness with a given configuration.

Parameters:

config (str | dict) – Configuration file path or dictionary for the tests to be performed.

Returns:

Loaded configuration.

Return type:

dict
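
A configuration sketch; the test names and minimum pass rates below are illustrative assumptions:

>>> harness.configure({
...     "tests": {
...         "defaults": {"min_pass_rate": 0.65},
...         "robustness": {
...             "uppercase": {"min_pass_rate": 0.66},
...             "add_typo": {"min_pass_rate": 0.60},
...         },
...     }
... })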

edit_testcases(output_path: str, **kwargs)#

Testcases are exported to a csv file to be edited.

The edited file can be imported back into the harness.

Parameters:

output_path (str) – path to save the testcases to
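
For example (the file name is a placeholder):

>>> harness.edit_testcases("testcases_to_edit.csv")  # export test cases for manual review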

generate(seed: int | None = None) Harness#

Generate the testcases to be used when evaluating the model.

The generated testcases are stored in the _testcases attribute.
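
For example, fixing the seed to make the generated test cases reproducible:

>>> harness.generate(seed=42)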

generated_results() DataFrame | None#

Generates an overall report with every test case and label-wise metrics.

Returns:

Generated dataframe.

Return type:

pd.DataFrame
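
For example:

>>> results = harness.generated_results()  # per-test-case results as a DataFrame
>>> results.head()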

get_leaderboard(indices=[], columns=[], category=False, split_wise=False, test_wise=False, rank_by: str | list = 'Avg', *args, **kwargs)#

Get the rank of the model on the leaderboard.
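
A hedged call sketch based on the signature above (the interpretation of the flags is inferred from their names):

>>> harness.get_leaderboard(rank_by="Avg", category=True)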

import_edited_testcases(input_path: str, **kwargs)#

Testcases are imported from a csv file.

Parameters:

input_path (str) – location of the file to load
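
For example, loading the file exported by edit_testcases() after editing it:

>>> harness.import_edited_testcases("testcases_to_edit.csv")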

classmethod load(save_dir: str, task: str, model: list | dict | None = None, load_testcases: bool = False, load_model_response: bool = False) Harness#

Loads a previously saved Harness from a given configuration and dataset

Parameters:
  • save_dir (str) – path to the folder containing all the files needed to load a saved Harness

  • task (str) – task for which the model is to be evaluated.

  • model (Union[list, dict], optional) – Specifies the model to be evaluated. If provided as a list, each element should be a dictionary with ‘model’ and ‘hub’ keys. If provided as a dictionary, it must contain ‘model’ and ‘hub’ keys when specifying a path.

  • hub (str, optional) – model hub to load from the path. Required if path is passed as ‘model’.

Returns:

Harness loaded from a previous configuration, along with the new model to evaluate.

Return type:

Harness
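
A sketch of reloading a saved Harness (the directory, model name, and hub are placeholders):

>>> from langtest import Harness
>>> harness = Harness.load(
...     save_dir="saved_harness",
...     task="ner",
...     model={"model": "en_core_web_sm", "hub": "spacy"},
... )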

classmethod load_checkpoints(task, model, save_checkpoints_dir: str) Harness#

Load checkpoints and other necessary data to recreate a Harness object.

Parameters:
  • task – The task for which the model was tested.

  • model – The model or models used for testing.

  • save_checkpoints_dir (str) – Directory containing saved checkpoints and data.

Returns:

A Harness object reconstructed with loaded checkpoints and data.

Return type:

Harness

Raises:

OSError – Raised if necessary files (config.yaml, data.pkl) are missing in the checkpoint directory.
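
A sketch of resuming from checkpoints (the directory and model details are placeholders; continuing the run with run() afterwards is an assumption):

>>> from langtest import Harness
>>> harness = Harness.load_checkpoints(
...     task="ner",
...     model={"model": "en_core_web_sm", "hub": "spacy"},
...     save_checkpoints_dir="checkpoints",
... )
>>> harness.run(checkpoint=True, save_checkpoints_dir="checkpoints")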

model_response(category: str | None = None)#

Retrieves the model response for a specific category.

Parameters:

category (str) – The category for which the model response is requested. It should be one of the supported categories: “accuracy” or “fairness”.

Returns:

A DataFrame containing the model response data, with columns including ‘gender’, ‘original’, ‘original_question’, ‘original_context’, ‘options’, ‘expected_results’, and ‘actual_results’. If the model response is empty or None, returns an empty DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If the category is None or not one of the supported categories.
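
For example:

>>> harness.model_response(category="accuracy")  # or category="fairness"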

pass_custom_data(file_path: str, test_name: str | None = None, task: str | None = None, append: bool = False) None#

Load custom data from a JSON file and store it in a class variable.

Parameters:
  • file_path (str) – Path to the JSON file.

  • test_name (str, optional) – Name parameter. Defaults to None.

  • task (str, optional) – Task type. Either “bias” or “representation”. Defaults to None.

  • append (bool, optional) – Whether to append the data or overwrite it. Defaults to False.
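
A hedged sketch (the file name and test name are hypothetical placeholders):

>>> harness.pass_custom_data(
...     file_path="custom_bias_terms.json",  # hypothetical JSON file
...     test_name="custom_bias",             # hypothetical test name
...     task="bias",
...     append=False,
... )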

report(format: str = 'dataframe', save_dir: str | None = None, mlflow_tracking: bool = False) DataFrame#

Generate a report of the test results.

Parameters:
  • format (str) – format in which to save the report

  • save_dir (str) – name of the directory to save the file

  • mlflow_tracking (bool, optional) – If True, tracks the report results with MLflow. Defaults to False.

Returns:

DataFrame containing the results of the tests.

Return type:

pd.DataFrame
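
For example (‘dataframe’ is the documented default; the ‘excel’ format string is an assumption):

>>> report_df = harness.report()                         # default: pandas DataFrame
>>> harness.report(format="excel", save_dir="reports/")  # save the report to disk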

run(checkpoint: bool = False, batch_size=500, save_checkpoints_dir: str = 'checkpoints') Harness#

Run the tests on the model using the generated test cases.

Parameters:
  • checkpoint (bool) – If True, enable checkpointing to save intermediate results.

  • batch_size (int) – Batch size for dividing test cases into batches.

  • save_checkpoints_dir (str) – Directory to save checkpoints and intermediate results.

Returns:

The updated Harness object with test results stored in the generated_results attribute.

Return type:

Harness

Raises:

RuntimeError – Raised if test cases are not provided (None).
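
For example, running with checkpointing enabled (the values are illustrative):

>>> harness.run(
...     checkpoint=True,                     # save intermediate results
...     batch_size=100,                      # test cases per batch
...     save_checkpoints_dir="checkpoints",
... )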

save(save_dir: str, include_generated_results: bool = False) None#

Save the configuration, generated testcases and the DataFactory to be reused later.

Parameters:

save_dir (str) – path to folder to save the different files

Returns:

None
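
For example:

>>> harness.save("saved_harness", include_generated_results=True)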

testcases() DataFrame#

Testcases after .generate() is called.

Returns:

testcases formatted into a pd.DataFrame

Return type:

pd.DataFrame
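
For example:

>>> harness.testcases().head()  # inspect the generated test cases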

upload_file_to_hub(repo_type: str, file_path: str, token: str, exist_ok: bool = False, split: str = 'train')#

Uploads a file or a Dataset to the Hugging Face Model Hub.

Parameters:
  • repo_name (str) – The name of the repository in the format ‘username/repository’.

  • repo_type (str) – The type of the repository, e.g. ‘dataset’ or ‘model’.

  • file_path (str) – Path to the file to be uploaded.

  • token (str) – Hugging Face Hub authentication token.

  • exist_ok (bool, optional) – If True, do not raise an error if repo already exists.

  • split (str, optional) – The split of the dataset. Defaults to ‘train’.

Raises:
  • ValueError – Raised if a valid token is not provided.

  • ModuleNotFoundError – Raised if required packages are not installed.

Returns:

None
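
A hedged sketch following the signature above; note that the Parameters list also documents a repo_name argument that does not appear in the signature, so the exact call may differ:

>>> harness.upload_file_to_hub(
...     repo_type="dataset",
...     file_path="augmented_train.conll",  # placeholder path
...     token="hf_...",                     # Hugging Face access token (placeholder)
... )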

upload_folder_to_hub(repo_type: str, folder_path: str, token: str, model_type: str = 'huggingface', exist_ok: bool = False)#

Uploads a folder containing a model or dataset to the Hugging Face Model Hub or Dataset Hub.

This function facilitates the process of uploading a local folder containing a model or dataset to the Hugging Face Model Hub or Dataset Hub. It requires proper authentication through a valid token.

Parameters:
  • repo_name (str) – The name of the repository on the Hub.

  • repo_type (str) – The type of the repository, either “model” or “dataset”.

  • folder_path (str) – The local path to the folder containing the model or dataset files to be uploaded.

  • token (str) – The authentication token for accessing the Hugging Face Hub services.

  • model_type (str, optional) – The type of the model, currently supports “huggingface” and “spacy”. Defaults to “huggingface”.

  • exist_ok (bool, optional) – If True, do not raise an error if repo already exists.

Raises:
  • ValueError – If a valid token is not provided for Hugging Face Hub authentication.

  • ModuleNotFoundError – If required package is not installed. This package needs to be installed based on model_type (“huggingface” or “spacy”).
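
A hedged sketch (as above, the Parameters list mentions a repo_name argument not shown in the signature; all values are placeholders):

>>> harness.upload_folder_to_hub(
...     repo_type="model",
...     folder_path="path/to/model_folder",
...     token="hf_...",
...     model_type="huggingface",
... )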