Generating Augmentations

Proportional Augmentation

Proportional augmentation is driven by the results of robustness and bias tests. It improves data quality by applying the perturbations behind those tests to modify existing samples or generate new ones from a set of training data.

The .augment() function takes the following parameters:

training_data (dict): (Required) Specifies the source of the original training data. It should be a dictionary containing the necessary information about the dataset.
save_data_path (str): (Required) Name of the file to store the augmented data. The augmented dataset will be saved in this file.
export_mode (str): (Optional) Specifies how the augmented data should be exported. The possible values are:
- inplace: Modifies the list of samples in place.
- add: Adds new samples to the input data.
- transformed: Exports only the transformed samples and excludes the untransformed ones.
custom_proportions (dict): (Optional) A dictionary with the test type as key and the proportion as value. The proportion is the share of the test cases (for example, 0.3 for 30%) that will be augmented with the given test type. By default, the proportion is calculated from the pass_rate and minimum_pass_rate in the report for the provided model.
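
To make the three export_mode values concrete, here is a minimal sketch on a plain list of strings. It is an illustration only, not the library's implementation: the helper apply_export_mode, the random selection of samples, and the use of str.lower as the transformation are all assumptions made for the example.

```python
import random


def apply_export_mode(samples, transform, proportion, export_mode="add", seed=0):
    """Toy illustration of the three export modes (hypothetical helper)."""
    rng = random.Random(seed)
    # Pick a share of the samples to transform, as a proportion would.
    n_selected = int(len(samples) * proportion)
    selected = sorted(rng.sample(range(len(samples)), n_selected))
    transformed = [transform(samples[i]) for i in selected]

    if export_mode == "inplace":
        # Overwrite the selected samples inside the original list.
        for i, new_text in zip(selected, transformed):
            samples[i] = new_text
        return samples
    if export_mode == "add":
        # Keep every original sample and append the transformed copies.
        return samples + transformed
    if export_mode == "transformed":
        # Return only the transformed samples.
        return transformed
    raise ValueError(f"unknown export_mode: {export_mode!r}")


data = [
    "Alice lives in Paris",
    "Bob works at Acme",
    "Carol visited Rome",
    "Dan joined IBM",
]

# Each call works on a copy so that `data` itself stays unchanged.
added = apply_export_mode(list(data), str.lower, 0.5, "add")
only_transformed = apply_export_mode(list(data), str.lower, 0.5, "transformed")
in_place = apply_export_mode(list(data), str.lower, 0.5, "inplace")

print(len(added), len(only_transformed), len(in_place))
```

With four samples and a proportion of 0.5, "add" yields six samples, "transformed" yields two, and "inplace" keeps four samples of which two are rewritten.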

Proportion Increase Rates

The following table shows how the proportion increase rate is assigned from “x”, the value obtained by comparing the pass rate with the minimum pass rate.

Range	Assigned Value	Interpretation
x ≥ 1	Undefined	Not applicable
0.9 ≤ x < 1	0.05	Moderate increase
0.8 ≤ x < 0.9	0.1	Relatively higher increase
0.7 ≤ x < 0.8	0.2	Notable increase
x < 0.7	0.3	Default increase rate
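
The table can be read as a simple lookup. The sketch below is one way to express it, under the assumption that x is the ratio of the pass rate to the minimum pass rate; the text does not spell out that definition, and the function name proportion_increase and the sample figures are hypothetical.

```python
def proportion_increase(x):
    """Map the comparison value x to the increase rate listed in the table."""
    if x >= 1:
        return None  # the table lists this range as undefined / not applicable
    if x >= 0.9:
        return 0.05
    if x >= 0.8:
        return 0.1
    if x >= 0.7:
        return 0.2
    return 0.3  # default increase rate


# Hypothetical report figures for a single test type.
pass_rate, minimum_pass_rate = 0.6, 0.8
x = pass_rate / minimum_pass_rate  # assumed definition of x
rate = proportion_increase(x)
print(rate)
```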

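# Share of test cases to augment for each test type
custom_proportions = {
    'add_typo': 0.3,
    'lowercase': 0.3
}

# Source of the original training data
data_kwargs = {
    "data_source": "conll03.conll",
}

# h is assumed to be an existing Harness instance
h.augment(
    training_data=data_kwargs,
    save_data_path="augmented_conll03.conll",
    custom_proportions=custom_proportions,
    export_mode="transformed",
)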

This method applies perturbations to the input data based on the recommendations from the Harness report. The augmented dataset can then be used to retrain the model and make it more robust than the previous version.

Passing a Hugging Face Dataset for Augmentation

For augmentation, specify a Hugging Face dataset as the input in the following way:

custom_proportions = {
    'add_ocr_typo': 0.3
}

data_kwargs = {
    "data_source": "glue",
    "subset": "sst2",
    "feature_column": "sentence",
    "target_column": "label",
    "split": "train",
    "source": "huggingface",
}

h.augment(
    training_data=data_kwargs,
    save_data_path="augmented_glue.csv",
    custom_proportions=custom_proportions,
    export_mode="add",
)

Templatic Augmentations

Templatic Augmentation is a technique that allows you to generate new training data by applying a set of predefined templates to the original training data. The templates are designed to introduce noise into the training data in a way that simulates real-world conditions.

Templatic augmentation is controlled by the templates that are applied to the training data. The augmentation itself is performed by the augment() method of the Harness class.

templates = [
    "The {ORG} company is located in {LOC}",
    "The {ORG} company is located in {LOC} and is owned by {PER}",
]
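
As a rough illustration of how placeholders such as {ORG} and {LOC} could be expanded into sentences, the sketch below fills each template with every combination of entity values. This is not the library's implementation: the helper fill_templates and the entity values are invented for the example.

```python
import re
from itertools import product


def fill_templates(templates, entities):
    """Expand each template with every combination of entity values (hypothetical helper)."""
    sentences = []
    for template in templates:
        # Collect the placeholder names in order of first appearance.
        slots = list(dict.fromkeys(re.findall(r"\{(\w+)\}", template)))
        for combo in product(*(entities[slot] for slot in slots)):
            sentences.append(template.format(**dict(zip(slots, combo))))
    return sentences


example_templates = ["The {ORG} company is located in {LOC}"]
# Invented entity values; real values would come from the training data.
entities = {"ORG": ["Acme", "Globex"], "LOC": ["Berlin", "Lagos"]}

sentences = fill_templates(example_templates, entities)
for sentence in sentences:
    print(sentence)
```

Two ORG values and two LOC values give four sentences for the single template.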

The .augment() function takes the following parameters:

training_data (dict): (Required) Specifies the source of the original training data. It should be a dictionary containing the necessary information about the dataset.
save_data_path (str): (Required) Name of the file to store the augmented data. The augmented dataset will be saved in this file.
templates (list): List of templates (strings) or a CoNLL file to be used for augmentation.
generate_templates (bool): If set to True, generates sample templates from the given ones.
show_templates (bool): If set to True, displays the templates used.

import openai

data_kwargs = {
    "data_source": "conll03.conll",
}

# Presumably needed because generate_templates=True
openai.api_key = "YOUR OPENAI KEY"

# harness is an existing Harness instance (named h in the earlier examples)
harness.augment(
    training_data=data_kwargs,
    save_data_path="augmented_conll03.conll",
    templates=["The {ORG} company is located in {LOC}"],
    generate_templates=True,
    show_templates=True,
)