langtest.pipelines.transformers.ner_pipeline.NEREnd2EndPipeline

class NEREnd2EndPipeline(use_cli=True)

Bases: FlowSpec

NER pipeline for Hugging Face models

It executes the following workflow in sequential order:

  • train a model on a given dataset

  • evaluate the model on a given test dataset

  • test the trained model on a set of tests

  • augment the training set based on the test outcomes

  • retrain the model on the freshly generated augmented training set

  • evaluate the retrained model on the test dataset

  • compare the performance of the two models

The pipeline can be triggered directly from the CLI with the following one-liner:

```bash
python3 langtest/pipelines/transformers_pipelines.py run \
    --model-name="bert-base-uncased" \
    --train-data=tner.csv \
    --eval-data=tner.csv \
    --training-args='{"per_device_train_batch_size": 4, "max_steps": 3}' \
    --feature-col="tokens" \
    --target-col="ner_tags"
```
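The `--training-args` value is a JSON string, which is easy to get wrong when quoting by hand. As a minimal sketch, assuming only the script path and flags shown in the command above, the same invocation can be assembled in Python so that the JSON and the shell quoting are generated rather than typed. The dataset file names are the placeholders from the original command.

```python
import json
import shlex

# Values copied from the CLI example; adjust them for your own data.
training_args = {"per_device_train_batch_size": 4, "max_steps": 3}

command = [
    "python3",
    "langtest/pipelines/transformers_pipelines.py",
    "run",
    "--model-name=bert-base-uncased",
    "--train-data=tner.csv",
    "--eval-data=tner.csv",
    # json.dumps always emits valid JSON (double-quoted keys, no trailing commas).
    f"--training-args={json.dumps(training_args)}",
    "--feature-col=tokens",
    "--target-col=ner_tags",
]

# shlex.join quotes each argument so the string can be pasted into a POSIX shell.
command_line = shlex.join(command)
print(command_line)
```

Passing the `command` list to `subprocess.run` would avoid shell quoting altogether. The sketch only builds the command and does not execute it.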

__init__(use_cli=True)

Construct a FlowSpec

Parameters:

use_cli (bool, default: True) – Set to True if the flow is invoked from __main__ or the command line

Methods

__init__([use_cli])

Construct a FlowSpec

augment()

Performs the data augmentation procedure based on langtest

cmd(cmdline[, input, output])

[Legacy function - do not use]

compare()

Performs the comparison between the two trained models

end()

Ending step of the flow (required by Metaflow)

evaluate()

Performs the evaluation procedure on the given test set

foreach_stack()

Returns the current stack of foreach indexes and values for the current step.

merge_artifacts(inputs[, exclude, include])

Helper function for merging artifacts in a join step.

next(*dsts, **kwargs)

Indicates the next step to execute after this step has completed.

reevaluate()

Performs the evaluation procedure of the model trained on the augmented dataset

retrain()

Performs the training procedure using the augmented data created by langtest

setup()

Performs all the necessary set up steps

start()

Starting step of the flow (required by Metaflow)

test()

Performs the testing procedure of the model on a set of tests using langtest

train()

Performs the training procedure of the model

Attributes

config

eval_data

feature_col

index

The index of this foreach branch.

input

The value of the foreach artifact in this foreach branch.

model_name

script_name

[Legacy function - do not use. Use current instead]

target_col

train_data

training_args

augment()

Performs the data augmentation procedure based on langtest

cmd(cmdline, input={}, output=[])

[Legacy function - do not use]

compare()

Performs the comparison between the two trained models

end()

Ending step of the flow (required by Metaflow)

evaluate()

Performs the evaluation procedure on the given test set

foreach_stack() → List[Tuple[int, int, Any]] | None

Returns the current stack of foreach indexes and values for the current step.

Use this information to understand what data is being processed in the current foreach branch. For example, consider the following code:

```
@step
def root(self):
    self.split_1 = ['a', 'b', 'c']
    self.next(self.nest_1, foreach='split_1')

@step
def nest_1(self):
    self.split_2 = ['d', 'e', 'f', 'g']
    self.next(self.nest_2, foreach='split_2')

@step
def nest_2(self):
    foo = self.foreach_stack()
```

foo will take the following values in the various tasks for nest_2:

```
[(0, 3, 'a'), (0, 4, 'd')]
[(0, 3, 'a'), (1, 4, 'e')]
...
[(0, 3, 'a'), (3, 4, 'g')]
[(1, 3, 'b'), (0, 4, 'd')]
...
```

where each tuple corresponds to:

  • The index of the task for that level of the loop.

  • The number of splits for that level of the loop.

  • The value for that level of the loop.

Note that the last tuple returned in a task corresponds to:

  • 1st element: value returned by self.index.

  • 3rd element: value returned by self.input.

Returns:

An array describing the current stack of foreach steps.

Return type:

List[Tuple[int, int, object]]
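The shape of these stacks can be reproduced without running a flow. The following is a plain-Python sketch, not Metaflow code: the helper `simulate_foreach_stacks` is a hypothetical name introduced here, and it only enumerates the `(index, number of splits, value)` tuples that the innermost tasks of a nested foreach would see. The order in which real tasks execute is not implied by the order of the list.

```python
from itertools import product
from typing import Any, List, Tuple

Frame = Tuple[int, int, Any]  # (task index, number of splits, value)


def simulate_foreach_stacks(*levels: List[Any]) -> List[List[Frame]]:
    """Enumerate the stack each innermost task would see (illustration only)."""
    # Pair every element with its zero-based index, one list per foreach level.
    indexed = [list(enumerate(level)) for level in levels]
    stacks = []
    for combo in product(*indexed):
        stacks.append(
            [(idx, len(level), value) for (idx, value), level in zip(combo, levels)]
        )
    return stacks


stacks = simulate_foreach_stacks(["a", "b", "c"], ["d", "e", "f", "g"])
first = stacks[0]

# The last frame mirrors self.index (1st element) and self.input (3rd element).
index, _, value = first[-1]
print(first, index, value)
```

With three outer and four inner elements, the sketch yields twelve stacks, one per task of the innermost step.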

property index: int | None

The index of this foreach branch.

In a foreach step, multiple instances of this step (tasks) will be executed, one for each element in the foreach. This property returns the zero based index of the current task. If this is not a foreach step, this returns None.

If you need to know the indices of the parent tasks in a nested foreach, use FlowSpec.foreach_stack.

Returns:

Index of the task in a foreach step.

Return type:

int, optional

property input: Any | None

The value of the foreach artifact in this foreach branch.

In a foreach step, multiple instances of this step (tasks) will be executed, one for each element in the foreach. This property returns the element passed to the current task. If this is not a foreach step, this returns None.

If you need to know the values of the parent tasks in a nested foreach, use FlowSpec.foreach_stack.

Returns:

Input passed to the foreach task.

Return type:

object, optional

merge_artifacts(inputs: Inputs, exclude: List[str] | None = None, include: List[str] | None = None) → None

Helper function for merging artifacts in a join step.

This function takes all the artifacts coming from the branches of a join point and assigns them to self in the calling step. Only artifacts not set in the current step are considered. If, for a given artifact, different values are present on the incoming edges, an error will be thrown and the artifacts that conflict will be reported.

As an example, consider a simple graph in which A splits into B and C, which then join in D:

```
A:
    self.x = 5
    self.y = 6
B:
    self.b_var = 1
    self.x = from_b
C:
    self.x = from_c
D:
    merge_artifacts(inputs)
```

In D, the following artifacts are set:

  • y (value: 6), b_var (value: 1)

  • if from_b and from_c are the same, x will be accessible and have value from_b

  • if from_b and from_c are different, an error will be thrown. To prevent this error, you need to manually set self.x in D to a merged value (for example the max) prior to calling merge_artifacts.

Parameters:
  • inputs (Inputs) – Incoming steps to the join point.

  • exclude (List[str], optional) – If specified, do not consider merging artifacts with a name in exclude. Cannot specify if include is also specified.

  • include (List[str], optional) – If specified, only merge artifacts specified. Cannot specify if exclude is also specified.

Raises:
  • MetaflowException – This exception is thrown if this is not called in a join step.

  • UnhandledInMergeArtifactsException – This exception is thrown in case of unresolved conflicts.

  • MissingInMergeArtifactsException – This exception is thrown in case an artifact specified in include cannot be found.
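The merge rules described for `merge_artifacts` can be modelled with ordinary dictionaries. This is a simplified sketch and not Metaflow's implementation: the function `merge_branch_artifacts` is hypothetical, branches are represented as plain dicts, and `ValueError` and `KeyError` stand in for the Metaflow exceptions listed above.

```python
from typing import Any, Dict, List, Optional


def merge_branch_artifacts(
    current: Dict[str, Any],
    branches: List[Dict[str, Any]],
    exclude: Optional[List[str]] = None,
    include: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Toy model of the documented merge rules (not Metaflow's code)."""
    if exclude is not None and include is not None:
        raise ValueError("specify either exclude or include, not both")
    names = {name for branch in branches for name in branch}
    if include is not None:
        missing = [name for name in include if name not in names]
        if missing:
            raise KeyError(f"artifacts not found: {missing}")
    merged = dict(current)
    conflicts = []
    for name in sorted(names):
        if name in current:
            continue  # already set in the join step, so it is left untouched
        if include is not None and name not in include:
            continue
        if exclude is not None and name in exclude:
            continue
        values = [branch[name] for branch in branches if name in branch]
        if any(v != values[0] for v in values[1:]):
            conflicts.append(name)  # branches disagree on this artifact
        else:
            merged[name] = values[0]
    if conflicts:
        raise ValueError(f"unresolved conflicts: {conflicts}")
    return merged


# Artifacts visible at the end of B and C (y is inherited from A in both).
branch_b = {"x": 10, "y": 6, "b_var": 1}
branch_c = {"x": 20, "y": 6}

# x conflicts, so the join step sets it first (here with max), then merges.
resolved = {"x": max(branch_b["x"], branch_c["x"])}
merged = merge_branch_artifacts(resolved, [branch_b, branch_c])
print(merged)
```

Calling the sketch with an empty `current` dict and the same two branches raises, because `x` differs between B and C and nothing has resolved it.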

next(*dsts: Callable[[...], None], **kwargs) → None

Indicates the next step to execute after this step has completed.

This statement should appear as the last statement of each step, except the end step.

There are several valid formats to specify the next step:

  • Straight-line connection: self.next(self.next_step) where next_step is a method in the current class decorated with the @step decorator.

  • Static fan-out connection: self.next(self.step1, self.step2, …) where stepX are methods in the current class decorated with the @step decorator.

  • Foreach branch: ` self.next(self.foreach_step, foreach='foreach_iterator') ` In this situation, foreach_step is a method in the current class decorated with the @step decorator and foreach_iterator is a variable name in the current class that evaluates to an iterator. A task will be launched for each value in the iterator and each task will execute the code specified by the step foreach_step.

Parameters:

dsts (Method) – One or more methods annotated with @step.

Raises:

InvalidNextException – Raised if the format of the arguments does not match one of the ones given above.

reevaluate()

Performs the evaluation procedure of the model trained on the augmented dataset

retrain()

Performs the training procedure using the augmented data created by langtest

property script_name: str

[Legacy function - do not use. Use current instead]

Returns the name of the script containing the flow

Returns:

A string containing the name of the script

Return type:

str

setup()

Performs all the necessary set up steps

start()

Starting step of the flow (required by Metaflow)

test()

Performs the testing procedure of the model on a set of tests using langtest

train()

Performs the training procedure of the model