langtest.datahandler.datasource.CSVDataset

class CSVDataset(file_path: str | Dict, task: TaskManager, **kwargs)

Bases: BaseDataset

__init__(file_path: str | Dict, task: TaskManager, **kwargs) -> None

Initializes a CSVDataset object.

Parameters:
  • file_path (Union[str, Dict]) – The path to the data file, or a dictionary containing the following keys:
      - “data_source”: The path to the data file.
      - “feature_column” (optional): Specifies the column containing input features.
      - “target_column” (optional): Specifies the column containing target labels.

  • task (str) – Specifies the task of the dataset, which can be one of the following:
      - “text-classification”
      - “ner” (Named Entity Recognition)
      - “question-answering”
      - “summarization”

  • **kwargs – Additional keyword arguments that can be used to configure the dataset (optional).
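As a rough illustration of the two accepted `file_path` shapes, the sketch below normalizes either form into a single dictionary. `normalize_source` is a hypothetical helper written for this example and is not part of langtest; only the key names come from the parameter description above.

```python
from typing import Dict, Optional, Union


def normalize_source(file_path: Union[str, Dict]) -> Dict[str, Optional[str]]:
    """Hypothetical helper (not langtest API): unify the str and dict forms."""
    if isinstance(file_path, str):
        # A bare string is treated as the path to the data file.
        return {"data_source": file_path, "feature_column": None, "target_column": None}
    if "data_source" not in file_path:
        raise ValueError("dictionary form requires a 'data_source' key")
    return {
        "data_source": file_path["data_source"],
        # The column keys are optional, so fall back to None when absent.
        "feature_column": file_path.get("feature_column"),
        "target_column": file_path.get("target_column"),
    }


as_str = normalize_source("reviews.csv")
as_dict = normalize_source({"data_source": "reviews.csv", "feature_column": "body"})
```

Both calls yield the same three keys, so downstream code in such a sketch would not need to branch on the argument type again.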

Methods

__init__(file_path, task, **kwargs)

Initializes a CSVDataset object.

export_data(data, output_path)

Exports the data to the corresponding format and saves it to 'output_path'.

load_data()

Load data from a CSV file and preprocess it based on the specified task.

load_raw_data([standardize_columns])

Loads data from a CSV file into raw lists of strings.

Attributes

COLUMN_NAMES

A class to handle CSV file datasets.

data_sources

supported_tasks

COLUMN_NAMES = {
    'crows-pairs': {
        'mask1': ['mask1'],
        'mask2': ['mask2'],
        'sentence': ['sentence']
    },
    'ner': {
        'chunk': ['chunk_tags', 'chunk_tag'],
        'ner': ['label', 'labels ', 'class', 'classes', 'ner_tag', 'ner_tags', 'ner', 'entity'],
        'pos': ['pos_tags', 'pos_tag', 'pos', 'part_of_speech'],
        'text': ['text', 'sentences', 'sentence', 'sample', 'tokens']
    },
    'question-answering': {
        'answer': ['answer', 'answer_and_def_correct_predictions', 'ground_truth'],
        'context': ['context', 'passage', 'contract'],
        'options': ['options'],
        'text': ['question']
    },
    'summarization': {
        'summary': ['summary'],
        'text': ['text', 'document']
    },
    'text-classification': {
        'label': ['label', 'labels ', 'class', 'classes'],
        'text': ['text', 'sentences', 'sentence', 'sample']
    }
}

A class to handle CSV file datasets. Subclass of BaseDataset.
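The COLUMN_NAMES mapping pairs each canonical field with the header aliases it may appear under. One way such a table could be used is sketched below; `match_columns` is a hypothetical helper, and the case-insensitive comparison is an assumption of this example rather than documented langtest behavior.

```python
from typing import Dict, List

# Subset of the COLUMN_NAMES mapping shown above (text-classification only).
TEXT_CLASSIFICATION_ALIASES: Dict[str, List[str]] = {
    "label": ["label", "labels ", "class", "classes"],
    "text": ["text", "sentences", "sentence", "sample"],
}


def match_columns(header: List[str], aliases: Dict[str, List[str]]) -> Dict[str, str]:
    """Hypothetical helper: map each canonical name to the first matching header column."""
    matched: Dict[str, str] = {}
    for canonical, candidates in aliases.items():
        for column in header:
            # Assumption: comparison is case-insensitive; langtest may differ.
            if column.lower() in candidates:
                matched[canonical] = column
                break
    return matched


column_map = match_columns(["Sentence", "class", "id"], TEXT_CLASSIFICATION_ALIASES)
```

Here `column_map` links "text" to the "Sentence" column and "label" to the "class" column, while the unrelated "id" column is ignored.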

_file_path

The path to the data file or a dictionary containing “data_source” key with the path.

Type:

Union[str, Dict]

task

Specifies the task of the dataset, which can be one of “text-classification”, “ner”, “question-answering”, or “summarization”.

Type:

str

delimiter

The delimiter used in the CSV file to separate columns (applies only when file_path is given as a str).

Type:

str

export_data(data: List[Sample], output_path: str)

Exports the data to the corresponding format and saves it to ‘output_path’.

Parameters:
  • data (List[Sample]) – data to export

  • output_path (str) – path to save the data to

load_data() -> List[Sample]

Load data from a CSV file and preprocess it based on the specified task.

Returns:

A list of preprocessed data samples.

Return type:

List[Sample]

Raises:

ValueError – If the specified task is unsupported.

Note

  • If ‘is_import’ is set to True in the constructor’s keyword arguments, the data will be imported using the specified ‘file_path’ and optional ‘column_map’ for renaming columns.

  • If ‘is_import’ is set to False (default), the data will be loaded from a CSV file specified in ‘file_path’, and the ‘column_map’ will be automatically matched with the dataset columns.

  • The supported task types are: ‘text-classification’, ‘ner’, ‘summarization’, and ‘question-answering’. The appropriate task-specific loading function will be invoked to preprocess the data.
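The task dispatch described in the note, including the ValueError for unsupported tasks, can be pictured with a minimal sketch. `load_for_task` and its placeholder loaders are hypothetical and only illustrate the control flow; the actual task-specific loaders in langtest build Sample objects and are not reproduced here.

```python
from typing import Callable, Dict, List

SUPPORTED_TASKS = ("text-classification", "ner", "summarization", "question-answering")

Rows = List[Dict[str, str]]


def load_for_task(task: str, rows: Rows) -> Rows:
    """Hypothetical dispatcher: pick a task-specific loader or raise ValueError."""
    # Placeholder loaders that simply copy the rows; real preprocessing would differ.
    loaders: Dict[str, Callable[[Rows], Rows]] = {
        name: (lambda data: list(data)) for name in SUPPORTED_TASKS
    }
    if task not in loaders:
        raise ValueError(f"Unsupported task: {task!r}. Choose one of {SUPPORTED_TASKS}.")
    return loaders[task](rows)


ner_rows = load_for_task("ner", [{"text": "Paris is lovely", "ner": "B-LOC O O"}])
```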

load_raw_data(standardize_columns: bool = False) -> List[Dict]

Loads data from a CSV file into raw lists of strings.

Parameters:

standardize_columns (bool) – whether to standardize column names

Returns:

The CSV file parsed into a list of dicts.

Return type:

List[Dict]
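To make the List[Dict] return shape concrete, the following sketch parses CSV text into one dictionary per row using the standard library. `read_raw_rows` and its `rename` argument are hypothetical stand-ins for this example; they are not the langtest implementation, and how `standardize_columns` renames headers internally is not shown in this reference.

```python
import csv
import io
from typing import Dict, List, Optional


def read_raw_rows(handle, rename: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Hypothetical stand-in: parse CSV text into a list of dicts, optionally renaming columns."""
    rows = []
    for row in csv.DictReader(handle):
        if rename:
            # Rename only the columns listed in the mapping; keep the rest unchanged.
            row = {rename.get(key, key): value for key, value in row.items()}
        rows.append(dict(row))
    return rows


sample = io.StringIO("sentence,class\nGreat movie,positive\nToo long,negative\n")
raw_rows = read_raw_rows(sample, rename={"sentence": "text", "class": "label"})
```

Each element of `raw_rows` is a plain dictionary keyed by column name, with every value left as a string.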