langtest.datahandler.datasource.CSVDataset#
- class CSVDataset(file_path: str | Dict, task: TaskManager, **kwargs)#
Bases: BaseDataset
- __init__(file_path: str | Dict, task: TaskManager, **kwargs) None #
Initializes a CSVDataset object.
- Parameters:
file_path (Union[str, Dict]) – The path to the data file, or a dictionary containing the following keys:
  - “data_source”: The path to the data file.
  - “feature_column” (optional): Specifies the column containing input features.
  - “target_column” (optional): Specifies the column containing target labels.
task (TaskManager) – Specifies the task of the dataset, which can be one of the following: “text-classification”, “ner” (Named Entity Recognition), “question-answering”, or “summarization”.
**kwargs – Additional keyword arguments that can be used to configure the dataset (optional).
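Example
A minimal sketch of constructing the dataset directly. The TaskManager import path, its constructor signature, and the CSV file and column names below are assumptions for illustration; in typical use these objects are created for you by the langtest Harness.
from langtest.datahandler.datasource import CSVDataset
from langtest.tasks import TaskManager  # assumed import path

task = TaskManager("text-classification")  # assuming TaskManager accepts the task name
dataset = CSVDataset(
    file_path={
        "data_source": "reviews.csv",     # hypothetical file
        "feature_column": "review_text",  # optional: column with input features
        "target_column": "sentiment",     # optional: column with target labels
    },
    task=task,
)
samples = dataset.load_data()  # returns a List[Sample]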
Methods
- __init__(file_path, task, **kwargs): Initializes a CSVDataset object.
- export_data(data, output_path): Exports the data to the corresponding format and saves it to 'output_path'.
- load_data(): Load data from a CSV file and preprocess it based on the specified task.
- load_raw_data([standardize_columns]): Loads data from a CSV file into raw lists of strings.
Attributes
- data_sources
- supported_tasks
- COLUMN_NAMES = {'crows-pairs': {'mask1': ['mask1'], 'mask2': ['mask2'], 'sentence': ['sentence']}, 'ner': {'chunk': ['chunk_tags', 'chunk_tag'], 'ner': ['label', 'labels ', 'class', 'classes', 'ner_tag', 'ner_tags', 'ner', 'entity'], 'pos': ['pos_tags', 'pos_tag', 'pos', 'part_of_speech'], 'text': ['text', 'sentences', 'sentence', 'sample', 'tokens']}, 'question-answering': {'answer': ['answer', 'answer_and_def_correct_predictions', 'ground_truth'], 'context': ['context', 'passage', 'contract'], 'options': ['options'], 'text': ['question']}, 'summarization': {'summary': ['summary'], 'text': ['text', 'document']}, 'text-classification': {'label': ['label', 'labels ', 'class', 'classes'], 'text': ['text', 'sentences', 'sentence', 'sample']}}#
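This mapping drives automatic column matching: when a CSV is loaded, each standard field (for example “text” or “label”) is resolved to the first header that matches one of its aliases. The snippet below is an illustration of that idea only, not the library's internal implementation:
COLUMN_ALIASES = {
    "text-classification": {
        "text": ["text", "sentences", "sentence", "sample"],
        "label": ["label", "labels ", "class", "classes"],
    },
}

def match_columns(csv_header, task):
    """Map each standard field to the first CSV header matching one of its aliases."""
    lowered = [column.strip().lower() for column in csv_header]
    matched = {}
    for field, aliases in COLUMN_ALIASES[task].items():
        for alias in aliases:
            if alias.strip().lower() in lowered:
                matched[field] = csv_header[lowered.index(alias.strip().lower())]
                break
    return matched

print(match_columns(["Sentence", "Class"], "text-classification"))
# {'text': 'Sentence', 'label': 'Class'}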
A class to handle CSV file datasets. Subclass of BaseDataset.
- _file_path#
The path to the data file, or a dictionary containing a “data_source” key with the path.
- Type:
Union[str, Dict]
- task#
Specifies the task of the dataset, which can be one of “text-classification”, “ner”, “question-answering”, or “summarization”.
- Type:
str
- delimiter#
The delimiter used in the CSV file to separate columns (only applicable when file_path is given as a string).
- Type:
str
- export_data(data: List[Sample], output_path: str)#
Exports the data to the corresponding format and saves it to ‘output_path’.
- Parameters:
data (List[Sample]) – data to export
output_path (str) – path to save the data to
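A short usage sketch, assuming 'dataset' is the CSVDataset built in the constructor example above (the output file name is hypothetical):
samples = dataset.load_data()
dataset.export_data(samples, "exported_reviews.csv")  # writes the samples back out as CSV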
- load_data() List[Sample] #
Load data from a CSV file and preprocess it based on the specified task.
- Returns:
A list of preprocessed data samples.
- Return type:
List[Sample]
- Raises:
ValueError – If the specified task is unsupported.
Note
If ‘is_import’ is set to True in the constructor’s keyword arguments, the data will be imported from the specified ‘file_path’, with the optional ‘column_map’ used to rename columns.
If ‘is_import’ is set to False (default), the data will be loaded from the CSV file specified in ‘file_path’, and its columns will be automatically matched against the known column names.
The supported task types are ‘text-classification’, ‘ner’, ‘summarization’, and ‘question-answering’; the appropriate task-specific loading function will be invoked to preprocess the data.
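A sketch of the two loading modes described in the note above. Passing 'is_import' and 'column_map' as constructor keyword arguments, and the column mapping itself, are assumptions for illustration:
# Default mode: columns are auto-matched against the known column names.
dataset = CSVDataset(file_path={"data_source": "reviews.csv"}, task=task)
samples = dataset.load_data()

# Import mode: re-load a previously exported file, optionally renaming columns.
imported = CSVDataset(
    file_path={"data_source": "exported_reviews.csv"},
    task=task,
    is_import=True,                                            # assumed keyword argument
    column_map={"review_text": "text", "sentiment": "label"},  # hypothetical mapping
)
imported_samples = imported.load_data()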
- load_raw_data(standardize_columns: bool = False) List[Dict] #
Loads data from a CSV file into raw lists of strings.
- Parameters:
standardize_columns (bool) – whether to standardize column names
- Returns:
The parsed CSV file as a list of dicts.
- Return type:
List[Dict]
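For example, assuming the dataset from the constructor example above:
rows = dataset.load_raw_data(standardize_columns=True)
print(rows[0])  # one dict per CSV row, e.g. {'text': '...', 'label': '...'} after standardization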