langtest.datahandler.datasource.CSVDataset#
- class CSVDataset(file_path: str | Dict, task: TaskManager, **kwargs)#
Bases: BaseDataset
- __init__(file_path: str | Dict, task: TaskManager, **kwargs) None #
Initializes a CSVDataset object.
- Parameters:
file_path (Union[str, Dict]) – The path to the data file, or a dictionary containing the following keys:
  - “data_source”: The path to the data file.
  - “feature_column” (optional): Specifies the column containing input features.
  - “target_column” (optional): Specifies the column containing target labels.
task (TaskManager) – Specifies the task of the dataset, which can be one of the following: “text-classification”, “ner” (Named Entity Recognition), “question-answering”, or “summarization”.
**kwargs – Additional keyword arguments that can be used to configure the dataset (optional).
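Example
A minimal sketch of constructing the dataset directly. The TaskManager import path, its constructor signature, and the CSV file and column names below are assumptions for illustration; in typical use these objects are created for you by the langtest Harness.
from langtest.datahandler.datasource import CSVDataset
from langtest.tasks import TaskManager  # assumed import path

task = TaskManager("text-classification")  # assuming TaskManager accepts the task name
dataset = CSVDataset(
    file_path={
        "data_source": "reviews.csv",     # hypothetical file
        "feature_column": "review_text",  # optional: column with input features
        "target_column": "sentiment",     # optional: column with target labels
    },
    task=task,
)
samples = dataset.load_data()  # returns a List[Sample]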
Methods
- __init__(file_path, task, **kwargs): Initializes a CSVDataset object.
- export_data(data, output_path): Exports the data to the corresponding format and saves it to 'output_path'.
- load_data(): Load data from a CSV file and preprocess it based on the specified task.
- load_raw_data([standardize_columns]): Loads data from a CSV file into raw lists of strings.
Attributes
- data_sources
- supported_tasks
- COLUMN_NAMES = {'crows-pairs': {'mask1': ['mask1'], 'mask2': ['mask2'], 'sentence': ['sentence']}, 'ner': {'chunk': ['chunk_tags', 'chunk_tag'], 'ner': ['label', 'labels ', 'class', 'classes', 'ner_tag', 'ner_tags', 'ner', 'entity'], 'pos': ['pos_tags', 'pos_tag', 'pos', 'part_of_speech'], 'text': ['text', 'sentences', 'sentence', 'sample', 'tokens']}, 'question-answering': {'answer': ['answer', 'answer_and_def_correct_predictions', 'ground_truth'], 'context': ['context', 'passage', 'contract'], 'options': ['options'], 'text': ['question']}, 'summarization': {'summary': ['summary'], 'text': ['text', 'document']}, 'text-classification': {'label': ['label', 'labels ', 'class', 'classes'], 'text': ['text', 'sentences', 'sentence', 'sample']}}#
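This mapping drives automatic column matching: when a CSV is loaded, each standard field (for example “text” or “label”) is resolved to the first header that matches one of its aliases. The snippet below is an illustration of that idea only, not the library's internal implementation:
COLUMN_ALIASES = {
    "text-classification": {
        "text": ["text", "sentences", "sentence", "sample"],
        "label": ["label", "labels ", "class", "classes"],
    },
}

def match_columns(csv_header, task):
    """Map each standard field to the first CSV header matching one of its aliases."""
    lowered = [column.strip().lower() for column in csv_header]
    matched = {}
    for field, aliases in COLUMN_ALIASES[task].items():
        for alias in aliases:
            if alias.strip().lower() in lowered:
                matched[field] = csv_header[lowered.index(alias.strip().lower())]
                break
    return matched

print(match_columns(["Sentence", "Class"], "text-classification"))
# {'text': 'Sentence', 'label': 'Class'}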
A class to handle CSV file datasets. Subclass of BaseDataset.
- _file_path#
The path to the data file, or a dictionary containing a “data_source” key with the path.
- Type:
Union[str, Dict]
- task#
Specifies the task of the dataset, which can be one of “text-classification”, “ner”, “question-answering”, or “summarization”.
- Type:
str
- delimiter#
The delimiter used in the CSV file to separate columns (only applicable when file_path is given as a string).
- Type:
str
- export_data(data: List[Sample], output_path: str)#
Exports the data to the corresponding format and saves it to ‘output_path’.
- Parameters:
data (List[Sample]) – data to export
output_path (str) – path to save the data to
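A short usage sketch, assuming 'dataset' is the CSVDataset built in the constructor example above (the output file name is hypothetical):
samples = dataset.load_data()
dataset.export_data(samples, "exported_reviews.csv")  # writes the samples back out as CSV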
- load_data() List[Sample] #
Load data from a CSV file and preprocess it based on the specified task.
- Returns:
A list of preprocessed data samples.
- Return type:
List[Sample]
- Raises:
ValueError – If the specified task is unsupported.
Note
If ‘is_import’ is set to True in the constructor’s keyword arguments, the data will be imported from the specified ‘file_path’, with the optional ‘column_map’ used to rename columns.
If ‘is_import’ is set to False (default), the data will be loaded from the CSV file specified in ‘file_path’, and its columns will be automatically matched against the known column names.
The supported task types are ‘text-classification’, ‘ner’, ‘summarization’, and ‘question-answering’; the appropriate task-specific loading function will be invoked to preprocess the data.
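A sketch of the two loading modes described in the note above. Passing 'is_import' and 'column_map' as constructor keyword arguments, and the column mapping itself, are assumptions for illustration:
# Default mode: columns are auto-matched against the known column names.
dataset = CSVDataset(file_path={"data_source": "reviews.csv"}, task=task)
samples = dataset.load_data()

# Import mode: re-load a previously exported file, optionally renaming columns.
imported = CSVDataset(
    file_path={"data_source": "exported_reviews.csv"},
    task=task,
    is_import=True,                                            # assumed keyword argument
    column_map={"review_text": "text", "sentiment": "label"},  # hypothetical mapping
)
imported_samples = imported.load_data()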
- load_raw_data(standardize_columns: bool = False) List[Dict] #
Loads data from a CSV file into raw lists of strings.
- Parameters:
standardize_columns (bool) – whether to standardize column names
- Returns:
The parsed CSV file as a list of dicts.
- Return type:
List[Dict]
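For example, assuming the dataset from the constructor example above:
rows = dataset.load_raw_data(standardize_columns=True)
print(rows[0])  # one dict per CSV row, e.g. {'text': '...', 'label': '...'} after standardization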