Source: CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
CommonsenseQA is a multiple-choice question answering dataset that tests whether natural language processing models can answer questions requiring commonsense knowledge. Each question has five answer choices. The questions were written by Amazon Mechanical Turk workers, who were asked to create questions based on a given source concept and three target concepts related to it. Predicting the correct answer requires different types of commonsense knowledge, and the dataset covers topics such as science, history, and everyday life. The available subsets and splits are listed below.
Split | Details |
---|---|
test | Testing set from the CommonsenseQA dataset, containing 1140 questions. This dataset does not contain labels and accuracy & fairness tests cannot be run with it. |
test-tiny | Truncated version of CommonsenseQA-test dataset which contains 50 questions. This dataset does not contain labels and accuracy & fairness tests cannot be run with it. |
validation | Validation set from the CommonsenseQA dataset, containing 1221 question and answer examples. |
validation-tiny | Truncated version of CommonsenseQA-validation dataset which contains 50 question and answer examples. |
Example
In the evaluation process, we first fetch the original_question from the dataset, and the model generates an expected_result from it. To assess robustness, we apply a perturbation to the original_question to produce a perturbed_question, and the model generates an actual_result from that perturbed input. The expected_result and actual_result are then compared using the llm_eval approach, in which an LLM evaluates the model response. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the question-answering task within the robustness category. For a more in-depth exploration of these approaches, refer to this notebook, which discusses all three methods.
category | test_type | original_question | perturbed_question | expected_result | actual_result | pass |
---|---|---|---|---|---|---|
robustness | add_abbreviation | A revolving door is convenient for two direction travel, but it also serves as a security measure at a what? A. bank B. library C. department store D. mall E. new york | A revolving door is convenient 4 two direction travel, but it also serves as a security measure at a wat? A. bank B. library C. department store D. mall E. new york | A. bank | A. bank | True |
Generated results for the gpt-3.5-turbo-instruct model from OpenAI.
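The evaluation flow described in this example can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation. The `ABBREVIATIONS` map, the `add_abbreviation` and `string_distance` helpers, the `stub_model` function, and the `0.2` threshold are all hypothetical names and values chosen for the sketch. The string-distance variant is shown because it needs no external model, and it is approximated here with the standard library's `difflib`.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation map for illustration only; a real
# add_abbreviation perturbation would use a much larger mapping.
ABBREVIATIONS = {"for": "4", "what": "wat"}


def add_abbreviation(text: str) -> str:
    """Replace whole words found in ABBREVIATIONS; leave everything else as-is."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


def string_distance(a: str, b: str) -> float:
    """Return 1 minus the difflib similarity ratio (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def run_robustness_case(model, original_question: str, threshold: float = 0.2) -> dict:
    """Run one robustness case: perturb, query the model twice, compare answers.

    `model` is any callable mapping a question string to an answer string.
    `threshold` is an assumed cut-off for the string distance, not a documented default.
    """
    perturbed_question = add_abbreviation(original_question)
    expected_result = model(original_question)   # answer to the clean input
    actual_result = model(perturbed_question)    # answer to the perturbed input
    distance = string_distance(
        expected_result.strip().lower(), actual_result.strip().lower()
    )
    return {
        "original_question": original_question,
        "perturbed_question": perturbed_question,
        "expected_result": expected_result,
        "actual_result": actual_result,
        "pass": distance <= threshold,
    }


def stub_model(question: str) -> str:
    """Stand-in for a real LLM call; always answers 'A. bank'."""
    return "A. bank"
```

Calling `run_robustness_case(stub_model, question)` with the revolving-door question from the table yields a record with the same fields as the table row. Swapping `string_distance` for an embedding-based distance or an LLM judge changes only the comparison step.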