Source: Natural Questions: A Benchmark for Question Answering Research
The Natural Questions dataset is a large-scale collection of real user questions paired with answers drawn from Wikipedia. It is designed to evaluate automatic question answering systems on open-domain questions. The version included in LangTest contains about 3,500 test examples. Each example consists of a query and one or more answers from the corresponding Wikipedia page.
You can see which subsets and splits are available below.
Split | Details |
---|---|
combined | Training & development set from the NaturalQuestions dataset, containing 3,569 labeled examples |
test | Development set from the NaturalQuestions dataset, containing 1,769 labeled examples |
test-tiny | Training, development & test set from the NaturalQuestions dataset, containing 50 labeled examples |
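As a rough illustration of how one of the splits above might be selected, the sketch below builds a small data configuration and rejects split names that are not in the table. The helper `build_data_config`, the `NQ_SPLITS` mapping, and the data source name `NQ-open` are assumptions made for this example, and the commented-out `Harness` call is only a guess at how such a configuration could be passed to LangTest; check the LangTest documentation for the exact names.

```python
# Split names and example counts, copied from the table above.
NQ_SPLITS = {"combined": 3569, "test": 1769, "test-tiny": 50}


def build_data_config(split: str = "test-tiny", data_source: str = "NQ-open") -> dict:
    """Return a data config dict for the given split (hypothetical helper).

    The default data_source value "NQ-open" is an assumption, not confirmed
    by this page.
    """
    if split not in NQ_SPLITS:
        valid = ", ".join(sorted(NQ_SPLITS))
        raise ValueError(f"Unknown split {split!r}; expected one of: {valid}")
    return {"data_source": data_source, "split": split}


config = build_data_config("test-tiny")
print(config, "->", NQ_SPLITS[config["split"]], "examples")

# Assumed usage with LangTest installed and an OpenAI key configured:
# from langtest import Harness
# harness = Harness(
#     task="question-answering",
#     model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
#     data=config,
# )
```

Starting with test-tiny keeps a first run cheap, since it has only 50 examples; the larger splits can be swapped in once the setup works.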
Example
In the evaluation process, we start by fetching the original_question from the dataset. The model then generates an expected_result based on this input. To assess model robustness, we introduce perturbations to the original_question, resulting in a perturbed_question. The model processes this perturbed input, producing an actual_result. The expected_result and actual_result are then compared using the llm_eval approach, in which an LLM evaluates the model response. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the question-answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.
category | test_type | original_question | perturbed_question | expected_result | actual_result | pass |
---|---|---|---|---|---|---|
robustness | add_abbreviation | on the 6th day of christmas my true love sent to me | on da 6th day of christmas my true <3333 sent 2 me | Six geese a-laying. | On the sixth day of Christmas, my true love sent to me two turtle doves. | False |
Generated results for the gpt-3.5-turbo-instruct model from OpenAI.
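The robustness flow described in this example can be sketched in a few lines, using a string distance in place of llm_eval. This is a minimal illustration rather than LangTest's implementation: `string_distance`, `robustness_check`, `toy_model`, `toy_perturb`, and the 0.3 threshold are all hypothetical, and the toy model simply replays the two answers from the table row above instead of calling a real LLM.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def string_distance(a: str, b: str) -> float:
    """Return 1 minus the difflib similarity ratio; 0.0 means identical text."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def robustness_check(
    model: Callable[[str], str],
    original_question: str,
    perturb: Callable[[str], str],
    threshold: float = 0.3,  # hypothetical cut-off chosen for this sketch
) -> Dict[str, object]:
    """Compare the model's answers to the original and perturbed question."""
    perturbed_question = perturb(original_question)
    expected_result = model(original_question)
    actual_result = model(perturbed_question)
    distance = string_distance(expected_result, actual_result)
    return {
        "original_question": original_question,
        "perturbed_question": perturbed_question,
        "expected_result": expected_result,
        "actual_result": actual_result,
        "distance": distance,
        "pass": distance <= threshold,
    }


# Questions and answers taken from the table row above.
ORIGINAL = "on the 6th day of christmas my true love sent to me"
PERTURBED = "on da 6th day of christmas my true <3333 sent 2 me"


def toy_perturb(question: str) -> str:
    """Stand-in perturbation that maps the original to the table's variant."""
    return PERTURBED if question == ORIGINAL else question


def toy_model(question: str) -> str:
    """Stand-in model that replays the answers shown in the table."""
    answers = {
        ORIGINAL: "Six geese a-laying.",
        PERTURBED: "On the sixth day of Christmas, my true love sent to me two turtle doves.",
    }
    return answers[question]


result = robustness_check(toy_model, ORIGINAL, toy_perturb)
print(result["pass"], round(result["distance"], 2))
```

With these toy inputs the two answers differ substantially, so the check reports a failed test, in line with the False in the pass column above. A string distance is cheap but only measures surface overlap; an embedding distance or llm_eval would be a better fit when differently worded answers should count as equivalent.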