The Natural Questions dataset is a large-scale collection of real user questions with answers drawn from Wikipedia. It is designed to evaluate the performance of automatic question-answering systems on open-domain questions. The version included in LangTest contains about 3,500 test examples; each example consists of a query and one or more answers from the corresponding Wikipedia page.
You can see which subsets and splits are available below.
- Training & development set from the NaturalQuestions dataset, containing 3,569 labeled examples
- Development set from the NaturalQuestions dataset, containing 1,769 labeled examples
- Training, development & test set from the NaturalQuestions dataset, containing 50 labeled examples
In the evaluation process, we start by fetching the original_question from the dataset. The model generates an expected_result from this input. To assess robustness, we apply perturbations to the original_question, producing a perturbed_question; the model processes these perturbed inputs and produces an actual_result. The expected_result and actual_result are then compared using the llm_eval approach, in which an LLM judges whether the two responses agree. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model's performance on the Question-Answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing the three methods.
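To make the String Distance option concrete, here is a minimal sketch of how such a comparison can work: a classic Levenshtein edit distance, normalized to [0, 1], where 0.0 means the expected_result and actual_result are identical. The normalization and the pass threshold shown are illustrative assumptions, not LangTest's exact implementation.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def string_distance(expected: str, actual: str) -> float:
    # Normalize to [0, 1]; 0.0 means identical strings.
    if not expected and not actual:
        return 0.0
    return levenshtein(expected, actual) / max(len(expected), len(actual))

# Hypothetical pass criterion: distance at or below some threshold.
THRESHOLD = 0.1
passed = string_distance("Six geese a-laying.", "Six geese a-laying.") <= THRESHOLD
```

A test case passes when the distance between expected_result and actual_result stays under the chosen threshold; embedding-distance metrics follow the same pass/fail pattern but compare vector representations instead of characters.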
| original_question | perturbed_question | expected_result | actual_result |
|---|---|---|---|
| on the 6th day of christmas my true love sent to me | on da 6th day of christmas my true <3333 sent 2 me | Six geese a-laying. | On the sixth day of Christmas, my true love sent to me two turtle doves. |
Generated Results for