NQ-open

Source: Natural Questions: A Benchmark for Question Answering Research

The Natural Questions dataset is a large-scale collection of real user questions paired with answers drawn from Wikipedia. It is designed to evaluate automatic question answering systems on open-domain questions. The version included in LangTest contains about 3,500 test examples. Each example consists of a query and one or more answers from the corresponding Wikipedia page.

You can see which subsets and splits are available below.

Split	Details
combined	Training & development set from the NaturalQuestions dataset, containing 3,569 labeled examples
test	Development set from the NaturalQuestions dataset, containing 1,769 labeled examples
test-tiny	Training, development & test set from the NaturalQuestions dataset, containing 50 labeled examples
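
As a minimal sketch of how a split from the table above might be selected, the helper below builds a data configuration dictionary. The helper name `nq_open_data_config` and the `NQ_OPEN_SPLITS` mapping are hypothetical, and the `data_source` / `split` keys are an assumption about the shape of the configuration LangTest expects, so check them against the LangTest documentation before relying on them.

```python
# Splits and example counts copied from the table above.
NQ_OPEN_SPLITS = {
    "combined": 3569,
    "test": 1769,
    "test-tiny": 50,
}


def nq_open_data_config(split: str = "test-tiny") -> dict:
    """Return a data config for one NQ-open split (hypothetical helper).

    The key names "data_source" and "split" are assumed, not confirmed
    by this page.
    """
    if split not in NQ_OPEN_SPLITS:
        valid = ", ".join(sorted(NQ_OPEN_SPLITS))
        raise ValueError(f"Unknown split {split!r}; expected one of: {valid}")
    return {"data_source": "NQ-open", "split": split}


config = nq_open_data_config("test-tiny")
print(config, "->", NQ_OPEN_SPLITS[config["split"]], "examples")
```

Starting with `test-tiny` keeps a first run small (50 examples) before moving to the larger splits.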

Example

In the evaluation process, we start by fetching the original_question from the dataset. The model then generates an expected_result from this input. To assess model robustness, we apply perturbations to the original_question, producing a perturbed_question. The model processes this perturbed input and produces an actual_result. The expected_result and actual_result are then compared using the llm_eval approach, in which an LLM evaluates the model response. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the question-answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

category	test_type	original_question	perturbed_question	expected_result	actual_result	pass
robustness	add_abbreviation	on the 6th day of christmas my true love sent to me	on da 6th day of christmas my true <3333 sent 2 me	Six geese a-laying.	On the sixth day of Christmas, my true love sent to me two turtle doves.	False

Generated results for the gpt-3.5-turbo-instruct model from OpenAI
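
The evaluation flow described above can be sketched in a few lines of self-contained Python. This is an illustration under stated assumptions, not LangTest's implementation: `perturb`, `string_distance`, `evaluate`, and `stub_model` are hypothetical names, the distance is a stand-in built on the standard library's `difflib` rather than LangTest's String Distance metric, the 0.3 threshold is arbitrary, and the stub model simply returns canned answers instead of calling an LLM.

```python
from difflib import SequenceMatcher


def perturb(question: str) -> str:
    """Toy perturbation: swap a few words for chat-style abbreviations."""
    abbreviations = {"the": "da", "to": "2", "love": "<3333"}
    return " ".join(abbreviations.get(word, word) for word in question.split())


def string_distance(a: str, b: str) -> float:
    """Stand-in distance in [0, 1]; 0 means identical (case-insensitive)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def evaluate(model, original_question: str, threshold: float = 0.3) -> dict:
    """Run one robustness check; the threshold is an arbitrary assumption."""
    perturbed_question = perturb(original_question)
    expected_result = model(original_question)
    actual_result = model(perturbed_question)
    distance = string_distance(expected_result, actual_result)
    return {
        "original_question": original_question,
        "perturbed_question": perturbed_question,
        "expected_result": expected_result,
        "actual_result": actual_result,
        "pass": distance <= threshold,
    }


def stub_model(question: str) -> str:
    """Hypothetical model: answers correctly only for the exact original wording."""
    known = {
        "on the 6th day of christmas my true love sent to me": "Six geese a-laying.",
    }
    return known.get(question, "Two turtle doves.")


result = evaluate(stub_model, "on the 6th day of christmas my true love sent to me")
print(result["perturbed_question"])
print(result["pass"])
```

A model whose answer is unchanged by the perturbation yields a distance of 0 and passes; the stub fails because its answer changes once the question is abbreviated, mirroring the failed row in the table above.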