OpenBookQA

Source: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

The OpenBookQA dataset is a collection of multiple-choice questions that require complex reasoning and inference over general knowledge, modeled on an “open-book” exam. The questions test whether natural language processing models can go beyond memorizing facts and reason about concepts and their relations. The test set used here contains 500 questions, each with four answer choices and one correct answer. The questions cover science topics such as biology, chemistry, physics, and astronomy.

The available splits are listed below.

Split	Details
test	Test set from the OpenBookQA dataset, containing 500 multiple-choice, elementary-level science questions.
test-tiny	Truncated version of the OpenBookQA test set, containing 50 multiple-choice examples.

Example

The evaluation starts by fetching the original_question from the dataset. The model generates an expected_result from this input. To assess robustness, we apply a perturbation to the original_question, producing a perturbed_question. The model then answers the perturbed input, producing an actual_result. The expected_result and actual_result are compared using the llm_eval approach, in which an LLM judges the model's responses. Alternatively, users can apply metrics such as String Distance or Embedding Distance to evaluate the model's performance on the question-answering task within the robustness category. For a more in-depth look at these approaches, refer to this notebook, which discusses all three methods.

category	test_type	original_question	perturbed_question	options	expected_result	actual_result	pass
robustness	add_abbreviation	There is most likely going to be fog around:	There is most likely going 2 b fog around:	A. a marsh B. a tundra C. da plains D. a desert	A. a marsh	A. a marsh	True

Generated results for the gpt-3.5-turbo-instruct model from OpenAI
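
The evaluation flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's actual implementation: the helper names (add_abbreviation, string_distance, robustness_check, stub_model), the abbreviation table, and the 0.2 threshold are all hypothetical, and the comparison uses a simple string distance from the standard library in place of llm_eval.

```python
from difflib import SequenceMatcher


def add_abbreviation(text: str) -> str:
    """Toy perturbation: replace a few words with chat-style abbreviations.

    The mapping below is an illustrative subset, not the real test's table.
    """
    abbreviations = {"to": "2", "be": "b"}
    return " ".join(abbreviations.get(word, word) for word in text.split())


def string_distance(a: str, b: str) -> float:
    """Return 1 - similarity ratio, so 0.0 means the strings are identical."""
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return 1.0 - ratio


def robustness_check(model, question: str, options: str, threshold: float = 0.2) -> dict:
    """Run one robustness sample and report whether the answer stayed stable.

    `model` is any callable taking (question, options) and returning a string.
    `threshold` is an assumed cutoff on the string distance.
    """
    perturbed = add_abbreviation(question)
    expected = model(question, options)   # answer to the original question
    actual = model(perturbed, options)    # answer to the perturbed question
    distance = string_distance(expected, actual)
    return {
        "original_question": question,
        "perturbed_question": perturbed,
        "expected_result": expected,
        "actual_result": actual,
        "distance": distance,
        "pass": distance <= threshold,
    }


def stub_model(question: str, options: str) -> str:
    """Stand-in for a real model call; always picks the first option."""
    return "A. a marsh"


if __name__ == "__main__":
    result = robustness_check(
        stub_model,
        "There is most likely going to be fog around:",
        "A. a marsh B. a tundra C. the plains D. a desert",
    )
    for key, value in result.items():
        print(f"{key}: {value}")
```

In practice, stub_model would be replaced by a call to the model under test. A string distance is cheap but sensitive to rephrasing, which is one reason an LLM-based or embedding-based comparison may be preferable when answers are free-form.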