
Source: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

The OpenBookQA dataset is a collection of multiple-choice questions that require multi-step reasoning and inference over general knowledge, similar to an “open-book” exam. The questions are designed to test whether natural language processing models can go beyond memorizing facts and show an understanding of concepts and their relations. The test split used here contains 500 questions, each with four answer choices and one correct answer. The questions cover various topics in science, such as biology, chemistry, physics, and astronomy.

You can see which subsets and splits are available below.

| Split | Details |
| --- | --- |
| test | Testing set from the OpenBookQA dataset, containing 500 multiple-choice elementary-level science questions. |
| test-tiny | Truncated version of the test set from the OpenBookQA dataset, containing 50 multiple-choice examples. |


The evaluation starts by fetching the original_question from the dataset. The model generates an expected_result from this input. To assess robustness, we apply a perturbation to the original_question, which yields the perturbed_question. The model then processes the perturbed input and produces an actual_result. The expected_result and actual_result are compared using the llm_eval approach, in which an LLM judges whether the two responses agree. Alternatively, users can compare the two results with metrics such as String Distance or Embedding Distance when evaluating the Question-Answering task in the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

| category | test_type | original_question | perturbed_question | options | expected_result | actual_result | pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| robustness | add_abbreviation | There is most likely going to be fog around: | There is most likely going 2 b fog around: | A. a marsh B. a tundra C. da plains D. a desert | A. a marsh | A. a marsh | True |

Generated results for the gpt-3.5-turbo-instruct model from OpenAI.
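
The evaluation flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: `add_abbreviation`, `toy_model`, `string_distance`, `run_robustness_case`, the abbreviation map, and the 0.2 threshold are all hypothetical names and values chosen for the example. The stand-in model always returns the first option, and the comparison uses a simple string distance from the standard library rather than llm_eval.

```python
from difflib import SequenceMatcher

# Hypothetical abbreviation map; the real add_abbreviation test has its own rules.
ABBREVIATIONS = {"to": "2", "be": "b", "the": "da"}


def add_abbreviation(text: str) -> str:
    """Toy perturbation: swap whole words for chat-style abbreviations."""
    return " ".join(ABBREVIATIONS.get(word.lower(), word) for word in text.split())


def toy_model(question: str, options: list) -> str:
    """Stand-in for an LLM call (assumption): always picks the first option."""
    return options[0]


def string_distance(a: str, b: str) -> float:
    """0.0 for identical strings, approaching 1.0 as they diverge."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def run_robustness_case(question, options, model=toy_model, threshold=0.2):
    """Run one robustness case and return a row shaped like the results table."""
    perturbed = add_abbreviation(question)
    expected = model(question, options)   # answer to the original question
    actual = model(perturbed, options)    # answer to the perturbed question
    return {
        "original_question": question,
        "perturbed_question": perturbed,
        "expected_result": expected,
        "actual_result": actual,
        # The case passes when the two answers are close enough.
        "pass": string_distance(expected, actual) <= threshold,
    }


case = run_robustness_case(
    "There is most likely going to be fog around:",
    ["A. a marsh", "B. a tundra", "C. the plains", "D. a desert"],
)
print(case["perturbed_question"])
print(case["pass"])
```

Swapping `toy_model` for a function that calls a real model, or `string_distance` for an embedding-based or LLM-based comparison, keeps the same overall structure.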
