BoolQ


Source: BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

The BoolQ dataset is a collection of naturally occurring yes/no questions, generated in unprompted and unconstrained settings. The dataset contains about 16k examples, each consisting of a question, a passage, and an answer. The questions cover a wide range of topics and require reading comprehension and reasoning skills to answer. The dataset is intended to explore the surprising difficulty of natural yes/no questions and to benchmark natural language understanding systems.
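To make the example format concrete, here is a minimal sketch of how a BoolQ-style entry might be represented and scored. The sample values are illustrative only, and the `accuracy` helper is a hypothetical convenience function, not part of any library.

```python
# A hypothetical BoolQ-style example: each entry pairs a question and a
# supporting passage with a gold yes/no label.
examples = [
    {
        "question": "is the first series 20 euro note still legal tender",
        "passage": "20 euro note -- Until now there has been only one "
                   "complete series of euro notes; however a new series "
                   "is being released.",
        "answer": False,  # gold yes/no label
    },
]

def accuracy(predictions, gold):
    """Fraction of yes/no predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = [ex["answer"] for ex in examples]
print(accuracy([False], gold))  # → 1.0
```

Because the labels are booleans, plain equality is enough to score a model's answers on the labeled splits.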

You can see which subsets and splits are available below.

| Split | Details |
|-------|---------|
| combined | Training, development & test sets from the BoolQ dataset, containing 15,942 labeled examples |
| dev | Dev set from the BoolQ dataset, containing 3,270 labeled examples |
| dev-tiny | Truncated version of the dev set from the BoolQ dataset, containing 50 labeled examples |
| test | Test set from the BoolQ dataset, containing 3,245 examples. This split does not contain labels, so accuracy & fairness tests cannot be run with it. |
| test-tiny | Truncated version of the test set from the BoolQ dataset, containing 50 examples. This split does not contain labels, so accuracy & fairness tests cannot be run with it. |
| bias | Manually annotated bias version of the BoolQ dataset, containing 136 labeled examples |

Example

In the evaluation process, we start by fetching the original_context and original_question from the dataset. The model then generates an expected_result based on this input. To assess model robustness, we introduce perturbations into the original_context and original_question, producing a perturbed_context and perturbed_question. The model processes these perturbed inputs and returns an actual_result. The expected_result and actual_result are then compared using the llm_eval approach, in which an LLM judges whether the two responses agree. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model's performance on the Question-Answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

| category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
|----------|-----------|------------------|-------------------|-------------------|--------------------|-----------------|---------------|------|
| robustness | add_abbreviation | 20 euro note – Until now there has been only one complete series of euro notes; however a new series, similar to the current one, is being released. The European Central Bank will, in due time, announce when banknotes from the first series lose legal tender status. | is the first series 20 euro note still legal tender | 20 euro note – Until now there has been only one complete series of euro notes; however a new series, similar 2 da current one, is being released. TdaEuropean Central Bank will, in due time, announce when banknotes from thdast series lose legal tender status. | is da 1st series 20 euro note still legal tender | False | False | True |
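As a rough illustration of the String Distance option mentioned above, the sketch below compares an expected_result and an actual_result with a normalized similarity ratio from Python's standard library. The `passes` helper and the 0.8 threshold are assumptions for illustration, not the metric or cutoff used by any particular toolkit.

```python
from difflib import SequenceMatcher

def string_similarity(expected: str, actual: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the answers are identical."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()

def passes(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Hypothetical pass criterion: answers must be sufficiently similar."""
    return string_similarity(expected, actual) >= threshold

print(passes("False", "False"))  # → True
print(passes("False", "True"))   # → False
```

A test such as the row above passes when the answer to the perturbed input stays close enough to the answer on the original input; an Embedding Distance metric would work the same way but compare vector representations instead of characters.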

Generated results for the gpt-3.5-turbo-instruct model from OpenAI.
