
Source: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

The ASDiv benchmark is a dataset of math word problems (MWPs) designed to evaluate the capability of MWP solvers. It is diverse in both language patterns and problem types. It is intended as a challenging benchmark for natural language processing models and aims to encourage the development of models that can perform complex reasoning and inference.

You can see which subsets and splits are available below.

| Split | Details |
| --- | --- |
| test | Testing set from the ASDiv dataset, containing 2305 question-and-answer examples. |
| test-tiny | Truncated version of the test set from the ASDiv dataset, containing 50 question-and-answer examples. |


The evaluation starts by fetching original_context and original_question from the dataset. The model generates an expected_result from this input. To assess robustness, perturbations are applied to original_context and original_question, producing perturbed_context and perturbed_question. The model then processes the perturbed inputs and produces an actual_result.

The expected_result and actual_result are compared using the QAEvalChain approach from the LangChain library. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the Question-Answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

| category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| robustness | add_ocr_typo | Seven red apples and two green apples are in the basket. | How many apples are in the basket? | zeien r^ed apples an^d tvO grcen apples are i^n t^e basket. | ho^w m^any apples are i^n t^he basket? | Nine apples are in the basket. | Four apples are in the basket. | False |

Generated Results for gpt-3.5-turbo-instruct model from OpenAI
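
The evaluation flow described above can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: `toy_model`, `toy_ocr_typo`, `answers_match`, and `evaluate_sample` are hypothetical names, the perturbation is a fixed word swap borrowed from the example row rather than the real add_ocr_typo test, and the comparison is a plain normalized exact match standing in for QAEvalChain, String Distance, or Embedding Distance. It is not the library's implementation.

```python
# Hypothetical sketch of one robustness sample; not the library's actual code.

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9}

# Fixed swaps taken from the example row; a real OCR perturbation is broader.
OCR_SWAPS = {"Seven": "zeien", "two": "tvO", "green": "grcen"}


def toy_ocr_typo(text: str) -> str:
    """Toy stand-in for an OCR-typo perturbation (assumption)."""
    for clean, noisy in OCR_SWAPS.items():
        text = text.replace(clean, noisy)
    return text


def toy_model(context: str, question: str) -> str:
    """Toy stand-in for an LLM: sums the number words found in the context."""
    words = (w.strip(".,?!").lower() for w in context.split())
    total = sum(NUMBER_WORDS.get(w, 0) for w in words)
    return f"{total} apples are in the basket."


def answers_match(expected: str, actual: str) -> bool:
    """Simplified comparison: case- and whitespace-insensitive equality."""
    return expected.strip().lower() == actual.strip().lower()


def evaluate_sample(model, context: str, question: str, perturb=toy_ocr_typo) -> dict:
    """Run the model on original and perturbed inputs and compare the answers."""
    expected = model(context, question)
    perturbed_context = perturb(context)
    perturbed_question = perturb(question)
    actual = model(perturbed_context, perturbed_question)
    return {
        "original_context": context,
        "original_question": question,
        "perturbed_context": perturbed_context,
        "perturbed_question": perturbed_question,
        "expected_result": expected,
        "actual_result": actual,
        "pass": answers_match(expected, actual),
    }


if __name__ == "__main__":
    row = evaluate_sample(
        toy_model,
        "Seven red apples and two green apples are in the basket.",
        "How many apples are in the basket?",
    )
    for key, value in row.items():
        print(f"{key}: {value}")
```

The comparison step is kept as a separate function so that a different metric can be swapped in without touching the rest of the loop. Exact match is strict; an LLM-based grader or a distance metric with a threshold would tolerate rephrased but equivalent answers.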
