Source: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers
The ASDiv benchmark is a dataset of math word problems (MWPs) designed to evaluate the capability of various MWP solvers. It is diverse in both language patterns and problem types, and is intended as a challenging benchmark for natural language processing models that encourages the development of models capable of complex reasoning and inference.
You can see which subsets and splits are available below.
Split | Details |
---|---|
test | Testing set from the ASDiv dataset, containing 2305 question-and-answer examples. |
test-tiny | Truncated version of the test set from the ASDiv dataset, containing 50 question-and-answer examples. |
Example
The evaluation starts by fetching original_context and original_question from the dataset. The model generates an expected_result from this unmodified input. To assess robustness, perturbations are then applied to original_context and original_question, producing perturbed_context and perturbed_question. The model processes these perturbed inputs and produces an actual_result. The expected_result and actual_result are compared using the QAEvalChain approach from the LangChain library. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the question-answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.
category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
---|---|---|---|---|---|---|---|---|
robustness | add_ocr_typo | Seven red apples and two green apples are in the basket. | How many apples are in the basket? | zeien r^ed apples an^d tvO grcen apples are i^n t^e basket. | ho^w m^any apples are i^n t^he basket? | Nine apples are in the basket. | Four apples are in the basket. | False |
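The pass/fail decision in the row above can be sketched in a few lines. This is only an illustrative stand-in: it swaps the QAEvalChain comparison for a plain normalized exact match, and the RobustnessSample class and the normalize and passes helpers are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class RobustnessSample:
    """Hypothetical container mirroring the columns of the example table."""
    original_question: str
    perturbed_question: str
    expected_result: str  # model answer on the original input
    actual_result: str    # model answer on the perturbed input


def normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace before comparing.
    return " ".join(text.lower().split())


def passes(sample: RobustnessSample) -> bool:
    # Assumption: a simple exact match replaces the LLM-based comparison.
    return normalize(sample.expected_result) == normalize(sample.actual_result)


sample = RobustnessSample(
    original_question="How many apples are in the basket?",
    perturbed_question="ho^w m^any apples are i^n t^he basket?",
    expected_result="Nine apples are in the basket.",
    actual_result="Four apples are in the basket.",
)
print(passes(sample))  # False: the answer changed under perturbation
```

An exact match is stricter than an LLM-based or embedding-based comparison: two answers that express the same number in different words would be marked as a failure, which is one reason a semantic evaluator may be preferable in practice.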
Generated Results for gpt-3.5-turbo-instruct model from OpenAI