ASDiv

Source: A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers

The ASDiv benchmark is a dataset of math word problems (MWPs) designed to evaluate the capability of various MWP solvers. The dataset is diverse in terms of both language patterns and problem types. The dataset is intended to be a challenging benchmark for natural language processing models and to encourage the development of models that can perform complex reasoning and inference.

You can see which subsets and splits are available below.

Split	Details
test	Testing set from the ASDiv dataset, containing 2305 question and answers andexamples.
test-tiny	Truncated version of the test set from the ASDiv dataset, containing 50 question and answers examples.

Example

In the evaluation process, we start by fetching original_context and original_question from the dataset. The model then generates an expected_result based on this input. To assess model robustness, we introduce perturbations to the original_context and original_question, resulting in perturbed_context and perturbed_question. The model processes these perturbed inputs, producing an actual_result. The comparison between the expected_result and actual_result is conducted using the QAEvalChain approach from the LangChain library. Alternatively, users can employ metrics like String Distance or Embedding Distance to evaluate the model’s performance in the Question-Answering Task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

category	test_type	original_context	original_question	perturbed_context	perturbed_question	expected_result	actual_result	pass
robustness	add_ocr_typo	Seven red apples and two green apples are in the basket.	How many apples are in the basket?	zeien r^ed apples an^d tvO grcen apples are i^n t^e basket.	ho^w m^any apples are i^n t^he basket?	Nine apples are in the basket.	Four apples are in the basket.	False

Generated Results for gpt-3.5-turbo-instruct model from OpenAI