**Source:** MMLU: Measuring Massive Multitask Language Understanding

The MMLU dataset is a collection of questions and answers that cover many subjects across various domains, such as STEM, humanities, social sciences, and more. The dataset is designed to measure the performance of language understanding models on a wide range of tasks, such as elementary mathematics, US history, computer science, law, and more. Each sample has a question, and 4 choices and one of them is correct. The dataset can be used to evaluate the models’ ability to reason, recall facts, and apply knowledge to different scenarios.

You can see which subsets and splits are available below.

Split | Details |
---|---|

test |
Test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. We took 50 samples from each task in the test set. |

test-tiny |
Truncated version of the test set from the MMLU dataset which covers 57 tasks including elementary mathematics, US history, computer science, law, and more. We took 10 samples from each task in the test-tiny set. |

clinical |
Curated version of the MMLU dataset which contains the clinical subsets (college_biology, college_medicine, medical_genetics, human_aging, professional_medicine, nutrition). |

#### Example

In the evaluation process, we start by fetching *original_question* from the dataset. The model then generates an *expected_result* based on this input. To assess model robustness, we introduce perturbations to the *original_question*, resulting in *perturbed_question*. The model processes these perturbed inputs, producing an *actual_result*. The comparison between the *expected_result* and *actual_result* is conducted using the `llm_eval`

approach (where llm is used to evaluate the model response). Alternatively, users can employ metrics like **String Distance** or **Embedding Distance** to evaluate the model’s performance in the Question-Answering Task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

category | test_type | original_question | perturbed_question | options | expected_result | actual_result | pass |
---|---|---|---|---|---|---|---|

robustness | add_speech_to_text_typo | Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. | Find the degree for the given field extension Kew(sqrt(2), sqrt(3), sqrt(18)) over Q. | A. 0 B. 4 C. 2 D. 6 |
B. 4 | B. 4 | True |

Generated Results for

`gpt-3.5-turbo-instruct`

model from`OpenAI`