SocialIQA

Source: SocialIQA: Commonsense Reasoning about Social Interactions

SocialIQA is a dataset for testing the social commonsense reasoning of language models. It consists of over 1900 multiple-choice questions about various social situations and their possible outcomes or implications. The questions are based on real-world prompts from online platforms, and the answer candidates are either human-curated or machine-generated and filtered. The dataset challenges the models to understand the emotions, intentions, and social norms of human interactions.

You can see which subsets and splits are available below.

Split	Details
test	Testing set from the SIQA dataset, containing 1954 question and answer examples.
test-tiny	Truncated version of the SIQA test set, containing 50 question and answer examples.

Example

During evaluation, we first fetch original_context and original_question from the dataset, and the model generates an expected_result from this input. To assess robustness, we then perturb original_context and original_question to obtain perturbed_context and perturbed_question. The model processes these perturbed inputs and produces an actual_result. The expected_result and actual_result are compared using the llm_eval approach, in which an LLM evaluates the model’s response. Alternatively, users can employ metrics such as String Distance or Embedding Distance to evaluate the model’s performance on the Question-Answering task within the robustness category. For a more in-depth exploration of these approaches, you can refer to this notebook discussing these three methods.

category	test_type	original_context	original_question	perturbed_context	perturbed_question	options	expected_result	actual_result	pass
robustness	dyslexia_word_swap	Tracy didn’t go home that evening and resisted Riley’s attacks.	What does Tracy need to do before this?	Tracy didn’t go home that evening and resisted Riley’s attacks.	What does Tracy need too do before this?	A. make a new plan B. Go home and sea Riley C. Find somewhere too go	C. Find somewhere to go	A. Make a new plan	False

Generated results for the gpt-3.5-turbo-instruct model from OpenAI
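
The perturb-and-compare flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's implementation: the DYSLEXIA_SWAPS map, the function names, and the 0.8 threshold are all hypothetical, and the standard library's difflib similarity ratio stands in for a string-distance metric. The model's answers are hard-coded from the example row rather than generated.

```python
from difflib import SequenceMatcher

# Hypothetical homophone map, for illustration only; a real
# dyslexia_word_swap test would use a much larger word list.
DYSLEXIA_SWAPS = {"to": "too", "see": "sea", "their": "there"}


def dyslexia_word_swap(text):
    """Replace whole words that appear in the swap map."""
    words = text.split(" ")
    return " ".join(DYSLEXIA_SWAPS.get(word, word) for word in words)


def string_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_pass(expected_result, actual_result, threshold=0.8):
    """Pass when the two answers are similar enough (threshold is an assumption)."""
    return string_similarity(expected_result, actual_result) >= threshold


original_question = "What does Tracy need to do before this?"
perturbed_question = dyslexia_word_swap(original_question)

# Answers copied from the example row; in practice the model produces
# expected_result from the original input and actual_result from the
# perturbed input.
expected_result = "C. Find somewhere to go"
actual_result = "A. Make a new plan"

print(perturbed_question)
print(is_pass(expected_result, actual_result))
```

With the answers from the example row, the two strings differ substantially, so the check fails, which matches the False in the pass column. A string-based comparison is cheap but sensitive to wording; the llm_eval and Embedding Distance approaches are better suited when the same answer can be phrased in different ways.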