Overview
This notebook provides a comprehensive overview of benchmarking Large Language Models (LLMs) on Question-Answering tasks. We conduct Robustness and Accuracy tests to evaluate LLM performance, testing the mistralai/Mistral-7B-Instruct-v0.1 model on the OpenBookQA dataset; a minimal setup sketch follows the table below.
Open In Colab
Category | Hub | Task | Dataset Used | Open In Colab
---|---|---|---|---
Robustness, Accuracy | Hugging Face Inference API | Question-Answering | OpenBookQA |
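As a rough sketch of how this benchmark can be wired up with langtest's `Harness`: the hub identifier (`"huggingface-inference-api"`), the dataset split (`"test-tiny"`), and the config file name (`config.yml`) below are assumptions for illustration, not values taken from this notebook.

```python
# Minimal Harness setup sketch for the benchmark described above.
# Hub identifier, split name, and config file name are assumptions.
from langtest import Harness

harness = Harness(
    task="question-answering",
    model={
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "hub": "huggingface-inference-api",  # assumed hub identifier
    },
    data={"data_source": "OpenBookQA", "split": "test-tiny"},  # assumed split
    config="config.yml",  # the YAML shown in the "Config Used" section below
)
```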
Config Used
evaluation:
  hub: openai
  metric: llm_eval
  model: gpt-3.5-turbo-instruct
model_parameters:
  max_tokens: 32
  user_prompt: "You are an AI bot specializing in providing accurate and concise answers\
    \ to questions. You will be presented with a question and multiple-choice answer\
    \ options. Your task is to choose the correct answer.\nNote: Do not explain your\
    \ answer.\nQuestion: {question}\nOptions: {options}\n Answer:"
tests:
  defaults:
    min_pass_rate: 0.65
  robustness:
    add_abbreviation:
      min_pass_rate: 0.75
    add_ocr_typo:
      min_pass_rate: 0.75
    add_slangs:
      min_pass_rate: 0.75
    add_speech_to_text_typo:
      min_pass_rate: 0.75
    add_typo:
      min_pass_rate: 0.75
    adjective_synonym_swap:
      min_pass_rate: 0.75
    dyslexia_word_swap:
      min_pass_rate: 0.75
    lowercase:
      min_pass_rate: 0.75
    titlecase:
      min_pass_rate: 0.75
    uppercase:
      min_pass_rate: 0.75
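With the Harness configured as in the sketch above, a typical run generates the perturbed test cases, executes them against the model, and summarizes pass rates per test type against the `min_pass_rate` thresholds in this config. The snippet below continues that sketch and follows langtest's generate/run/report flow.

```python
# Generate robustness/accuracy test cases from OpenBookQA samples,
# run them against the model, and collect a per-test pass-rate report.
harness.generate()          # create perturbed test cases (typos, case changes, etc.)
harness.run()               # query mistralai/Mistral-7B-Instruct-v0.1 on each case
report = harness.report()   # pass rates compared against min_pass_rate thresholds

print(report)
```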