LLM Testing Notebooks

The table below gives an overview of the tutorial notebooks for testing various LLMs. The tests cover Large Language Model tasks and behaviors such as Question-Answering, Summarization, Sycophancy, Clinical, and Gender-Bias, among others. Refer to the table for the hub and task each notebook targets.

Tutorial Description	Hub	Task
OpenAI QA/Summarization: OpenAI model testing for Question Answering and Summarization	OpenAI	Question-Answering/Summarization
AI21 QA/Summarization: AI21 model testing for Question Answering and Summarization	AI21	Question-Answering/Summarization
Cohere QA/Summarization: Cohere model testing for Question Answering and Summarization	Cohere	Question-Answering/Summarization
Hugging Face Inference API QA/Summarization: Hugging Face Inference API model testing for Question Answering and Summarization	Hugging Face Inference API	Question-Answering/Summarization
Hugging Face Hub QA/Summarization: Hugging Face Hub model testing for Question Answering and Summarization	Hugging Face Hub	Question-Answering/Summarization
Azure-OpenAI QA/Summarization: Azure-OpenAI model testing for Question Answering and Summarization	Azure-OpenAI	Question-Answering/Summarization
Toxicity: Evaluates the `gpt-3.5-turbo-instruct` model on the toxicity test.	OpenAI	Text-Generation
Clinical: Assesses any demographic bias the model might exhibit when suggesting treatment plans for two patients with identical diagnoses.	OpenAI	Text-Generation
Ideology: Evaluates the model's ability to capture nuanced political beliefs beyond the traditional left-right spectrum.	OpenAI	Question-Answering
Disinformation: In this tutorial, we assess the model’s capability to generate disinformation.	AI21	Text-Generation
Factuality: In this tutorial, we assess how well LLMs can identify the factual accuracy of summary sentences. This is essential in ensuring that LLMs generate summaries that are consistent with the information presented in the source article.	OpenAI	Question-Answering
Legal: In this tutorial, we assess the model on the LegalSupport dataset. Each sample consists of a text passage making a legal claim, and two case summaries.	OpenAI	Question-Answering
Security: In this tutorial, we assess prompt injection vulnerabilities in LLMs. The test evaluates the model’s resilience against adversarial attacks and assesses its ability to handle sensitive information appropriately.	OpenAI	Text-Generation
Sensitivity: In this tutorial, we assess the model's sensitivity by adding negation and toxic words to the input and analyzing how the LLM response changes.	OpenAI	Question-Answering
Sycophancy: Sycophancy is an undesirable behavior in which models tailor their responses to align with a human user’s view, even when that view is not objectively correct. In this notebook, we propose a simple synthetic data intervention to reduce this behavior in language models.	OpenAI	Question-Answering
Stereotype: In this tutorial, we assess the model on gender occupational stereotype statements.	OpenAI/AI21	Question-Answering
LM Studio: Runs Hugging Face quantized models through LM Studio and tests them on a Question Answering task.	LM Studio	Question-Answering
Question Answering Benchmarking: This notebook provides a demo on benchmarking Large Language Models (LLMs) for Question-Answering tasks.	Hugging Face Inference API	Question-Answering
Fewshot Model Evaluation: This notebook demonstrates how to optimize and evaluate your models using few-shot prompting techniques.	OpenAI	Question-Answering
Evaluating NER in LLMs: In this tutorial, we assess the support for Named Entity Recognition (NER) tasks, specifically for Large Language Models (LLMs).	OpenAI	Question-Answering
Swapping Drug Names Test: In this notebook, we discuss implementing tests that swap generic drug names with brand names and vice versa. This helps ensure accurate evaluations in medical and pharmaceutical contexts.	OpenAI	Question-Answering
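
To make the drug-name swapping idea from the last row concrete, here is a minimal, self-contained sketch of such a perturbation in plain Python. It is not the notebook's implementation: the `GENERIC_TO_BRAND` mapping and both function names are hypothetical, chosen only for illustration, and a real test would draw on a much larger drug vocabulary.

```python
import re

# Hypothetical, deliberately tiny mapping (generic name -> brand name).
# Keys must be lowercase because matching is case-insensitive.
GENERIC_TO_BRAND = {
    "ibuprofen": "Advil",
    "acetaminophen": "Tylenol",
}


def swap_generic_to_brand(text, mapping=GENERIC_TO_BRAND):
    """Replace whole-word, case-insensitive matches of mapping keys with their values."""
    if not mapping:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in mapping) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)


def swap_brand_to_generic(text, mapping=GENERIC_TO_BRAND):
    """Apply the reverse substitution (brand name -> generic name)."""
    reverse = {brand.lower(): generic for generic, brand in mapping.items()}
    return swap_generic_to_brand(text, reverse)


original = "Can I take ibuprofen with food?"
perturbed = swap_generic_to_brand(original)
print(perturbed)
```

A test built on this would send both `original` and `perturbed` to the model and compare the two answers; the model call is left out here because it depends on the hub and credentials in use.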