LLM Testing Notebooks

The following table gives an overview of the tutorial notebooks available for testing various LLMs. They cover a broad range of tests, including Question-Answering, Summarization, Sycophancy, Clinical, Gender-Bias, and more. Refer to the table below for the full list; a minimal usage sketch follows the table.

| Tutorial | Description | Hub | Task | Open In Colab |
|---|---|---|---|---|
| OpenAI QA/Summarization | Testing OpenAI models for question answering and summarization. | OpenAI | Question-Answering/Summarization | Open In Colab |
| AI21 QA/Summarization | Testing AI21 models for question answering and summarization. | AI21 | Question-Answering/Summarization | Open In Colab |
| Cohere QA/Summarization | Testing Cohere models for question answering and summarization. | Cohere | Question-Answering/Summarization | Open In Colab |
| Hugging Face Inference API QA/Summarization | Testing Hugging Face Inference API models for question answering and summarization. | Hugging Face Inference API | Question-Answering/Summarization | Open In Colab |
| Hugging Face Hub QA/Summarization | Testing Hugging Face Hub models for question answering and summarization. | Hugging Face Hub | Question-Answering/Summarization | Open In Colab |
| Azure-OpenAI QA/Summarization | Testing Azure-OpenAI models for question answering and summarization. | Azure-OpenAI | Question-Answering/Summarization | Open In Colab |
| Toxicity | Evaluating the gpt-3.5-turbo-instruct model on the toxicity test. | OpenAI | Text-Generation | Open In Colab |
| Clinical | Assessing demographic bias the model may exhibit when suggesting treatment plans for two patients with identical diagnoses. | OpenAI | Text-Generation | Open In Colab |
| Ideology | Evaluating the model's ability to capture nuanced political beliefs beyond the traditional left-right spectrum. | OpenAI | Question-Answering | Open In Colab |
| Disinformation | Assessing the model's capability to generate disinformation. | AI21 | Text-Generation | Open In Colab |
| Factuality | Assessing how well LLMs identify the factual accuracy of summary sentences, which is essential for ensuring that generated summaries stay consistent with the source article. | OpenAI | Question-Answering | Open In Colab |
| Legal | Assessing the model on the LegalSupport dataset, where each sample consists of a text passage making a legal claim and two case summaries. | OpenAI | Question-Answering | Open In Colab |
| Security | Assessing prompt-injection vulnerabilities in LLMs, evaluating the model's resilience against adversarial attacks and its ability to handle sensitive information appropriately. | OpenAI | Text-Generation | Open In Colab |
| Sensitivity | Assessing model sensitivity by adding negation and toxic words and analyzing the change in the LLM's response. | OpenAI | Question-Answering | Open In Colab |
| Sycophancy | Sycophancy is an undesirable behavior in which models tailor their responses to align with a human user's view, even when that view is not objectively correct. This notebook proposes a simple synthetic-data intervention to reduce that behavior in language models. | OpenAI | Question-Answering | Open In Colab |
| Stereotype | Assessing the model on gender occupational stereotype statements. | OpenAI/AI21 | Question-Answering | Open In Colab |
| LM Studio | Running Hugging Face quantized models through LM Studio and testing them on a question-answering task. | LM Studio | Question-Answering | Open In Colab |
| Question Answering Benchmarking | Benchmarking Large Language Models (LLMs) on question-answering tasks. | Hugging Face Inference API | Question-Answering | Open In Colab |
| Fewshot Model Evaluation | Optimizing and evaluating models using few-shot prompting techniques. | OpenAI | Question-Answering | Open In Colab |
| Evaluating NER in LLMs | Assessing support for Named Entity Recognition (NER) tasks with Large Language Models (LLMs). | OpenAI | Question-Answering | Open In Colab |
| Swapping Drug Names Test | Implementing tests that swap generic drug names with brand names and vice versa, ensuring accurate evaluations in medical and pharmaceutical contexts. | OpenAI | Question-Answering | Open In Colab |
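
Most of these notebooks follow the same pattern: connect a test harness to a model on one of the hubs above, generate perturbed test cases, run them, and inspect a pass/fail report. Below is a minimal sketch of that flow, assuming the langtest `Harness` API; the model name, dataset, and test configuration shown are illustrative assumptions rather than values taken from any particular notebook.

```python
# Minimal sketch of the common notebook flow (assumed langtest Harness API).
# Model, dataset, and thresholds below are illustrative, not prescriptive.
import os

from langtest import Harness

# OpenAI-backed notebooks expect an API key in the environment.
os.environ["OPENAI_API_KEY"] = "<your-api-key>"

# Point the harness at a model on one of the hubs from the table above.
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data={"data_source": "BoolQ", "split": "test-tiny"},  # assumed dataset
)

# Choose which tests to run and their passing thresholds (illustrative).
harness.configure(
    {
        "tests": {
            "defaults": {"min_pass_rate": 0.65},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.66},
                "add_typo": {"min_pass_rate": 0.60},
            },
        }
    }
)

harness.generate()         # create perturbed test cases from the dataset
harness.run()              # query the model on each test case
report = harness.report()  # pass/fail summary per test type
print(report)
```

Swapping the `hub`, `model`, and `task` values reproduces the setup for the other rows of the table, e.g. `"hub": "ai21"` for the AI21 notebooks or `task="text-generation"` for the toxicity and clinical tests.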