The following table gives an overview of the tutorial notebooks for testing various LLMs. The notebooks cover a range of tests for Large Language Models, such as Question-Answering, Summarization, Sycophancy, Clinical, Gender-Bias, and more. Please refer to the table below for the full list; a minimal usage sketch follows the table.
Tutorial Description | Hub | Task | Open In Colab |
---|---|---|---|
OpenAI QA/Summarization: OpenAI Model Testing for Question Answering and Summarization | OpenAI | Question-Answering/Summarization | |
AI21 QA/Summarization: AI21 Model Testing for Question Answering and Summarization | AI21 | Question-Answering/Summarization | |
Cohere QA/Summarization: Cohere Model Testing for Question Answering and Summarization | Cohere | Question-Answering/Summarization | |
Hugging Face Inference API QA/Summarization: Hugging Face Inference API Model Testing for Question Answering and Summarization | Hugging Face Inference API | Question-Answering/Summarization | |
Hugging Face Hub QA/Summarization: Hugging Face Hub Model Testing for Question Answering and Summarization | Hugging Face Hub | Question-Answering/Summarization | |
Azure-OpenAI QA/Summarization: Azure-OpenAI Model Testing for Question Answering and Summarization | Azure-OpenAI | Question-Answering/Summarization | |
Toxicity: Evaluating the gpt-3.5-turbo-instruct model on the toxicity test. | OpenAI | Text-Generation | |
Clinical: Assessing any demographic bias the model might exhibit when suggesting treatment plans for two patients with identical diagnoses. | OpenAI | Text-Generation | |
Ideology: Evaluating the model's ability to capture nuanced political beliefs beyond the traditional left-right spectrum. | OpenAI | Question-Answering | |
Disinformation: In this tutorial, we assess the model's capability to generate disinformation. | AI21 | Text-Generation | |
Factuality: In this tutorial, we assess how well LLMs can identify the factual accuracy of summary sentences. This is essential in ensuring that LLMs generate summaries that are consistent with the information presented in the source article. | OpenAI | Question-Answering | |
Legal: In this tutorial, we assess the model on the LegalSupport dataset. Each sample consists of a text passage making a legal claim and two case summaries. | OpenAI | Question-Answering | |
Security: In this tutorial, we assess prompt injection vulnerabilities in LLMs, evaluating the model's resilience against adversarial attacks and its ability to handle sensitive information appropriately. | OpenAI | Text-Generation | |
Sensitivity: In this tutorial, we assess the model's sensitivity by adding negation and toxic words and analyzing the behavior of the LLM's response. | OpenAI | Question-Answering | |
Sycophancy: Sycophancy is an undesirable behavior in which models tailor their responses to align with a human user's view, even when that view is not objectively correct. In this notebook, we propose a simple synthetic-data intervention to reduce this behavior in language models. | OpenAI | Question-Answering | |
Stereotype: In this tutorial, we assess the model on gender occupational stereotype statements. | OpenAI/AI21 | Question-Answering | |
LM Studio: Running Hugging Face quantized models through LM Studio and testing them on a Question-Answering task. | LM Studio | Question-Answering | |
Question Answering Benchmarking: This notebook provides a demo on benchmarking Large Language Models (LLMs) for Question-Answering tasks. | Hugging Face Inference API | Question-Answering | |
Fewshot Model Evaluation: This notebook provides a demo on optimizing and evaluating your models using few-shot prompting techniques. | OpenAI | Question-Answering | |
Evaluating NER in LLMs: In this tutorial, we assess support for Named Entity Recognition (NER) tasks specifically for Large Language Models (LLMs). | OpenAI | Question-Answering | |
Swapping Drug Names Test: In this notebook, we implement tests that swap generic drug names with brand names and vice versa, ensuring accurate evaluations in medical and pharmaceutical contexts. | OpenAI | Question-Answering | |
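
Most of these notebooks follow the same harness pattern: configure a task, a model hub, and a dataset, then generate test cases, run them, and report the results. The sketch below illustrates that flow using the `langtest` Harness API; the model name (`gpt-3.5-turbo-instruct`) and dataset (`BoolQ`) are illustrative assumptions and may differ from notebook to notebook.

```python
# Minimal sketch of a Question-Answering test run. Assumes the `langtest`
# package (pip install langtest) with the `openai` client installed and an
# OpenAI API key; the model and dataset below are illustrative, not prescriptive.
import os

from langtest import Harness

os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"  # placeholder credential

# Point the harness at a task, a model hub, and a benchmark dataset.
harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data={"data_source": "BoolQ", "split": "test-tiny"},
)

harness.generate()  # build perturbed test cases from the dataset
harness.run()       # query the model on every test case
harness.report()    # tabulate pass rates per test category
```

The same three-step `generate` / `run` / `report` flow applies when swapping in another hub (AI21, Cohere, Hugging Face, Azure-OpenAI, LM Studio) or another task from the table; only the `model` and `data` configuration changes.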