langtest.metrics.llm_eval.LlmEval#

class LlmEval(llm, template="You are a teacher grading a quiz.\nYou are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\n\nExample Format:\nQUESTION: question here\nSTUDENT ANSWER: student's answer here\nTRUE ANSWER: true answer here\nGRADE: CORRECT or INCORRECT here\n\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more or relevant information than the true answer, as long as it does not contain any conflicting statements. Begin!\n\nQUESTION: {query}\nSTUDENT ANSWER: {result}\nTRUE ANSWER: {answer}\nGRADE:", input_variables=['query', 'result', 'answer'])#

Bases: object

Evaluator that grades question-answering predictions with a language model.

__init__(llm, template="You are a teacher grading a quiz.\nYou are given a question, the student's answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\n\nExample Format:\nQUESTION: question here\nSTUDENT ANSWER: student's answer here\nTRUE ANSWER: true answer here\nGRADE: CORRECT or INCORRECT here\n\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more or relevant information than the true answer, as long as it does not contain any conflicting statements. Begin!\n\nQUESTION: {query}\nSTUDENT ANSWER: {result}\nTRUE ANSWER: {answer}\nGRADE:", input_variables=['query', 'result', 'answer'])#

Initializes the LlmEval object.

Parameters:
  • llm – The language model for evaluation.

  • template – Template for model prompts.

  • input_variables – Variables expected in the input.

  • server_prompt – Server prompt for model predictions.

Raises:

ValueError – If input variables do not match expected values.
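
A minimal construction sketch. The llm argument is assumed to be a grading-model handle you have already created elsewhere (the exact wrapper type is not specified in this reference), so the placeholder below is an assumption, not part of the API:

    from langtest.metrics.llm_eval import LlmEval

    # `grading_llm` is a placeholder: any language-model handle accepted by
    # LlmEval, instantiated elsewhere in your code.
    grading_llm = ...  # assumed to exist

    # Use the default teacher-grading template and input variables.
    evaluator = LlmEval(llm=grading_llm)

    # A custom template must expose exactly the placeholders declared in
    # input_variables; a mismatch raises ValueError at construction time.
    custom_evaluator = LlmEval(
        llm=grading_llm,
        template=(
            "QUESTION: {query}\n"
            "STUDENT ANSWER: {result}\n"
            "TRUE ANSWER: {answer}\n"
            "GRADE:"
        ),
        input_variables=["query", "result", "answer"],
    )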

Methods

__init__(llm[, template, input_variables])

Initializes the LlmEval object.

evaluate(inputs, predictions[, ...])

Evaluate question answering examples and predictions.

evaluate_batch(examples)

Evaluates a batch of examples using the language model.

evaluate_example(example)

Evaluates a single example using the language model.

evaluate(inputs: List[dict], predictions: List[dict], question_key: str = 'question', answer_key: str = 'answer', prediction_key: str = 'result') → List[dict]#

Evaluate question answering examples and predictions.
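
A usage sketch based only on the signature above: inputs carries the question and reference answer under question_key and answer_key, predictions carries the model output under prediction_key, and the two lists are assumed to be aligned by index. The keys of the returned dicts are not documented here.

    # `evaluator` is an LlmEval instance, as constructed in the example above.
    inputs = [
        {"question": "What is the capital of France?", "answer": "Paris"},
        {"question": "How many legs does a spider have?", "answer": "Eight"},
    ]
    predictions = [
        {"result": "The capital of France is Paris."},
        {"result": "A spider has six legs."},
    ]

    # Keys default to 'question', 'answer', and 'result'; override the
    # *_key arguments if your dictionaries use different field names.
    graded = evaluator.evaluate(inputs, predictions)
    for verdict in graded:
        print(verdict)  # one evaluation dict per (input, prediction) pair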

evaluate_batch(examples: List[dict]) → List[dict]#

Evaluates a batch of examples using the language model.

Parameters:

examples – List of dictionaries containing input details.

Returns:

List of evaluation results for each example.

Return type:

List[dict]
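
A sketch for evaluate_batch. The exact keys each example dictionary must carry are not spelled out above; the assumption here is that they mirror the template's input_variables (query, result, answer).

    # `evaluator` is an LlmEval instance, as constructed earlier.
    examples = [
        {
            "query": "What is the capital of France?",
            "result": "The capital of France is Paris.",
            "answer": "Paris",
        },
        {
            "query": "How many legs does a spider have?",
            "result": "A spider has six legs.",
            "answer": "Eight",
        },
    ]

    # One evaluation dict is returned per example, in the same order.
    results = evaluator.evaluate_batch(examples)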

evaluate_example(example: dict) → dict#

Evaluates a single example using the language model.

Parameters:

example – Dictionary containing input details.

Returns:

Evaluation results with reasoning, value, and score.

Return type:

dict
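
A single-example sketch. As with evaluate_batch, the input keys are assumed to follow the template's input_variables; the returned dict carries the reasoning, value, and score described above, though the exact key names are not documented here.

    # `evaluator` is an LlmEval instance, as constructed earlier.
    example = {
        "query": "What is the capital of France?",
        "result": "The capital of France is Paris.",
        "answer": "Paris",
    }

    outcome = evaluator.evaluate_example(example)
    # Expected to contain the grader's reasoning, a CORRECT/INCORRECT value,
    # and a score; the exact key names are assumptions here.
    print(outcome)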