langtest.transform.utils.ResponseEvaluator#

class ResponseEvaluator(model, case_id, generator_model_name_or_path, generator_type, n=2)#

Bases: object

__init__(model, case_id, generator_model_name_or_path, generator_type, n=2)#
Minimal constructor for the evaluator (a usage sketch follows the parameter list):

  • model: The OpenAI model name or ID used for evaluation.

  • benchmark: An object providing criteria and case data (e.g., benchmark.get_criteria_scores).

  • case_id: The ID of the specific case to evaluate.

  • generator_model_name_or_path, generator_type, evaluator_model_name_or_path: Tracking info recorded in the aggregator’s output.

  • n: Number of parallel responses requested from the OpenAI chat completion.
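
A minimal instantiation sketch, passing only the parameters in the signature above; the model name, case ID, and generator identifiers are placeholder values, not values mandated by the library.

    from langtest.transform.utils import ResponseEvaluator

    evaluator = ResponseEvaluator(
        model="gpt-4o-mini",                          # placeholder OpenAI model name
        case_id="case_001",                           # placeholder benchmark case ID
        generator_model_name_or_path="my-generator",  # tracked in the aggregated output
        generator_type="huggingface",                 # placeholder generator type
        n=3,                                          # parallel evaluation responses to request
    )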

Methods

__init__(model, case_id, ...[, n])

Minimal constructor for the evaluator:

aggregate_results(benchmark_data, ...)

Aggregates scores for each question in evaluation_results and computes overall totals.

calculate_confidence_rate(res_mean)

Measures how strongly the average pass rate deviates from 0 or 1.

calculate_fail_rate(valid_evals_len)

Proportion of attempts that returned no valid evaluation.

calculate_major_vote(res_matrix)

Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.

calculate_score(criteria_scores, ...)

Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.

evaluate_reask_responses(benchmark_data, ...)

Evaluate re-asks (follow-up prompts) for a list of responses.

evaluate_responses(benchmark_data, ...)

Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.

extract_booleans(text)

Extracts 'true'/'false' from a string and converts to boolean values.

run(generator_response_str, criteria_list)

Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion.

aggregate_results(benchmark_data, evaluation_results)#

Aggregates scores for each question in evaluation_results and computes overall totals.

Returns:

Per-question scores plus a final totals row.

aggregated_data (dict): Dictionary of all case-level and question-level scores.

Return type:

results_df (pd.DataFrame)
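
A hedged usage sketch, assuming evaluation_results is the structure returned by evaluate_responses and that the returned DataFrame ends with the totals row described above; the exact packaging of the dict/DataFrame pair may differ between versions.

    # benchmark_data and evaluation_results come from earlier pipeline steps.
    results_df = evaluator.aggregate_results(benchmark_data, evaluation_results)
    print(results_df.tail(1))  # final totals row, per the docstring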

calculate_confidence_rate(res_mean)#

Measures how strongly the average pass rate deviates from 0 or 1. The closer each criterion is to an integer (0 or 1), the higher the confidence.
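
The docstring does not give the exact formula; the sketch below shows one plausible formulation of that description (distance of each average pass rate from 0.5, rescaled to [0, 1]), not necessarily the library's own.

    import numpy as np

    def confidence_rate_sketch(res_mean):
        """Illustrative only: 1.0 when every rate is exactly 0 or 1, 0.0 at 0.5."""
        return float(np.mean(2 * np.abs(np.asarray(res_mean) - 0.5)))

    confidence_rate_sketch([1.0, 0.0, 1.0])  # 1.0 -> fully confident
    confidence_rate_sketch([0.5, 0.5])       # 0.0 -> maximally uncertain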

calculate_fail_rate(valid_evals_len)#

Proportion of attempts that returned no valid evaluation.
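
Assuming valid_evals_len counts the attempts (out of the n requested in the constructor) that produced a usable evaluation, the behaviour amounts to something like:

    def fail_rate_sketch(valid_evals_len, n=2):
        """Illustrative only: fraction of the n attempts with no valid evaluation."""
        return 1.0 - valid_evals_len / n

    fail_rate_sketch(valid_evals_len=1, n=2)  # 0.5 -> one of two attempts failed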

calculate_major_vote(res_matrix)#

Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.
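
A sketch of the described behaviour, assuming res_matrix has one row per evaluation attempt and one column per criterion; the actual tie-breaking rule and confidence formula may differ.

    import numpy as np

    def major_vote_sketch(res_matrix):
        res = np.asarray(res_matrix, dtype=float)
        res_mean = res.mean(axis=0)              # average pass rate per criterion
        majority = res_mean >= 0.5               # majority-vote boolean vector
        confidence = float(np.mean(2 * np.abs(res_mean - 0.5)))
        return majority, res_mean, confidence

    major_vote_sketch([[True, False], [True, True], [False, False]])
    # (array([ True, False]), array([0.6667, 0.3333]), 0.3333)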

calculate_score(criteria_scores, evaluation_bools)#

Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.
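
This reduces to a weighted sum over the criteria that passed, e.g.:

    def score_sketch(criteria_scores, evaluation_bools):
        """Sum the weight of every criterion whose evaluation came back True."""
        return sum(score for score, passed in zip(criteria_scores, evaluation_bools) if passed)

    score_sketch([2.0, 1.0, 3.0], [True, False, True])  # 5.0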

evaluate_reask_responses(benchmark_data, reask_responses)#

Evaluate re-asks (follow-up prompts) for a list of responses.
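
A hedged call sketch; reask_responses is assumed to hold the follow-up answers keyed the same way as the initial responses.

    reask_results = evaluator.evaluate_reask_responses(benchmark_data, reask_responses)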

evaluate_responses(benchmark_data, responses, reask_responses)#

Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.
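
A hedged end-to-end sketch; benchmark_data, responses, and reask_responses are assumed to be the structures produced by the surrounding langtest pipeline, and the result is then fed to aggregate_results as shown above.

    evaluation_results = evaluator.evaluate_responses(
        benchmark_data=benchmark_data,
        responses=responses,
        reask_responses=reask_responses,
    )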

extract_booleans(text)#

Extracts ‘true’/’false’ from a string and converts to boolean values.
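
The equivalent logic can be sketched with a case-insensitive regular expression:

    import re

    def extract_booleans_sketch(text):
        """Pull every 'true'/'false' token out of the evaluator's reply as a bool."""
        return [tok.lower() == "true"
                for tok in re.findall(r"\b(true|false)\b", text, re.IGNORECASE)]

    extract_booleans_sketch("1. True 2. false 3. TRUE")  # [True, False, True]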

run(generator_response_str, criteria_list)#

Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion. Includes logic to split the criteria list if too many evaluations fail.
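
A hedged call-shape sketch; the criterion texts and generated_answer are placeholders, and the method is assumed to return one boolean per criterion.

    criteria_list = [
        "The answer cites at least one supporting source.",
        "The answer does not contradict the case description.",
    ]
    passed = evaluator.run(generator_response_str=generated_answer, criteria_list=criteria_list)
    # e.g. [True, False] -> the second criterion failed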