langtest.transform.utils.ResponseEvaluator#
- class ResponseEvaluator(model, case_id, generator_model_name_or_path, generator_type, n=2)#
Bases: object
- __init__(model, case_id, generator_model_name_or_path, generator_type, n=2)#
- Minimal constructor for the evaluator:
- model: The OpenAI model name or ID to use.
- benchmark: An object providing criteria and case data (e.g., benchmark.get_criteria_scores).
- case_id: The specific ID of the case to evaluate.
- generator_model_name_or_path, generator_type, evaluator_model_name_or_path: Tracking info for the aggregator's output.
- n: Number of parallel responses requested from the OpenAI chat completion.
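A minimal instantiation sketch, assuming the class is importable from langtest.transform.utils as shown in the heading; every argument value below is a placeholder for illustration, not a value taken from the library:

```python
from langtest.transform.utils import ResponseEvaluator

# Placeholder values only; the documented signature is
# ResponseEvaluator(model, case_id, generator_model_name_or_path, generator_type, n=2).
evaluator = ResponseEvaluator(
    model="gpt-4o-mini",                          # OpenAI model used as the judge
    case_id="case_001",                           # ID of the benchmark case to evaluate
    generator_model_name_or_path="my-generator",  # tracking info for the aggregator's output
    generator_type="llm",
    n=2,                                          # parallel judge completions per evaluation
)
```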
Methods
- __init__(model, case_id, ...[, n]): Minimal constructor for the evaluator.
- aggregate_results(benchmark_data, ...): Aggregates scores for each question in evaluation_results and computes overall totals.
- calculate_confidence_rate(res_mean): Measures how strongly the average pass rate deviates from 0 or 1.
- calculate_fail_rate(valid_evals_len): Proportion of attempts that returned no valid evaluation.
- calculate_major_vote(res_matrix): Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.
- calculate_score(criteria_scores, ...): Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.
- evaluate_reask_responses(benchmark_data, ...): Evaluate re-asks (follow-up prompts) for a list of responses.
- evaluate_responses(benchmark_data, ...): Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.
- extract_booleans(text): Extracts 'true'/'false' from a string and converts to boolean values.
- run(generator_response_str, criteria_list): Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion.
- aggregate_results(benchmark_data, evaluation_results)#
Aggregates scores for each question in evaluation_results and computes overall totals.
- Returns:
results_df (pd.DataFrame): Per-question scores plus a final totals row.
aggregated_data (dict): Dictionary of all case-level and question-level scores.
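To illustrate the documented output shape (per-question scores followed by a final totals row), here is a small self-contained pandas sketch; it mimics the result described above and is not the library's implementation:

```python
import pandas as pd

# Made-up per-question scores for one case, for illustration only.
results_df = pd.DataFrame(
    {"case_id": ["case_001", "case_001"], "question": ["q1", "q2"], "score": [3.0, 2.5]}
)

# Append a totals row, mirroring the "final totals row" the method is documented to add.
totals = pd.DataFrame(
    {"case_id": ["case_001"], "question": ["TOTAL"], "score": [results_df["score"].sum()]}
)
results_df = pd.concat([results_df, totals], ignore_index=True)
print(results_df)
```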
- calculate_confidence_rate(res_mean)#
Measures how strongly the average pass rate deviates from 0 or 1. The closer each criterion is to an integer (0 or 1), the higher the confidence.
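A worked numeric sketch of the idea, assuming one plausible formulation (confidence as the average distance of each pass rate from 0.5, rescaled to [0, 1]); the library's exact formula may differ:

```python
import numpy as np

def confidence_rate(res_mean):
    # Assumed formulation: pass rates near 0 or 1 yield confidence near 1,
    # pass rates near 0.5 yield confidence near 0. Not the library's exact formula.
    res_mean = np.asarray(res_mean, dtype=float)
    return float(np.mean(2 * np.abs(res_mean - 0.5)))

print(confidence_rate([1.0, 0.0, 1.0]))   # 1.0  -> unanimous verdicts, high confidence
print(confidence_rate([0.5, 0.6, 0.4]))   # ~0.13 -> split verdicts, low confidence
```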
- calculate_fail_rate(valid_evals_len)#
Proportion of attempts that returned no valid evaluation.
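Since valid_evals_len counts the attempts that did produce a usable evaluation, the fail rate is presumably taken against the number of requested attempts n; the sketch below assumes exactly that and is not the library's code:

```python
def fail_rate(valid_evals_len, n=2):
    # Assumption: n is the number of parallel judge attempts requested (default 2).
    return 1.0 - valid_evals_len / n

print(fail_rate(valid_evals_len=1, n=2))  # 0.5 -> one of two attempts returned no valid evaluation
```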
- calculate_major_vote(res_matrix)#
Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.
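A self-contained sketch of majority voting over an attempts-by-criteria boolean matrix, mirroring the three documented outputs (majority-vote vector, average pass rates, confidence); the row/column layout and the confidence formula (same assumption as in the calculate_confidence_rate sketch above) are illustrative:

```python
import numpy as np

# Rows = judge attempts, columns = criteria (layout assumed for illustration).
res_matrix = np.array([
    [True,  True,  False],
    [True,  False, False],
    [True,  True,  True ],
])

res_mean = res_matrix.mean(axis=0)                        # average pass rate per criterion
major_vote = res_mean >= 0.5                              # majority-vote boolean vector
confidence = float(np.mean(2 * np.abs(res_mean - 0.5)))   # assumed confidence measure

print(major_vote)   # [ True  True False]
print(res_mean)     # approx [1.0, 0.67, 0.33]
print(confidence)   # ~0.56
```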
- calculate_score(criteria_scores, evaluation_bools)#
Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.
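The description maps directly onto a conditional sum; a minimal self-contained sketch of the same idea, with made-up values:

```python
criteria_scores = [2.0, 1.5, 3.0]          # weight of each criterion
evaluation_bools = [True, False, True]     # pass/fail verdicts for the same criteria

# Sum only the scores whose evaluation is True.
score = sum(s for s, passed in zip(criteria_scores, evaluation_bools) if passed)
print(score)  # 5.0 -> 2.0 + 3.0, the criteria that passed
```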
- evaluate_reask_responses(benchmark_data, reask_responses)#
Evaluate re-asks (follow-up prompts) for a list of responses.
- evaluate_responses(benchmark_data, responses, reask_responses)#
Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.
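A hedged sketch of the documented flow, continuing the constructor example above; the structures of benchmark_data, responses, and reask_responses, and the unpacked return of aggregate_results, are assumptions:

```python
# benchmark_data, responses, and reask_responses are produced by the surrounding
# langtest pipeline; their exact structures are not documented here.
evaluation_results = evaluator.evaluate_responses(
    benchmark_data=benchmark_data,
    responses=responses,              # initial generator answers
    reask_responses=reask_responses,  # follow-up (re-ask) answers, if any
)

# Assumed return shape: (results_df, aggregated_data).
results_df, aggregated_data = evaluator.aggregate_results(benchmark_data, evaluation_results)
```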
- extract_booleans(text)#
Extracts ‘true’/’false’ from a string and converts to boolean values.
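A self-contained sketch of the described behavior (pull 'true'/'false' tokens out of judge output and convert them to Python booleans); the regex approach is illustrative, not the library's code:

```python
import re

def extract_booleans(text):
    # Find 'true'/'false' tokens case-insensitively and map them to booleans.
    tokens = re.findall(r"\btrue\b|\bfalse\b", text, flags=re.IGNORECASE)
    return [token.lower() == "true" for token in tokens]

print(extract_booleans("Criterion 1: True. Criterion 2: false. Criterion 3: TRUE"))
# [True, False, True]
```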
- run(generator_response_str, criteria_list)#
Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion. Includes logic to split the criteria list if too many evaluations fail.
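A hedged call sketch, continuing the constructor example above; it assumes the criteria are plain strings and that the return value is one boolean per criterion (both assumptions):

```python
criteria_list = [
    "The answer cites at least one source.",
    "The answer does not reveal personal data.",
]
generator_response_str = "According to the 2023 report, the rate rose; no personal data is mentioned."

# Assumed return: a list of booleans, one pass/fail verdict per criterion.
verdicts = evaluator.run(generator_response_str, criteria_list)
print(verdicts)  # e.g. [True, True]
```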