langtest.transform.utils.ResponseEvaluator#

class ResponseEvaluator(model, case_id, generator_model_name_or_path, generator_type, n=2)#

Bases: object

__init__(model, case_id, generator_model_name_or_path, generator_type, n=2)#
Minimal constructor for the evaluator (a usage sketch follows the parameter list):

  • model: The OpenAI model name or ID used for evaluation.

  • benchmark: An object providing criteria and case data (e.g., benchmark.get_criteria_scores).

  • case_id: The ID of the specific case to evaluate.

  • generator_model_name_or_path, generator_type, evaluator_model_name_or_path: Tracking info recorded in the aggregator’s output.

  • n: Number of parallel responses requested from the OpenAI chat completion.
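
A minimal instantiation sketch, passing only the parameters in the signature above; the model name, case ID, and generator identifiers are placeholder values, not values mandated by the library.

    from langtest.transform.utils import ResponseEvaluator

    evaluator = ResponseEvaluator(
        model="gpt-4o-mini",                          # placeholder OpenAI model name
        case_id="case_001",                           # placeholder benchmark case ID
        generator_model_name_or_path="my-generator",  # tracked in the aggregated output
        generator_type="huggingface",                 # placeholder generator type
        n=3,                                          # parallel evaluation responses to request
    )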

Methods

__init__(model, case_id, ...[, n])

Minimal constructor for the evaluator:

aggregate_results(benchmark_data, ...)

Aggregates scores for each question in evaluation_results and computes overall totals.

calculate_confidence_rate(res_mean)

Measures how strongly the average pass rate deviates from 0 or 1.

calculate_fail_rate(valid_evals_len)

Proportion of attempts that returned no valid evaluation.

calculate_major_vote(res_matrix)

Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.

calculate_score(criteria_scores, ...)

Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.

evaluate_reask_responses(benchmark_data, ...)

Evaluate re-asks (follow-up prompts) for a list of responses.

evaluate_responses(benchmark_data, ...)

Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.

extract_booleans(text)

Extracts 'true'/'false' from a string and converts to boolean values.

run(generator_response_str, criteria_list)

Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion.

aggregate_results(benchmark_data, evaluation_results)#

Aggregates scores for each question in evaluation_results and computes overall totals.

Returns:

Per-question scores plus a final totals row.

aggregated_data (dict): Dictionary of all case-level and question-level scores.

Return type:

results_df (pd.DataFrame)
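
A hedged usage sketch, assuming evaluation_results is the structure returned by evaluate_responses and that the returned DataFrame ends with the totals row described above; the exact packaging of the dict/DataFrame pair may differ between versions.

    # benchmark_data and evaluation_results come from earlier pipeline steps.
    results_df = evaluator.aggregate_results(benchmark_data, evaluation_results)
    print(results_df.tail(1))  # final totals row, per the docstring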

calculate_confidence_rate(res_mean)#

Measures how strongly the average pass rate deviates from 0 or 1. The closer each criterion is to an integer (0 or 1), the higher the confidence.
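
The docstring does not give the exact formula; the sketch below shows one plausible formulation of that description (distance of each average pass rate from 0.5, rescaled to [0, 1]), not necessarily the library's own.

    import numpy as np

    def confidence_rate_sketch(res_mean):
        """Illustrative only: 1.0 when every rate is exactly 0 or 1, 0.0 at 0.5."""
        return float(np.mean(2 * np.abs(np.asarray(res_mean) - 0.5)))

    confidence_rate_sketch([1.0, 0.0, 1.0])  # 1.0 -> fully confident
    confidence_rate_sketch([0.5, 0.5])       # 0.0 -> maximally uncertain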

calculate_fail_rate(valid_evals_len)#

Proportion of attempts that returned no valid evaluation.
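
Assuming valid_evals_len counts the attempts (out of the n requested in the constructor) that produced a usable evaluation, the behaviour amounts to something like:

    def fail_rate_sketch(valid_evals_len, n=2):
        """Illustrative only: fraction of the n attempts with no valid evaluation."""
        return 1.0 - valid_evals_len / n

    fail_rate_sketch(valid_evals_len=1, n=2)  # 0.5 -> one of two attempts failed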

calculate_major_vote(res_matrix)#

Returns a majority-vote boolean vector, average pass rates, and an overall confidence measure.
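
A sketch of the described behaviour, assuming res_matrix has one row per evaluation attempt and one column per criterion; the actual tie-breaking rule and confidence formula may differ.

    import numpy as np

    def major_vote_sketch(res_matrix):
        res = np.asarray(res_matrix, dtype=float)
        res_mean = res.mean(axis=0)              # average pass rate per criterion
        majority = res_mean >= 0.5               # majority-vote boolean vector
        confidence = float(np.mean(2 * np.abs(res_mean - 0.5)))
        return majority, res_mean, confidence

    major_vote_sketch([[True, False], [True, True], [False, False]])
    # (array([ True, False]), array([0.6667, 0.3333]), 0.3333)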

calculate_score(criteria_scores, evaluation_bools)#

Given a list of numeric criteria_scores and corresponding True/False evaluations, sum the scores where the evaluation is True.
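
This reduces to a weighted sum over the criteria that passed, e.g.:

    def score_sketch(criteria_scores, evaluation_bools):
        """Sum the weight of every criterion whose evaluation came back True."""
        return sum(score for score, passed in zip(criteria_scores, evaluation_bools) if passed)

    score_sketch([2.0, 1.0, 3.0], [True, False, True])  # 5.0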

evaluate_reask_responses(benchmark_data, reask_responses)#

Evaluate re-asks (follow-up prompts) for a list of responses.
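
A hedged call sketch; reask_responses is assumed to hold the follow-up answers keyed the same way as the initial responses.

    reask_results = evaluator.evaluate_reask_responses(benchmark_data, reask_responses)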

evaluate_responses(benchmark_data, responses, reask_responses)#

Main entry point for evaluating initial responses and any corresponding re-asks, then combining results.
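
A hedged end-to-end sketch; benchmark_data, responses, and reask_responses are assumed to be the structures produced by the surrounding langtest pipeline, and the result is then fed to aggregate_results as shown above.

    evaluation_results = evaluator.evaluate_responses(
        benchmark_data=benchmark_data,
        responses=responses,
        reask_responses=reask_responses,
    )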

extract_booleans(text)#

Extracts ‘true’/’false’ from a string and converts to boolean values.
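
The equivalent logic can be sketched with a case-insensitive regular expression:

    import re

    def extract_booleans_sketch(text):
        """Pull every 'true'/'false' token out of the evaluator's reply as a bool."""
        return [tok.lower() == "true"
                for tok in re.findall(r"\b(true|false)\b", text, re.IGNORECASE)]

    extract_booleans_sketch("1. True 2. false 3. TRUE")  # [True, False, True]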

run(generator_response_str, criteria_list)#

Evaluates the generator_response_str against a list of criteria, returning booleans indicating pass/fail for each criterion. Includes logic to split the criteria list if too many evaluations fail.
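
A hedged call-shape sketch; the criterion texts and generated_answer are placeholders, and the method is assumed to return one boolean per criterion.

    criteria_list = [
        "The answer cites at least one supporting source.",
        "The answer does not contradict the case description.",
    ]
    passed = evaluator.run(generator_response_str=generated_answer, criteria_list=criteria_list)
    # e.g. [True, False] -> the second criterion failed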