langtest.metrics.llm_eval.SummaryEval

class SummaryEval(llm, template: str = '\nYou are a clinical QA judge. Evaluate the **Generated Summary** against the **Dialogue** and score on a 1-10 integer scale. Do all reasoning silently and output **JSON only**.\n\nSCORING RUBRICS (1-10; integers only; clamp to [1,10]; if torn between two scores, choose the lower):\n1) Factual Completeness\n- 10: Includes all material facts present in the dialogue: chief complaint, history & pertinent negatives, meds/allergies, vitals/PE/test results (if any), assessment/impression, and plan/instructions.\n- 7-9: Minor non-critical omissions; all key facts present.\n- 4-6: Several important omissions or thin coverage of key sections.\n- 1-3: Misses most key facts; largely incomplete.\n\n2) No Hallucinations\n- 10: Every claim traceable to the dialogue; no invented diagnoses, vitals, meds, demographics, or timelines.\n- 7-9: One minor unsupported inference that doesn\'t alter clinical meaning.\n- 4-6: Multiple unsupported details or speculative statements.\n- 1-3: Significant fabricated/contradictory content (e.g., invented diagnoses/meds); unsafe leaps.\n\n3) Clinical Tone & Structure\n- 10: Clear, concise, neutral clinical tone; accurate terminology; organized (SOAP-like or equivalent); no emojis, marketing, or second-person coaching.\n- 7-9: Generally clinical with small style issues (minor verbosity/ordering).\n- 4-6: Mixed tone, informal language, or disorganized structure.\n- 1-3: Non-clinical tone, confusing, or unprofessional.\n\n4) Overall Quality\n- Compute as weighted average of the first three: 40% Factual Completeness, 40% No Hallucinations, 20% Clinical Tone & Structure. Round to nearest integer (0.5 rounds up), then clamp to [1,10].\n\nEVALUATION RULES\n- Compare strictly to the Dialogue. Penalize any PHI or specifics (age, dates, doses, findings) not supported by the dialogue.\n- Credit **pertinent negatives** only if explicitly stated or clearly implied in the dialogue.\n- Do not penalize for brevity if completeness is preserved.\n- If the summary is unrelated to the dialogue or empty, set all four scores to 1.\n- Ignore minor grammar/typos in the dialogue itself; focus on the summary\'s fidelity and clinical clarity.\n\nOUTPUT FORMAT (JSON only; no prose):\n{{\n  "Factual Completeness": <int 1-10>,\n  "No Hallucinations": <int 1-10>,\n  "Clinical Tone & Structure": <int 1-10>\n}}\n\n### Dialogue\n{dialogue}\n\n### Generated Summary\n{summary}\n\n', input_variables: List[str] = ['context', 'summary'])

Bases: object

Evaluator for clinical summaries generated from doctor-patient dialogues. It uses the supplied LLM as a judge, scoring each summary against the dialogue with a rubric-based prompt template (factual completeness, hallucinations, and clinical tone & structure).
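
A minimal instantiation sketch. The concrete wrapper type expected by llm is not specified on this page, so the judge model below is a placeholder:

    from langtest.metrics.llm_eval import SummaryEval

    llm = ...  # placeholder: your LLM judge instance; the accepted wrapper
               # type (e.g., a LangChain-style model) is an assumption, not
               # documented on this page
    evaluator = SummaryEval(llm=llm)  # defaults: clinical rubric template,
                                      # input_variables=['context', 'summary']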

__init__(llm, template: str = <default clinical QA judge template shown above>, input_variables: List[str] = ['context', 'summary'])
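
The "Overall Quality" rule in the default template reduces to a simple weighted average. The helper below is an illustrative sketch of that computation, not part of the class API; Python's built-in round() uses banker's rounding, so floor(x + 0.5) is used to get the rubric's "0.5 rounds up" behavior:

    import math

    def overall_quality(factual: int, hallucination: int, tone: int) -> int:
        # Weighted average per the default rubric: 40% Factual Completeness,
        # 40% No Hallucinations, 20% Clinical Tone & Structure.
        raw = 0.4 * factual + 0.4 * hallucination + 0.2 * tone
        rounded = math.floor(raw + 0.5)  # "0.5 rounds up", not banker's rounding
        return max(1, min(10, rounded))  # clamp to [1, 10]

    # Example: overall_quality(8, 9, 7) -> 0.4*8 + 0.4*9 + 0.2*7 = 8.2 -> 8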

Methods

__init__(llm[, template, input_variables])
    Initialize the evaluator with an LLM judge and, optionally, a custom prompt template and input variables.

evaluate(inputs, predictions)
    Evaluate a list of dialogue-summary pairs.

evaluate_batch(inputs, predictions)
    Alias for evaluate; a placeholder for a future batch implementation.

evaluate(inputs: dict, predictions: dict) → List[dict]

Evaluate a list of dialogue-summary pairs.
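
Continuing the instantiation sketch above. The dict schemas for inputs and predictions are not documented here; the key names below mirror the template placeholders {dialogue} and {summary} and are assumptions:

    inputs = {"dialogue": "Doctor: What brings you in today? Patient: ..."}
    predictions = {"summary": "CC: ... HPI: ... Assessment: ... Plan: ..."}

    results = evaluator.evaluate(inputs=inputs, predictions=predictions)
    # Per the template's OUTPUT FORMAT, each result should carry the three
    # integer sub-scores, e.g.:
    # [{"Factual Completeness": 8,
    #   "No Hallucinations": 9,
    #   "Clinical Tone & Structure": 7}]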

evaluate_batch(inputs: List[dict], predictions: List[dict]) → List[dict]

Alias for evaluate; a placeholder for a future batch implementation.
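
A hypothetical batch call, assuming the same per-item dict shapes as above; since evaluate_batch is documented as an alias for evaluate, it simply takes lists of those dicts:

    batch_results = evaluator.evaluate_batch(
        inputs=[{"dialogue": "Doctor: ... Patient: ..."}],
        predictions=[{"summary": "CC: ... Plan: ..."}],
    )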