Comprehensive Evaluation Framework¶
Use Draive's evaluation primitives to score model outputs consistently and keep quality criteria transparent. This guide walks through evaluators, scenarios, suites, and supporting patterns for building end-to-end evaluation flows.
Evaluator Basics¶
- Evaluators are async callables decorated with @evaluator that return an EvaluationScore or a compatible numeric value.
- Thresholds determine whether an evaluation passes; named levels ("perfect", "excellent", "good", "fair", "poor") are easier to reason about than raw floats.
- EvaluationScore.performance is reported as a percentage and can exceed 100 when a score comfortably beats its threshold.
Working with EvaluationScore¶
from draive import EvaluationScore
score_from_float = EvaluationScore.of(0.85)  # normalized value between 0.0 and 1.0
score_from_label = EvaluationScore.of("good")  # named level
score_from_boolean = EvaluationScore.of(True)  # pass/fail mapped to a score
Defining an evaluator¶
from draive import evaluator
@evaluator(name="length_check", threshold="excellent")
async def check_response_length(value: str, min_length: int = 100) -> float:
actual_length = len(value)
if actual_length >= min_length:
return 1.0
return actual_length / min_length
# Prepared evaluators freeze arguments for reuse
strict_length_check = check_response_length.prepared(min_length=200)
result = await strict_length_check("This is a test response...")
assert result.passed # True when score >= excellent (0.7)
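Evaluators may also return an EvaluationScore directly, which keeps intent readable when scoring in named levels. A minimal sketch (the greeting heuristic is purely illustrative):
from draive import EvaluationScore, evaluator
@evaluator(name="greeting_check", threshold="good")
async def check_greeting(value: str) -> EvaluationScore:
    # Named levels keep the scoring intent readable; the check itself is a stand-in.
    if value.lower().startswith(("hello", "hi", "hey")):
        return EvaluationScore.of("perfect")
    return EvaluationScore.of("poor")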
Combining Evaluators with Scenarios¶
Use evaluator_scenario to bundle related evaluators and evaluate to execute them together.
from collections.abc import Sequence
from draive.evaluation import evaluate, evaluator_scenario, EvaluatorResult
@evaluator_scenario(name="quality_checks")
async def evaluate_response_quality(value: str, context: str) -> Sequence[EvaluatorResult]:
return await evaluate(
value,
check_response_length.prepared(),
check_sentiment.prepared(),
check_relevance.prepared(context=context),
check_grammar.prepared(),
)
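A usage sketch, assuming the decorated scenario is awaited with the evaluated value plus the extra parameters from its signature, and that each result exposes the same passed flag shown above:
results = await evaluate_response_quality(
    "Generated answer text...",
    context="Reference documentation excerpt",
)
# Collect the checks that fell below their thresholds.
failed_checks = [result for result in results if not result.passed]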
The evaluate helper can run evaluators concurrently. Limit concurrency when evaluators hit rate-limited services.
async def evaluate_response_quality_parallel(value: str, context: str) -> Sequence[EvaluatorResult]:
return await evaluate(
value,
check_response_length.prepared(),
check_sentiment.prepared(),
check_relevance.prepared(context=context),
check_grammar.prepared(),
concurrent_tasks=2,
)
Evaluator Suites for Regression Testing¶
Suites persist test cases, run them in bulk, and expose reporting helpers.
from pathlib import Path
from collections.abc import Sequence
from draive import DataModel, evaluator_suite
from draive.evaluation import EvaluatorResult
class QATestCase(DataModel):
question: str
expected_topics: list[str]
min_length: int = 100
@evaluator_suite(
QATestCase,
name="qa_validation",
storage=Path("./test_cases.json"),
concurrent_evaluations=5,
)
async def qa_test_suite(parameters: QATestCase) -> Sequence[EvaluatorResult]:
answer = await generate_answer(parameters.question)
return [
await check_response_length(answer, parameters.min_length),
await check_topic_coverage(answer, parameters.expected_topics),
await check_factual_accuracy(answer, parameters.question),
]
# Persist a new case in the suite's storage
await qa_test_suite.add_case(
    QATestCase(
        question="What is machine learning?",
        expected_topics=["algorithms", "data", "training"],
        min_length=150,
    )
)
all_cases = await qa_test_suite.cases()  # list the stored cases
full_results = await qa_test_suite()  # run every stored case
sample_results = await qa_test_suite(5)  # run a sample of 5 cases
partial_results = await qa_test_suite(0.3)  # run a 0.3 fraction of the cases
specific_results = await qa_test_suite(["case-1", "case-2"])  # run selected cases by identifier
report = full_results.report(detailed=True, include_passed=False)
Composing and Transforming Evaluators¶
from draive.evaluation import Evaluator
# Combined score is the lowest of the component scores (strictest wins)
conservative_eval = Evaluator.lowest(
    evaluator1.prepared(),
    evaluator2.prepared(),
    evaluator3.prepared(),
)
# Combined score is the highest of the component scores (most favorable wins)
optimistic_eval = Evaluator.highest(
    evaluator1.prepared(),
    evaluator2.prepared(),
)
# Transform inputs before delegation
field_evaluator = my_evaluator.contra_map(MyModel._.attribute.path)
normalized = my_evaluator.contra_map(lambda data: data["response"].strip().lower())
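contra_map is handy for pointing a text evaluator at a single field of structured output; a hedged sketch (ChatResponse is an illustrative model, not part of draive):
from draive import DataModel
class ChatResponse(DataModel):
    text: str
    model_name: str
# Evaluate only the generated text, ignoring surrounding metadata.
text_length_check = check_response_length.contra_map(lambda response: response.text)
result = await text_length_check(ChatResponse(text="A sufficiently long answer...", model_name="example"))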
Stateful Evaluation with Haiway¶
from haiway import State, ctx
from draive import evaluator
class EvaluationConfig(State):
strict_mode: bool = False
max_retries: int = 3
@evaluator(threshold="perfect", state=[EvaluationConfig(strict_mode=True)])
async def strict_evaluator(value: str) -> float:
config = ctx.state(EvaluationConfig)
if config.strict_mode:
# Apply stricter logic
return await evaluate_strict(value)
return await evaluate_lenient(value)
Threshold Strategy¶
from draive.evaluators import (
coherence_evaluator,
completeness_evaluator,
consistency_evaluator,
creativity_evaluator,
factual_accuracy_evaluator,
forbidden_keywords_evaluator,
groundedness_evaluator,
helpfulness_evaluator,
readability_evaluator,
required_keywords_evaluator,
safety_evaluator,
similarity_evaluator,
tone_style_evaluator,
)
safety_check = safety_evaluator.with_threshold("perfect")
consistency_check = consistency_evaluator.with_threshold("perfect")
forbidden_check = forbidden_keywords_evaluator.with_threshold("perfect")
helpfulness_check = helpfulness_evaluator.with_threshold("excellent")
factual_accuracy_check = factual_accuracy_evaluator.with_threshold("excellent")
tone_style_check = tone_style_evaluator.with_threshold("excellent")
completeness_check = completeness_evaluator.with_threshold("good")
creativity_check = creativity_evaluator.with_threshold("good")
readability_check = readability_evaluator.with_threshold("good")
similarity_check = similarity_evaluator.with_threshold("fair")
keyword_check = required_keywords_evaluator.with_threshold("fair")
precise_check = factual_accuracy_evaluator.with_threshold(0.85)
Threshold guidelines
- Safety & compliance: use "perfect"; violations are unacceptable.
- Core quality: use "excellent" for user-facing content.
- Supportive signals: use "good" or lower when outcomes are subjective.
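The same check can gate strictly in one pipeline and loosely in another; a brief illustration reusing the custom length evaluator from earlier (assuming custom evaluators expose the same with_threshold helper as the built-ins):
# Identical scoring logic, different pass/fail verdicts.
strict_gate = check_response_length.with_threshold("excellent")
lenient_gate = check_response_length.with_threshold("fair")
strict_result = await strict_gate("A short draft...")
lenient_result = await lenient_gate("A short draft...")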
Rich Metadata¶
from datetime import datetime
from draive import EvaluationScore, evaluator
@evaluator
async def evaluate_with_context(response: str) -> EvaluationScore:
score, issues = await analyze_response(response)
return EvaluationScore(
value=score,
meta={
"timestamp": datetime.now().isoformat(),
"issues_found": issues,
"evaluation_model": "gpt-4",
"confidence": 0.85,
},
)
Generating Test Cases¶
Suites can synthesize additional cases from representative examples and free-form guidelines, optionally persisting them alongside manually added ones.
examples = [
TestCase(
input="Complex technical scenario",
expected_behavior="Detailed technical response",
edge_cases=["unicode", "special chars", "empty input"],
),
TestCase(
input="Simple query",
expected_behavior="Concise response",
edge_cases=["typos", "ambiguity"],
),
]
cases = await suite.generate_cases(
count=20,
examples=examples,
guidelines="""
Generate diverse test cases covering:
- Different complexity levels
- Various input formats
- Edge cases and error conditions
- Performance boundaries
""",
persist=True,
)
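Once persisted, generated cases run through the same suite invocation and reporting calls shown earlier:
# Generated cases join the stored set and run like hand-written ones.
regression_results = await suite()
report = regression_results.report(detailed=True, include_passed=False)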
Summary¶
- Flexible scoring with normalized values and named levels
- Composable evaluators with thresholds and metadata
- Scenario grouping for related checks
- Suite management with persistent storage and generation tools
- Reporting helpers for insight into failures and regressions
- Concurrent execution to balance latency and throughput
Use evaluators for quick checks, scenarios for logical groupings, and suites for comprehensive regression coverage backed by persistent cases and automated generation.