Comprehensive Evaluation Framework

The Draive evaluation framework provides a systematic approach to evaluating AI model outputs, workflows, and applications. It offers multiple levels of evaluation from individual checks to comprehensive test suites with automated case generation.

Overview

The evaluation framework consists of three main components:

  1. Evaluators - Individual evaluation functions with configurable thresholds
  2. Scenarios - Groups of evaluators for comprehensive testing
  3. Suites - Test suite management with case generation and storage

Core Concepts

Evaluation Scores

All evaluations produce normalized scores between 0 and 1:

from draive import EvaluationScore

# Create scores from various formats
score1 = EvaluationScore.of(0.85)  # Direct float
score2 = EvaluationScore.of("good")  # Named levels
score3 = EvaluationScore.of(True)  # Boolean (1.0 for True, 0.0 for False)

# Named score levels - prefer these over numeric values for better readability
# "none" = 0.0, "poor" = 0.1, "fair" = 0.3 "good" = 0.5, "excellent" = 0.7, "perfect" = 0.9, "max" = 1.0

Performance Metrics

Performance is calculated as (score / threshold) * 100%:

  • Values can exceed 100% when score > threshold
  • Aggregate calculations cap each individual performance at 100%, so an over-performing evaluator cannot inflate the average or mask another evaluator's failure
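
A minimal sketch of that arithmetic in plain Python (illustrative only, not the library's internal implementation):

def performance(score: float, threshold: float) -> float:
    """Raw performance as a percentage of the threshold."""
    return (score / threshold) * 100

def aggregate_performance(pairs: list[tuple[float, float]]) -> float:
    """Average performance across (score, threshold) pairs, capping each at 100%."""
    capped = [min(performance(score, threshold), 100.0) for score, threshold in pairs]
    return sum(capped) / len(capped)

# A score of 0.85 against a "good" (0.5) threshold is 170% raw,
# but contributes at most 100% to an aggregate.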

Creating Evaluators

Basic Evaluator

from draive import evaluator, EvaluationScore

@evaluator(name="length_check") # custom name, default is function name
async def check_response_length(value: str, min_length: int = 100) -> float:
    """Evaluate if response meets minimum length requirements."""
    actual_length = len(value)
    if actual_length >= min_length:
        return 1.0
    return actual_length / min_length # return score directly, must be a value between 0.0 and 1.0

# Use the evaluator
result = await check_response_length("This is a test response...")
print(f"Passed: {result.passed}")  # True if score >= "excellent" (0.7)
print(f"Performance: {result.performance:.1f}%")

Evaluator with Metadata

@evaluator(threshold="good") # custom threshold, default is 1 (max)
async def check_sentiment(value: str) -> EvaluationScore:
    """Evaluate text sentiment using an LLM."""
    sentiment_score = await analyze_sentiment(value)

    return EvaluationScore(
        value=sentiment_score,
        meta={ # return score with additional metadata
            "sentiment": "positive" if sentiment_score > 0.5 else "negative",
            "confidence": 0.95
        }
    )

Prepared Evaluators

Pre-bind arguments for reusable configurations:

# Create a prepared evaluator with fixed parameters
strict_length_check = check_response_length.prepared(min_length=200)

# Use it multiple times
result1 = await strict_length_check("Response 1...")
result2 = await strict_length_check("Response 2...")
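
Prepared evaluators also pair with threshold overrides when the same check serves different contexts. A short sketch reusing check_response_length (the chaining order here is an assumption; verify it against the library):

# Same evaluator, different configuration per surface
tweet_length_check = check_response_length.with_threshold("good").prepared(min_length=50)
article_length_check = check_response_length.prepared(min_length=500)

result = await tweet_length_check("A short reply...")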

Evaluator Scenarios

Scenarios group multiple evaluators for comprehensive testing:

from collections.abc import Sequence
from draive.evaluation import evaluate, evaluator_scenario, EvaluatorResult

@evaluator_scenario(name="quality_checks")
async def evaluate_response_quality(
    value: str,
    context: str
) -> Sequence[EvaluatorResult]:
    """Run multiple quality checks on a response."""
    return await evaluate(
        value,
        check_response_length.prepared(),
        check_sentiment.prepared(),
        check_relevance.prepared(context=context),
        check_grammar.prepared(),
    )

# Run the evaluation
evaluation_results = await evaluate_response_quality(
    response="The model's response...",
    context="Original question context"
)

# Process results
all_passed = all(result.passed for result in evaluation_results)
avg_performance = sum(result.performance for result in evaluation_results) / len(evaluation_results)

print(f"All evaluations passed: {all_passed}")
print(f"Average performance: {avg_performance:.1f}%")

for result in evaluation_results:
    print(f"- {result.evaluator}: {result.score.value} ({'✓' if result.passed else '✗'})")

Concurrent Evaluator Execution

For better performance, run evaluators concurrently by passing concurrent_tasks to the evaluate helper:

from draive.evaluation import evaluate

async def evaluate_response_quality_parallel(
    value: str,
    context: str
) -> Sequence[EvaluatorResult]:
    """Run multiple quality checks concurrently for better performance."""

    # Execute all evaluators concurrently using the evaluate helper
    return await evaluate(
        value,
        check_response_length.prepared(),
        check_sentiment.prepared(),
        check_relevance.prepared(context=context),
        check_grammar.prepared(),
        concurrent_tasks=2  # Run up to 2 evaluators in parallel
    )

Tip: Concurrent execution is especially beneficial when evaluators make network calls (e.g., to LLMs) or perform I/O operations. The concurrent_tasks parameter controls the maximum number of evaluators running simultaneously.

Evaluation Suites

Suites provide comprehensive test management with storage and case generation:

Creating a Suite

from draive import evaluator_suite, DataModel
from pathlib import Path

class QATestCase(DataModel):
    question: str
    expected_topics: list[str]
    min_length: int = 100

@evaluator_suite(
    QATestCase,
    name="qa_validation",
    storage=Path("./test_cases.json"),  # Persistent storage
    concurrent_evaluations=5  # Run 5 cases in parallel
)
async def qa_test_suite(
    parameters: QATestCase
) -> Sequence[EvaluatorResult]:
    """Evaluate question-answer pairs."""
    # Generate answer using your QA system
    answer = await generate_answer(parameters.question)

    # Run evaluations
    return [
        await check_response_length(answer, parameters.min_length),
        await check_topic_coverage(answer, parameters.expected_topics),
        await check_factual_accuracy(answer, parameters.question)
    ]

Managing Test Cases

# Add test cases manually
await qa_test_suite.add_case(
    QATestCase(
        question="What is machine learning?",
        expected_topics=["algorithms", "data", "training"],
        min_length=150
    )
)

# Generate cases automatically using LLM
generated_cases = await qa_test_suite.generate_cases(
    count=10,
    persist=True,  # Save to storage
    guidelines="Focus on technical questions about AI and ML",
    examples=[  # Provide examples for better generation
        QATestCase(
            question="How does gradient descent work?",
            expected_topics=["optimization", "loss", "parameters"]
        )
    ]
)

# List all cases
all_cases = await qa_test_suite.cases()
print(f"Total cases: {len(all_cases)}")

# Remove a case
await qa_test_suite.remove_case("case-id-123")

Running Suite Evaluations

# Run all test cases
full_results = await qa_test_suite()

# Run specific number of random cases
sample_results = await qa_test_suite(5)

# Run percentage of cases
partial_results = await qa_test_suite(0.3)  # 30% of cases

# Run specific cases by ID
specific_results = await qa_test_suite(["case-1", "case-2"])

# Generate comprehensive report
print(
    full_results.report(
        detailed=True,
        include_passed=False  # Show only failures
    )
)

Advanced Features

Composition and Transformation

from draive import Evaluator

# Compose evaluators - return the lowest/highest score
conservative_eval = Evaluator.lowest(
    evaluator1.prepared(),
    evaluator2.prepared(),
    evaluator3.prepared()
)

optimistic_eval = Evaluator.highest(
    evaluator1.prepared(),
    evaluator2.prepared()
)

# Transform inputs before evaluation

# Extract specific field from complex object using AttributePath
field_evaluator = my_evaluator.contra_map(MyModel._.attribute.path)

# Apply custom transformation
transformed = my_evaluator.contra_map(
    lambda data: data["response"].strip().lower()
)
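
Composed and transformed evaluators behave like any other evaluator. A brief usage sketch, assuming the evaluators defined above:

# The composition produces a single result carrying the lowest score
result = await conservative_eval("The model's response...")
print(f"Lowest score: {result.score.value} ({'✓' if result.passed else '✗'})")

# The transformed evaluator accepts the richer input and evaluates the extracted string
result = await transformed({"response": "  Some Response  "})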

State Management

Include additional state in evaluation context:

from draive import ctx, State

class EvaluationConfig(State):
    strict_mode: bool = False
    max_retries: int = 3

@evaluator(
    threshold="perfect",  # Named threshold for clarity
    state=[EvaluationConfig(strict_mode=True)]
)
async def strict_evaluator(value: str) -> float:
    config = ctx.state(EvaluationConfig)
    if config.strict_mode:
        # Apply stricter evaluation logic
        ...
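
A fuller sketch with the elided logic filled in; the length rule here is purely illustrative, not part of the library:

@evaluator(
    threshold="perfect",
    state=[EvaluationConfig(strict_mode=True)]
)
async def strict_length_evaluator(value: str) -> float:
    config = ctx.state(EvaluationConfig)
    # Hypothetical rule: strict mode demands at least 200 characters
    required = 200 if config.strict_mode else 100
    return min(len(value) / required, 1.0)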

Best Practices

1. Threshold Selection

Choose appropriate thresholds based on criticality and business impact:

from draive.evaluators import *

# Critical features - highest threshold (perfect = 0.9)
safety_check = safety_evaluator.with_threshold("perfect")
consistency_check = consistency_evaluator.with_threshold("perfect")
forbidden_check = forbidden_keywords_evaluator.with_threshold("perfect")

# Important features - high threshold (excellent = 0.7)
helpfulness_check = helpfulness_evaluator.with_threshold("excellent")
factual_accuracy_check = factual_accuracy_evaluator.with_threshold("excellent")
tone_style_check = tone_style_evaluator.with_threshold("excellent")

# Quality features - moderate threshold (good = 0.5)
completeness_check = completeness_evaluator.with_threshold("good")
creativity_check = creativity_evaluator.with_threshold("good")
readability_check = readability_evaluator.with_threshold("good")

# Flexible features - lower threshold (fair = 0.3)
similarity_check = similarity_evaluator.with_threshold("fair")
keyword_check = required_keywords_evaluator.with_threshold("fair")

# Custom precise thresholds when needed
precise_check = factual_accuracy_evaluator.with_threshold(0.85)  # Between excellent (0.7) and perfect (0.9)

Threshold Guidelines by Use Case:

  • Safety & Compliance: Always use "perfect" - no tolerance for violations
  • Core Quality: Use "excellent" - high standards for user-facing content
  • Feature Quality: Use "good" - balanced standards allowing some flexibility
  • Experimental/Optional: Use "fair" - minimum acceptable standards

2. Meaningful Metadata

Include context in evaluation results:

from datetime import datetime

@evaluator()
async def evaluate_with_context(response: str) -> EvaluationScore:
    score, issues = await analyze_response(response)

    return EvaluationScore(
        value=score,
        meta={
            "timestamp": datetime.now().isoformat(),
            "issues_found": issues,
            "evaluation_model": "gpt-4",
            "confidence": 0.85
        }
    )

3. Effective Case Generation

Provide good examples for LLM-based generation:

# Define diverse, high-quality examples
examples = [
    TestCase(
        input="Complex technical scenario",
        expected_behavior="Detailed technical response",
        edge_cases=["unicode", "special chars", "empty input"]
    ),
    TestCase(
        input="Simple query",
        expected_behavior="Concise response",
        edge_cases=["typos", "ambiguity"]
    )
]

# Generate with clear guidelines
cases = await suite.generate_cases(
    count=20,
    examples=examples,
    guidelines="""
    Generate diverse test cases covering:
    - Different complexity levels
    - Various input formats
    - Edge cases and error conditions
    - Performance boundaries
    """
)

Summary

The Draive evaluation framework provides:

  • Flexible scoring with normalized values and named levels
  • Composable evaluators with thresholds and metadata
  • Scenario grouping for comprehensive testing
  • Suite management with storage and case generation
  • Performance tracking with detailed reporting
  • Concurrent execution for efficient evaluation

Use evaluators for quick checks, scenarios for related validations, and suites for comprehensive testing with persistent test cases and automated generation.