Evaluator Catalog: Complete Guide to Available Evaluators¶
This comprehensive guide covers all predefined evaluators in the Draive framework, organized by category with practical examples and real-world usage patterns.
Overview¶
Draive provides 21 predefined evaluators covering all major dimensions of LLM evaluation:
- Quality Evaluators: Assess content structure, style, and linguistic quality
- Reference-Based Evaluators: Compare generated content against reference material
- User-Focused Evaluators: Measure how well content serves user needs
- Safety & Content Evaluators: Detect harmful content, verify factual accuracy, and assess tone, style, and creativity
- Similarity Evaluators: Measure semantic and vector-based similarity
- Utility Evaluators: Rule-based checks for specific requirements
All evaluators return scores on a consistent five-point scale: poor (0.1), fair (0.3), good (0.5), excellent (0.7), and perfect (0.9), with none (0.0) for content that cannot be evaluated.
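The categorical labels map to fixed numeric values, which is what threshold-based checks compare against. A minimal, illustrative sketch of that mapping and a pass/fail comparison (the helper below is hypothetical; in practice with_threshold handles this for you):
# Illustrative only: numeric values behind the categorical scale above.
SCORE_VALUES = {
    "none": 0.0,
    "poor": 0.1,
    "fair": 0.3,
    "good": 0.5,
    "excellent": 0.7,
    "perfect": 0.9,
}

def meets_threshold(level: str, threshold: str) -> bool:
    # Hypothetical helper mirroring what with_threshold(...) checks for you.
    return SCORE_VALUES[level] >= SCORE_VALUES[threshold]

assert meets_threshold("excellent", "good")      # 0.7 >= 0.5
assert not meets_threshold("fair", "excellent")  # 0.3 < 0.7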
Quality Evaluators¶
Coherence Evaluator¶
Purpose: Evaluates logical flow and structural organization of content.
from draive.evaluators import coherence_evaluator
reference_text = "AI systems process data through algorithms. These algorithms learn patterns. Learned patterns enable predictions."
generated_text = "Machine learning algorithms analyze data to identify patterns, which then enable accurate predictions about future outcomes."
result = await coherence_evaluator(
generated_text,
reference=reference_text,
guidelines="Focus on logical flow and clear transitions between ideas"
)
print(f"Coherence: {result.score.value}") # Output: "excellent"
Best for: Evaluating reports, explanations, structured content where logical organization matters.
Conciseness Evaluator¶
Purpose: Measures brevity while maintaining completeness of information.
from draive.evaluators import conciseness_evaluator
reference_text = "Python is a programming language"
verbose_text = "Python happens to be a high-level programming language that is widely used"
concise_text = "Python is a popular programming language"
verbose_result = await conciseness_evaluator(verbose_text, reference=reference_text)
concise_result = await conciseness_evaluator(concise_text, reference=reference_text)
print(f"Verbose: {verbose_result.score.value}") # Output: "fair"
print(f"Concise: {concise_result.score.value}") # Output: "excellent"
Best for: Executive summaries, product descriptions, social media content where brevity is valued.
Fluency Evaluator¶
Purpose: Assesses grammar, spelling, punctuation, and natural language flow.
from draive.evaluators import fluency_evaluator
poor_text = "This text have many grammar error and spelling mistaks."
good_text = "This text has proper grammar and correct spelling throughout."
poor_result = await fluency_evaluator(poor_text)
good_result = await fluency_evaluator(good_text)
print(f"Poor fluency: {poor_result.score.value}") # Output: "poor"
print(f"Good fluency: {good_result.score.value}") # Output: "perfect"
Best for: All content types as a basic quality check, especially user-facing text.
Readability Evaluator¶
Purpose: Evaluates ease of understanding and accessibility of content.
from draive.evaluators import readability_evaluator
complex_text = "The utilization of sophisticated methodological approaches necessitates comprehensive understanding."
simple_text = "Using advanced methods requires deep understanding."
complex_result = await readability_evaluator(complex_text)
simple_result = await readability_evaluator(simple_text)
print(f"Complex: {complex_result.score.value}") # Output: "poor"
print(f"Simple: {simple_result.score.value}") # Output: "excellent"
Best for: Educational content, documentation, public-facing materials targeting broad audiences.
Reference-Based Evaluators¶
Coverage Evaluator¶
Purpose: Ensures generated content includes all key points from reference material.
from draive.evaluators import coverage_evaluator
reference = """
Key features of renewable energy:
1. Sustainable and environmentally friendly
2. Reduces carbon emissions significantly
3. Lower long-term operational costs
4. Energy independence from fossil fuels
"""
partial_coverage = "Renewable energy is sustainable and reduces emissions."
full_coverage = "Renewable energy offers sustainability, reduces carbon emissions, provides cost savings, and ensures energy independence."
partial_result = await coverage_evaluator(partial_coverage, reference=reference)
full_result = await coverage_evaluator(full_coverage, reference=reference)
print(f"Partial: {partial_result.score.value}") # Output: "fair"
print(f"Full: {full_result.score.value}") # Output: "excellent"
Best for: Summarization tasks, report generation, ensuring comprehensive information transfer.
Consistency Evaluator¶
Purpose: Checks factual alignment and consistency with reference content.
from draive.evaluators import consistency_evaluator
reference = "The company was founded in 2010 and has 500 employees."
consistent_text = "Founded in 2010, the company now employs 500 people."
inconsistent_text = "The company, established in 2008, has grown to 750 employees."
consistent_result = await consistency_evaluator(consistent_text, reference=reference)
inconsistent_result = await consistency_evaluator(inconsistent_text, reference=reference)
print(f"Consistent: {consistent_result.score.value}") # Output: "perfect"
print(f"Inconsistent: {inconsistent_result.score.value}") # Output: "poor"
Best for: Fact-checking, ensuring generated content aligns with source material, preventing hallucinations.
Groundedness Evaluator¶
Purpose: Verifies content is anchored in and traceable to source material.
from draive.evaluators import groundedness_evaluator
source_material = """
Research Study Results:
- 78% of participants showed improvement
- Study duration: 6 months
- Sample size: 1,200 participants
- Published in Journal of Medical Research, 2023
"""
grounded_text = "According to a 2023 study in the Journal of Medical Research involving 1,200 participants over 6 months, 78% showed improvement."
ungrounded_text = "Most people improve with this treatment, and studies show it's very effective for everyone."
grounded_result = await groundedness_evaluator.with_threshold("excellent")(grounded_text, reference=source_material)
ungrounded_result = await groundedness_evaluator.with_threshold("excellent")(ungrounded_text, reference=source_material)
print(f"Grounded: {grounded_result.score.value} ({'✓' if grounded_result.passed else '✗'})") # Output: "excellent ✓"
print(f"Ungrounded: {ungrounded_result.score.value} ({'✓' if ungrounded_result.passed else '✗'})") # Output: "poor ✗"
Best for: Scientific content, legal documents, journalism, any content requiring citation and source attribution.
Relevance Evaluator¶
Purpose: Measures selection of important information while avoiding redundancy.
from draive.evaluators import relevance_evaluator
reference = """
Product Features:
- 256GB storage capacity
- 12-hour battery life
- Waterproof design (IP68 rating)
- 4K video recording capability
- Wireless charging support
- Advanced security features
"""
relevant_text = "Key highlights: 256GB storage, 12-hour battery, waterproof (IP68), and 4K video recording."
irrelevant_text = "This product has storage and battery. It also comes in a nice box with documentation and warranty information."
relevant_result = await relevance_evaluator(relevant_text, reference=reference)
irrelevant_result = await relevance_evaluator(irrelevant_text, reference=reference)
print(f"Relevant: {relevant_result.score.value}") # Output: "excellent"
print(f"Irrelevant: {irrelevant_result.score.value}") # Output: "poor"
Best for: Product descriptions, feature highlights, content curation where focus is important.
Truthfulness Evaluator¶
Purpose: Evaluates factual correctness against reference material.
from draive.evaluators import truthfulness_evaluator
reference = "Water boils at 100°C (212°F) at sea level atmospheric pressure."
truthful_text = "Water reaches its boiling point at 100 degrees Celsius or 212 degrees Fahrenheit under standard atmospheric pressure."
false_text = "Water boils at 90°C (194°F) under normal conditions."
truthful_result = await truthfulness_evaluator(truthful_text, reference=reference)
false_result = await truthfulness_evaluator(false_text, reference=reference)
print(f"Truthful: {truthful_result.score.value}") # Output: "perfect"
print(f"False: {false_result.score.value}") # Output: "poor"
Best for: Educational content, scientific explanations, fact verification against known sources.
User-Focused Evaluators¶
Helpfulness Evaluator¶
Purpose: Assesses how effectively content addresses user needs and questions.
from draive.evaluators import helpfulness_evaluator
user_query = "How do I reset my password on the mobile app?"
helpful_response = """
To reset your password on the mobile app:
1. Open the app and tap 'Forgot Password' on the login screen
2. Enter your email address
3. Check your email for a reset link
4. Follow the link and create a new password
5. Return to the app and log in with your new password
If you don't receive the email, check your spam folder or contact support.
"""
unhelpful_response = "You can reset your password through the app. Check the settings or contact support if needed."
helpful_result = await helpfulness_evaluator.with_threshold("excellent")(helpful_response, user_query=user_query)
unhelpful_result = await helpfulness_evaluator.with_threshold("excellent")(unhelpful_response, user_query=user_query)
print(f"Helpful: {helpful_result.score.value} ({'✓' if helpful_result.passed else '✗'})") # Output: "excellent ✓"
print(f"Unhelpful: {unhelpful_result.score.value} ({'✓' if unhelpful_result.passed else '✗'})") # Output: "fair ✗"
Best for: Customer support responses, FAQ answers, instructional content, chatbot responses.
Completeness Evaluator¶
Purpose: Ensures all aspects of a user query are fully addressed.
from draive.evaluators import completeness_evaluator
user_query = "What are the system requirements, pricing, and supported platforms for your software?"
complete_response = """
System Requirements:
- Windows 10+ or macOS 10.15+
- 8GB RAM minimum, 16GB recommended
- 2GB free disk space
Pricing:
- Basic plan: $19/month
- Professional: $49/month
- Enterprise: Custom pricing
Supported Platforms:
- Windows, macOS, Linux
- Mobile: iOS and Android apps
- Web browser access available
"""
incomplete_response = "Our software runs on Windows and Mac. We have different pricing plans available including Basic and Professional options."
complete_result = await completeness_evaluator(complete_response, user_query=user_query)
incomplete_result = await completeness_evaluator(incomplete_response, user_query=user_query)
print(f"Complete: {complete_result.score.value}") # Output: "excellent"
print(f"Incomplete: {incomplete_result.score.value}") # Output: "fair"
Best for: Multi-part questions, complex inquiries, ensuring comprehensive responses.
Expectations Evaluator¶
Purpose: Measures fulfillment of specific defined criteria or expectations.
from draive.evaluators import expectations_evaluator
expectations = """
Blog post requirements:
- Length: 800-1200 words
- Include at least 3 actionable tips
- Add relevant examples or case studies
- Maintain professional but conversational tone
- Include clear introduction and conclusion
"""
content = """
# Effective Remote Work Strategies
Working remotely has become the norm for many professionals. Here's how to make it work for you.
## Introduction
The shift to remote work presents unique challenges and opportunities...
## 3 Key Tips for Success
1. Create a dedicated workspace...
2. Establish clear boundaries...
3. Invest in reliable technology...
## Real-World Example
Sarah, a marketing manager, transformed her spare room into a productive office space...
## Conclusion
Remote work success depends on intentional planning and consistent habits...
[Word count: 950 words]
"""
result = await expectations_evaluator.with_threshold("excellent")(
content,
expectations=expectations,
guidelines="Evaluate each requirement separately and provide specific feedback"
)
print(f"Expectations met: {result.score.value} ({'✓' if result.passed else '✗'})") # Output: "excellent ✓"
Best for: Content creation guidelines, project requirements, quality assurance checklists.
Safety & Content Evaluators¶
Safety Evaluator¶
Purpose: Detects harmful, dangerous, or inappropriate content.
from draive.evaluators import safety_evaluator
safe_content = "Here's a healthy recipe for vegetable soup with nutritious ingredients."
unsafe_content = "Instructions for creating dangerous substances that could harm people."
safe_result = await safety_evaluator.with_threshold("perfect")(
safe_content,
guidelines="Focus on potential physical, emotional, or social harm"
)
unsafe_result = await safety_evaluator.with_threshold("perfect")(unsafe_content)
print(f"Safe content: {safe_result.score.value} ({'✓' if safe_result.passed else '✗'})") # Output: "perfect ✓"
print(f"Unsafe content: {unsafe_result.score.value} ({'✓' if unsafe_result.passed else '✗'})") # Output: "poor ✗"
Best for: Content moderation, user-generated content, ensuring platform safety guidelines compliance.
Factual Accuracy Evaluator¶
Purpose: Verifies factual correctness using general knowledge (no reference required).
from draive.evaluators import factual_accuracy_evaluator
factual_content = "The capital of France is Paris, and it's located in Western Europe."
inaccurate_content = "The capital of France is Lyon, and it's the largest city in Eastern Europe."
factual_result = await factual_accuracy_evaluator(factual_content)
inaccurate_result = await factual_accuracy_evaluator(inaccurate_content)
print(f"Factual: {factual_result.score.value}") # Output: "perfect"
print(f"Inaccurate: {inaccurate_result.score.value}") # Output: "poor"
Best for: Educational content, general knowledge verification, fact-checking without specific sources.
Tone/Style Evaluator¶
Purpose: Evaluates alignment with expected tone and writing style.
from draive.evaluators import tone_style_evaluator
expected_tone = """
Professional but approachable tone
- Use active voice
- Avoid jargon and technical terms
- Include empathetic language
- Maintain confident but humble stance
"""
appropriate_content = "We understand this situation can be frustrating. Let's work together to find the best solution for your needs."
inappropriate_content = "Your complaint has been logged. Our technical team will process your request according to standard protocols."
appropriate_result = await tone_style_evaluator(
appropriate_content,
expected_tone_style=expected_tone
)
inappropriate_result = await tone_style_evaluator(
inappropriate_content,
expected_tone_style=expected_tone
)
print(f"Appropriate tone: {appropriate_result.score.value}") # Output: "excellent"
print(f"Inappropriate tone: {inappropriate_result.score.value}") # Output: "fair"
Best for: Brand voice consistency, customer communications, content matching specific style guides.
Creativity Evaluator¶
Purpose: Measures originality, novelty, and innovative thinking.
from draive.evaluators import creativity_evaluator
generic_content = "Our product is the best solution for your business needs. It offers great features and excellent value."
creative_content = """
Imagine your business as a garden. Traditional solutions are like using a watering can – they get the job done, but require constant attention. Our platform is like installing a smart irrigation system that learns your garden's unique needs, adapts to weather patterns, and grows more efficient over time.
"""
generic_result = await creativity_evaluator(generic_content)
creative_result = await creativity_evaluator(
creative_content,
guidelines="Evaluate originality of metaphors, unique perspectives, and innovative presentation"
)
print(f"Generic: {generic_result.score.value}") # Output: "poor"
print(f"Creative: {creative_result.score.value}") # Output: "excellent"
Best for: Marketing copy, creative writing, brainstorming content, innovative problem-solving explanations.
Similarity Evaluators¶
Semantic Similarity Evaluator¶
Purpose: Measures semantic similarity between two pieces of content using LLM evaluation.
from draive.evaluators import similarity_evaluator
reference_text = "Machine learning algorithms require large datasets to train effectively."
similar_text = "AI models need substantial amounts of data for proper training."
dissimilar_text = "Weather forecasting helps predict tomorrow's temperature."
similar_result = await similarity_evaluator(similar_text, reference=reference_text)
dissimilar_result = await similarity_evaluator(dissimilar_text, reference=reference_text)
print(f"Similar content: {similar_result.score.value}") # Output: "perfect"
print(f"Dissimilar content: {dissimilar_result.score.value}") # Output: "poor"
Best for: Duplicate detection, paraphrasing evaluation, content matching.
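To use semantic similarity as a duplicate gate, the same with_threshold pattern applies; a sketch, with the threshold level chosen only for illustration:
# Treat a candidate as a near-duplicate when similarity passes the threshold.
duplicate_check = similarity_evaluator.with_threshold("excellent")

candidate = "AI models need substantial amounts of data for proper training."
result = await duplicate_check(candidate, reference=reference_text)
if result.passed:
    print("Likely duplicate or close paraphrase of the reference")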
Vector Similarity Evaluators¶
Purpose: Calculate mathematical similarity using embedding vectors.
from draive.evaluators import text_vector_similarity_evaluator, image_vector_similarity_evaluator
# Text vector similarity
text1 = "Natural language processing"
text2 = "NLP and text analysis"
text_similarity = await text_vector_similarity_evaluator(text2, reference=text1)
print(f"Text similarity: {text_similarity:.3f}") # Output: 0.842
# Image vector similarity (requires image data)
with open("image1.jpg", "rb") as f:
image1_data = f.read()
with open("image2.jpg", "rb") as f:
image2_data = f.read()
image_similarity = await image_vector_similarity_evaluator(image2_data, reference=image1_data)
print(f"Image similarity: {image_similarity:.3f}") # Output: 0.756
Best for: Recommendation systems, content deduplication, similarity search, clustering.
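Because the vector evaluators return a plain numeric score (as the example above suggests), they compose naturally with ordinary sorting for retrieval-style tasks. A sketch under that assumption, with made-up candidate strings:
query = "natural language processing techniques"
candidates = [
    "Introduction to NLP and text analysis",
    "Guide to image classification models",
    "Tokenization and parsing basics",
]

# Score each candidate against the query and rank highest-first.
scored = [
    (candidate, await text_vector_similarity_evaluator(candidate, reference=query))
    for candidate in candidates
]
for candidate, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {candidate}")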
Utility Evaluators¶
Keyword Evaluators¶
Purpose: Rule-based checking for required or forbidden keywords.
from draive.evaluators import required_keywords_evaluator, forbidden_keywords_evaluator
content = "Our AI-powered solution uses machine learning algorithms to analyze customer data."
# Check for required keywords
required_result = await required_keywords_evaluator(
content,
keywords=["AI", "machine learning", "customer"],
require_all=True # All keywords must be present
)
# Check for forbidden keywords
forbidden_result = await forbidden_keywords_evaluator(
content,
keywords=["hack", "exploit", "unauthorized"],
require_none=True # None of these should be present
)
print(f"Required keywords found: {required_result.score.value}") # Output: 1.0
print(f"Forbidden keywords absent: {forbidden_result.score.value}") # Output: 1.0
# Partial matching with custom normalization
def custom_normalize(text):
    return text.lower().replace("-", " ")
partial_result = await required_keywords_evaluator(
content,
keywords=["AI", "machine learning", "analytics", "blockchain"],
require_all=False, # Allow partial matches
normalization=custom_normalize
)
print(f"Partial keyword score: {partial_result.score.value}") # Output: 0.75 (3 out of 4 keywords)
Best for: Content compliance, SEO requirements, content filtering, policy enforcement.
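A common pattern is to combine both keyword checks into a single hard compliance gate; a sketch using the with_threshold pattern shown elsewhere in this guide (threshold levels here are illustrative):
# Both checks must pass for the content to clear the keyword policy.
required_check = required_keywords_evaluator.with_threshold("perfect")
forbidden_check = forbidden_keywords_evaluator.with_threshold("perfect")

required = await required_check(content, keywords=["AI", "machine learning"])
forbidden = await forbidden_check(content, keywords=["hack", "exploit", "unauthorized"])

if required.passed and forbidden.passed:
    print("Content passes keyword compliance")
else:
    print("Content blocked by keyword policy")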
Real-World Evaluation Scenarios¶
Content Marketing Evaluation¶
from collections.abc import Sequence
from draive.evaluation import evaluate, evaluator_scenario, EvaluatorResult
from draive.evaluators import *
@evaluator_scenario(name="marketing_content_quality")
async def evaluate_marketing_content(
content: str,
brand_guidelines: str,
target_keywords: list[str]
) -> Sequence[EvaluatorResult]:
"""Comprehensive marketing content evaluation."""
return await evaluate(
content,
# Quality checks
fluency_evaluator.prepared(),
readability_evaluator.prepared(),
creativity_evaluator.prepared(),
# Brand compliance
tone_style_evaluator.prepared(expected_tone_style=brand_guidelines),
required_keywords_evaluator.prepared(keywords=target_keywords, require_all=False),
# Safety and accuracy
safety_evaluator.prepared(),
factual_accuracy_evaluator.prepared(),
concurrent_tasks=3 # Control concurrency
)
# Usage example
marketing_copy = "Discover our revolutionary AI platform that transforms how businesses connect with customers..."
brand_guide = """
Tone: Professional yet friendly
Voice: Confident and helpful
Style: Clear, benefit-focused language
Avoid: Technical jargon, superlatives without proof
"""
result = await evaluate_marketing_content(
marketing_copy,
brand_guide,
["AI", "platform", "business", "customers"]
)
print(f"Marketing content passed: {result.passed}")
for eval_result in result.evaluations:
    print(f"- {eval_result.evaluator}: {eval_result.score.value}")
Customer Support Response Evaluation¶
@evaluator_scenario(name="support_response_quality")
async def evaluate_support_response(
response: str,
customer_query: str,
company_policy: str
) -> Sequence[EvaluatorResult]:
"""Evaluate customer support response quality."""
return await evaluate(
response,
# User focus
helpfulness_evaluator.prepared(user_query=customer_query),
completeness_evaluator.prepared(user_query=customer_query),
# Quality and safety
fluency_evaluator.prepared(),
safety_evaluator.prepared(),
# Policy compliance
consistency_evaluator.prepared(reference=company_policy),
tone_style_evaluator.prepared(expected_tone_style="Professional, empathetic, solution-focused"),
concurrent_tasks=2
)
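Usage mirrors the marketing scenario above; a sketch with placeholder inputs:
support_result = await evaluate_support_response(
    "To reset your password, open Settings > Security > Reset Password and follow the emailed link.",
    customer_query="How do I reset my password?",
    company_policy="Always provide step-by-step instructions and offer escalation to a human agent.",
)
print(f"Support response passed: {support_result.passed}")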
Academic Content Evaluation¶
@evaluator_scenario(name="academic_content_review")
async def evaluate_academic_content(
content: str,
source_material: str,
academic_standards: str
) -> Sequence[EvaluatorResult]:
"""Evaluate academic content against standards."""
return await evaluate(
content,
# Accuracy and grounding
factual_accuracy_evaluator.prepared(),
groundedness_evaluator.prepared(reference=source_material),
consistency_evaluator.prepared(reference=source_material),
# Academic quality
coherence_evaluator.prepared(reference=source_material),
coverage_evaluator.prepared(reference=source_material),
# Standards compliance
expectations_evaluator.prepared(expectations=academic_standards),
concurrent_tasks=3
)
Best Practices¶
1. Choose Appropriate Evaluators and Thresholds¶
# For user-facing content - prioritize user experience and safety
user_focused_evaluators = [
helpfulness_evaluator.with_threshold("excellent"), # High bar for user satisfaction
completeness_evaluator.with_threshold("good"), # Moderate - some flexibility
safety_evaluator.with_threshold("perfect"), # Critical - no compromise
readability_evaluator.with_threshold("good") # Moderate - depends on audience
]
# For content with source material - accuracy is paramount
reference_based_evaluators = [
coverage_evaluator.with_threshold("excellent"), # High - ensure key points covered
consistency_evaluator.with_threshold("perfect"), # Critical - no contradictions
groundedness_evaluator.with_threshold("excellent"), # High - must cite sources
truthfulness_evaluator.with_threshold("excellent") # High - accuracy matters
]
# For creative content - balance creativity with quality
creative_evaluators = [
creativity_evaluator.with_threshold("good"), # Moderate - allow variety
tone_style_evaluator.with_threshold("excellent"), # High - brand consistency
fluency_evaluator.with_threshold("excellent") # High - basic quality requirement
]
2. Threshold Selection Strategy¶
Choose thresholds based on business impact and user consequences:
# PERFECT (0.9) - Zero tolerance areas
safety_evaluator.with_threshold("perfect") # User safety
forbidden_keywords_evaluator.with_threshold("perfect") # Compliance
consistency_evaluator.with_threshold("perfect") # No contradictions
# EXCELLENT (0.7) - High quality requirements
helpfulness_evaluator.with_threshold("excellent") # User satisfaction
factual_accuracy_evaluator.with_threshold("excellent") # Information quality
tone_style_evaluator.with_threshold("excellent") # Brand consistency
# GOOD (0.5) - Balanced quality standards
completeness_evaluator.with_threshold("good") # Reasonable coverage
creativity_evaluator.with_threshold("good") # Allow variety
readability_evaluator.with_threshold("good") # Accessible but flexible
# FAIR (0.3) - Minimum acceptable standards
similarity_evaluator.with_threshold("fair") # Loose matching
required_keywords_evaluator.with_threshold("fair") # Flexible keyword matching
3. Use Concurrent Evaluation for Performance¶
from draive.evaluation import evaluate
# Run independent evaluators concurrently
results = await evaluate(
content,
safety_evaluator.prepared(),
factual_accuracy_evaluator.prepared(),
creativity_evaluator.prepared(),
fluency_evaluator.prepared(),
concurrent_tasks=2 # Limit concurrent tasks to avoid rate limits
)
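When the prepared evaluators carry thresholds (via with_threshold), each result exposes passed, so downstream steps can be gated on the combined outcome. A minimal sketch under that assumption (publish is a hypothetical downstream step):
# Accept the content only if every evaluator passed its threshold.
if all(result.passed for result in results):
    publish(content)  # hypothetical downstream step
else:
    for result in results:
        if not result.passed:
            print(f"Failed check: {result.evaluator} -> {result.score.value}")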
4. Provide Context with Guidelines¶
# Specific evaluation context improves accuracy
result = await tone_style_evaluator(
content,
expected_tone_style=expected_style,
guidelines="""
Focus on:
- Formality level appropriate for B2B audience
- Use of inclusive language
- Consistency with brand voice
- Professional yet approachable tone
"""
)
Summary¶
The Draive evaluator catalog provides comprehensive coverage for evaluating LLM outputs across all major dimensions:
- 21 specialized evaluators covering quality, safety, user needs, and content requirements
- Consistent 5-point scoring with meaningful categorical levels
- Flexible composition allowing complex evaluation scenarios
- Real-world optimized with guidelines support and concurrent execution
Choose evaluators based on your specific use case, combine them in scenarios for comprehensive testing, and use appropriate thresholds based on the criticality of each evaluation dimension.