AI agents are everywhere - from customer support to content moderation. However, the transition from demo to production often reveals critical performance gaps. Drawing from experience with AI agents that excel in demos but falter in real-world applications, this post addresses the crucial need for comprehensive evaluation frameworks that pinpoint specific failure areas.
We'll explore advanced evaluation techniques using the OpenAI Agents SDK and Pydantic Evals, focusing on metadata-driven analysis for fine-grained performance insights.
Our goal is to construct a robust customer support system featuring intelligent query routing, multi-level performance evaluation, scalable architecture, and precise diagnostic capabilities. This framework, applicable beyond customer support, will equip you with the tools to build resilient multi-agent systems ready for real-world deployment.
Request flow:
Steps:
- Customer query: The process starts with a customer submitting a query.
- Triage agent: The query is received by the Triage agent, which determines where to send it.
- Handoff to specialist: The Triage agent transfers the query to the appropriate specialist agent.
- E-commerce agent OR Airline agent: Depending on the nature of the query, it's handled by either the E-commerce agent or the Airline agent.
- Final answer + Performance data: The specialist agent provides a final answer, and performance data is collected for evaluation.
Before we dive in and get our hands dirty, let's make a quick comparison of single-agent and multi-agent systems, looking at their respective characteristics, advantages, and limitations across various operational aspects.
Comparison: Single Agent vs Multi Agent
Aspect | Single Agent system | Multi-Agent system |
---|---|---|
Context management | Maintains full conversation history | Context must be shared between agents |
Processing speed | One task at a time | Multiple tasks simultaneously |
Token costs | ~4x baseline token usage | ~15x baseline token usage |
System reliability | Consistent, deterministic outputs | Unpredictable agent interactions |
Troubleshooting | Clear execution path to follow | Multiple failure points to investigate |
Agent coordination | No coordination overhead | Requires careful orchestration |
Main advantage | Coherent reasoning & dependable results | Speed through specialization & parallelism |
Biggest limitation | Token limits & slower throughput | Lost context & coordination overhead |
Ideal use cases | Deep analysis, content creation, code refactoring | Data gathering, research, distributed processing |
Examples | Writing technical documentation, debugging complex issues | Competitive analysis, scraping company data at scale |
Building your agent fleet
Let's dive in, starting with the domain experts.
1. Specialist agents
Create specialized agents for different domains to handle specific types of queries.
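For a concrete starting point, here's a minimal sketch of two specialists built with the OpenAI Agents SDK. The instructions and category labels are illustrative assumptions; tune them to your own domains:

```python
from agents import Agent

# Illustrative specialists; instructions and category labels are assumptions.
ecommerce_agent = Agent(
    name="E-commerce Agent",
    instructions=(
        "You classify e-commerce support queries. "
        "Reply with exactly one category: Refunds, Shipping, or Informational."
    ),
)

airline_agent = Agent(
    name="Airline Agent",
    instructions=(
        "You classify airline support queries. "
        "Reply with exactly one category: Booking, Baggage, or Informational."
    ),
)
```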
2. Individual agent functions
Develop separate evaluation functions for each agent to process and classify queries within their expertise.
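As a sketch, assuming the specialist agents above, each evaluation function can be a thin async wrapper that runs one agent and returns its classification. These are the classify_ecommerce_query and classify_airline_query functions referenced in the evaluation steps later:

```python
from agents import Runner

async def classify_ecommerce_query(question: str) -> str:
    """Run only the e-commerce specialist and return its category label."""
    result = await Runner.run(ecommerce_agent, question)
    return result.final_output.strip()

async def classify_airline_query(question: str) -> str:
    """Run only the airline specialist and return its category label."""
    result = await Runner.run(airline_agent, question)
    return result.final_output.strip()
```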
3. Individual agent datasets
Construct datasets tailored to each specialist for targeted testing and evaluation.
```python
# E-commerce Dataset for Individual Testing
from pydantic_evals import Case, Dataset

ecommerce_cases = [
    Case(
        name='ecommerce_refund_query',
        inputs='How do I return a product?',
        expected_output='Refunds'
    ),
    Case(
        name='ecommerce_informational_query',
        inputs='What is your return policy?',
        expected_output='Informational'
    ),
    # ... more test cases
]

ecommerce_dataset = Dataset(cases=ecommerce_cases)
ecommerce_dataset.add_evaluator(IsExactMatch())  # custom evaluator, defined in the next step
```
4. Custom evaluator
Implement a custom evaluator to assess the accuracy of agent responses against expected outputs.
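One way to implement the IsExactMatch evaluator used above is as a small pydantic_evals Evaluator subclass. This sketch assumes a strict string comparison against the expected label is what you want:

```python
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class IsExactMatch(Evaluator):
    """Score 1.0 when the agent's output exactly matches the expected label, else 0.0."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        return 1.0 if ctx.output == ctx.expected_output else 0.0
```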
5. The Triage agent (The magic happens here)
Design a central triage agent to route incoming queries to the appropriate specialist agents.
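Here's a minimal sketch, assuming the two specialists defined earlier; the routing instructions are illustrative, and the handoffs parameter is what lets the SDK manage the transfer for you:

```python
from agents import Agent

triage_agent = Agent(
    name="Triage Agent",
    instructions=(
        "Decide which specialist should handle the customer's question. "
        "Hand off shopping, refund, and shipping questions to the E-commerce Agent; "
        "hand off flight, booking, and baggage questions to the Airline Agent."
    ),
    handoffs=[ecommerce_agent, airline_agent],  # the SDK handles the transfer, no manual routing code
)
```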
Why handoffs are better than manual routing:
- Context gets preserved automatically between agents
- Built-in error handling for failed transfers
- No weird conversation breaks or lost context
- Production-ready reliability out of the box
6. Triage system function
Create a function to manage the entire triage process, from initial query to final classification.
```python
from agents import Runner

async def run_triage_system(question: str) -> str:
    """
    Runs the triage agent, which hands off to the appropriate specialist.
    Returns the final classification from the specialist agent.
    """
    result = await Runner.run(triage_agent, question)
    return result.final_output.strip()
```
The metadata game changer
This changed everything for me: Most teams can't figure out WHY their agents fail. Metadata fixes that.
Here's the breakthrough - I use pydantic_evals metadata to track routing AND classification performance at the same time:
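Here's roughly what that looks like. The case names and metadata mirror the report shown later; the inputs and expected labels are illustrative:

```python
from pydantic_evals import Case, Dataset

combined_cases = [
    Case(
        name='triage_refund_query',
        inputs='How do I return a product?',               # illustrative input
        expected_output='Refunds',
        metadata={'domain': 'ecommerce', 'category': 'refunds'},
    ),
    Case(
        name='triage_faq_query',
        inputs='What is your checked baggage allowance?',  # illustrative input
        expected_output='Informational',
        metadata={'domain': 'airline', 'category': 'faq'},
    ),
    # ... more mixed-domain cases
]

combined_dataset = Dataset(cases=combined_cases)
combined_dataset.add_evaluator(IsExactMatch())
```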
Why this transforms everything
The metadata enables two-level performance analysis:
- Routing Level: Did the triage agent choose the right specialist? (domain: 'ecommerce' vs domain: 'airline')
- Classification Level: Did the specialist categorize correctly within their domain? (category: 'refunds', 'shipping', 'faq', 'booking', etc.)
Before metadata: "System is 75% accurate" (completely useless for debugging)
After metadata: "Routing is 95% accurate, but ecommerce refund classification needs work" (actionable insights!)
See everything in action
```python
# Get the evaluation report
report = await combined_dataset.evaluate(run_triage_system)

# Show metadata in the results table - instant insights!
report.print(include_metadata=True)
```
This produces output that actually helps you debug:
Case ID | Metadata | Scores |
---|---|---|
triage_refund_query | {'domain': 'ecommerce', 'category': 'refunds'} | IsExactMatch: 1.00 |
triage_faq_query | {'domain': 'airline', 'category': 'faq'} | IsExactMatch: 1.00 |
📊 This is what separates toys from production systems.
My complete evaluation process
Step 1: Test individual agents first (Baseline performance)
Establish baseline performance for each specialist agent before integrating them into the triage system.
```python
# Test E-commerce Agent individually
print("Testing E-commerce Agent...")
ecommerce_report = await ecommerce_dataset.evaluate(classify_ecommerce_query)
print(ecommerce_report)

# Test Airline Agent individually
print("Testing Airline Agent...")
airline_report = await airline_dataset.evaluate(classify_airline_query)
print(airline_report)
```
This tells me if my specialists are working correctly before I add the complexity of routing.
Step 2: Test the full triage pipeline
Evaluate the complete multi-agent system to assess end-to-end performance, including routing accuracy.
```python
# Test the complete triage system with mixed domain queries
print("Testing Complete Triage System...")
combined_report = await combined_dataset.evaluate(run_triage_system)
combined_report.print(include_metadata=True)
```
This shows me end-to-end performance including routing accuracy.
Step 3: Analyze performance by Domain
Utilize metadata to perform in-depth analysis of performance patterns across different domains and categories.
```python
# Analyze performance distribution by domain
ecommerce_cases = [c for c in combined_cases if c.metadata.get('domain') == 'ecommerce']
airline_cases = [c for c in combined_cases if c.metadata.get('domain') == 'airline']

print(f"E-commerce test cases: {len(ecommerce_cases)}")
print(f"Airline test cases: {len(airline_cases)}")

# You can also analyze by category within each domain
ecommerce_categories = set(c.metadata.get('category') for c in ecommerce_cases)
airline_categories = set(c.metadata.get('category') for c in airline_cases)

print(f"E-commerce categories covered: {ecommerce_categories}")
print(f"Airline categories covered: {airline_categories}")
```
Conclusion
In this blog post, we demonstrated a robust multi-agent triage system for handling diverse customer queries across different domains. We showcased how to set up specialist agents, implement a triage mechanism, and utilize metadata for in-depth performance analysis. This approach provides a scalable framework that can be easily adapted to different industries or expanded to include additional domains and query types.