How to evaluate Multi-Agent AI systems with OpenAI SDK

AI agents are everywhere - from customer support to content moderation. However, the transition from demo to production often reveals critical performance gaps. Drawing from experience with AI agents that excel in demos but falter in real-world applications, this post addresses the crucial need for comprehensive evaluation frameworks that pinpoint specific failure areas.

We'll explore advanced evaluation techniques using the OpenAI Agents SDK and Pydantic Evals, focusing on metadata-driven analysis for unprecedented performance insights.

Our goal is to construct a robust customer support system featuring intelligent query routing, multi-level performance evaluation, scalable architecture, and precise diagnostic capabilities. This framework, applicable beyond customer support, will equip you with the tools to build resilient multi-agent systems ready for real-world deployment.

Request flow:

Customer query
Triage agent
Handoff to specialist
E-commerce agent
OR
Airline agent
Final answer + Performance data

Steps:

  1. Customer query: The process starts with a customer submitting a query.

  2. Triage agent: The query is received by the Triage agent, which determines where to send it.

  3. Handoff to specialist: The Triage agent transfers the query to the appropriate specialist agent.

  4. E-commerce agent OR Airline agent: Depending on the nature of the query, it's handled by either the E-commerce agent or the Airline agent.

  5. Final answer + Performance data: The specialist agent provides a final answer, and performance data is collected for evaluation.

Before we dive in and get our hands dirty, let's understand make a quick comparison of Single Agent and Multi Agent Systems, detailing their respective characteristics, advantages, and limitations across various operational aspects.

Comparison: Single Agent vs Multi Agent

Aspect Single Agent system Multi-Agent system
Context management Maintains full conversation history Context must be shared between agents
Processing speed One task at a time Multiple tasks simultaneously
Token costs ~4x baseline token usage ~15x baseline token usage
System reliability Consistent, deterministic outputs Unpredictable agent interactions
Troubleshooting Clear execution path to follow Multiple failure points to investigate
Agent coordination No coordination overhead Requires careful orchestration
Main advantage Coherent reasoning & dependable results Speed through specialization & parallelism
Biggest limitation Token limits & slower throughput Lost context & coordination overhead
Ideal use cases Deep analysis, content creation, code refactoring Data gathering, research, distributed processing
Examples Writing technical documentation, debugging complex issues Competitive analysis, scraping company data at scale

Building your agent fleet

Let's dive in by starting with domain experts

1. Specialist agents

Create specialized agents for different domains to handle specific types of queries.

# E-commerce Agent - knows online shopping inside out
ecommerce_agent = Agent(
    name="EcommerceAgent",
    instructions="""
    You are a customer support agent for an e-commerce platform that classifies user queries into one of the following categories:
    - Refunds: Questions about returning products, getting refunds, or return policies
    - Informational: General questions about products, policies, or company information
    - Shipping: Questions about order tracking, delivery, shipping costs, or delivery issues
    - Account: Questions about user accounts, passwords, login issues, or profile management
    
    Respond with ONLY the category name that best matches the user's query.
    """
)

# Airline Agent - aviation specialist  
airline_agent = Agent(name="AirlineAgent", instructions="...")

2. Individual agent functions

Develop separate evaluation functions for each agent to process and classify queries within their expertise.

async def classify_ecommerce_query(question: str) -> str:
    """Runs the ecommerce agent for classification"""
    result = await Runner.run(ecommerce_agent, question)
    return result.final_output.strip()

async def classify_airline_query(question: str) -> str:
    """Runs the airline agent for classification"""
    result = await Runner.run(airline_agent, question)
    return result.final_output.strip()

3. Individual agent datasets

Construct datasets tailored to each specialist for targeted testing and evaluation.

# E-commerce Dataset for Individual Testing
ecommerce_cases = [
    Case(
        name='ecommerce_refund_query',
        inputs='How do I return a product?',
        expected_output='Refunds'
    ),
    Case(
        name='ecommerce_informational_query',
        inputs='What is your return policy?',
        expected_output='Informational'
    ),
    # ... more test cases
]

ecommerce_dataset = Dataset(cases=ecommerce_cases)
ecommerce_dataset.add_evaluator(IsExactMatch())

4. Custom evaluator

Implement a custom evaluator to assess the accuracy of agent responses against expected outputs.

from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class IsExactMatch(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:  
        if ctx.output == ctx.expected_output:
            return 1.0  # Perfect match
        return 0.0

5. The Triage agent (The magic happens here)

Design a central triage agent to route incoming queries to the appropriate specialist agents.

# Triage Agent - decides where queries go
triage_agent = Agent(
    name="TriageAgent",
    instructions="""
    You are a customer support triage agent that determines whether a customer query should be handled by:
    1. E-commerce support team (for online shopping, returns, shipping, account issues)
    2. Airline support team (for flights, bookings, seats, baggage, check-in)
    
    Based on the user's query, determine which team should handle it and transfer to the appropriate specialist.
    Do not attempt to answer the query yourself - always hand off to the appropriate specialist.
    """,
    handoffs=[ecommerce_agent, airline_agent]  # This enables automatic handoffs
)

Why handoffs are better than manual routing:

  • Context gets preserved automatically between agents
  • Built-in error handling for failed transfers
  • No weird conversation breaks or lost context
  • Production-ready reliability out of the box

6. Triage system function

Create a function to manage the entire triage process, from initial query to final classification.

async def run_triage_system(question: str) -> str:
    """
    Runs the triage agent which will handoff to the appropriate specialist.
    Returns the final classification from the specialist agent.
    """
    result = await Runner.run(triage_agent, question)
    return result.final_output.strip()

The metadata game changer

This changed everything for me: Most teams can't figure out WHY their agents fail. Metadata fixes that.

Here's the breakthrough - I use pydantic_evals metadata to track routing AND classification performance at the same time:

# Complete Combined dataset with metadata
combined_cases = [
    # E-commerce queries
    Case(
        name='triage_refund_query',
        inputs='How do I return a product?',
        expected_output='Refunds',
        metadata={'domain': 'ecommerce', 'category': 'refunds'}
    ),
    Case(
        name='triage_informational_query',
        inputs='What is your return policy?',
        expected_output='Informational',
        metadata={'domain': 'ecommerce', 'category': 'informational'}
    ),
    Case(
        name='triage_shipping_query',
        inputs='How do I track my order?',
        expected_output='Shipping',
        metadata={'domain': 'ecommerce', 'category': 'shipping'}
    ),
    Case(
        name='triage_account_query',
        inputs='How do I change my password?',
        expected_output='Account',
        metadata={'domain': 'ecommerce', 'category': 'account'}
    ),
    # Airline queries
    Case(
        name='triage_faq_query',
        inputs='How much weight can I carry in baggage?',
        expected_output='FAQ',
        metadata={'domain': 'airline', 'category': 'faq'}
    ),
    Case(
        name='triage_booking_query',
        inputs='I want to book a flight from London to Bangalore',
        expected_output='Booking',
        metadata={'domain': 'airline', 'category': 'booking'}
    ),
    Case(
        name='triage_seat_change_query',
        inputs='Can I change my seat to a window seat?',
        expected_output='Seat change',
        metadata={'domain': 'airline', 'category': 'seat_change'}
    ),
    Case(
        name='triage_flight_change_query',
        inputs='I need to change my flight to next week',
        expected_output='Flight change',
        metadata={'domain': 'airline', 'category': 'flight_change'}
    ),
]

combined_dataset = Dataset(cases=combined_cases)
combined_dataset.add_evaluator(IsExactMatch())

Why this transforms everything

The metadata enables two-level performance analysis:

  1. Routing Level: Did the triage agent choose the right specialist?

    • domain: 'ecommerce' vs domain: 'airline'
  2. Classification Level: Did the specialist categorize correctly within their domain?

    • category: 'refunds', 'shipping', 'faq', 'booking', etc.

Before metadata: "System is 75% accurate" (completely useless for debugging)
After metadata: "Routing is 95% accurate, but ecommerce refund classification needs work" (actionable insights!)

See everything in action

# Get the evaluation report
report = await combined_dataset.evaluate(run_triage_system)

# Show metadata in the results table - instant insights!
report.print(include_metadata=True)

This produces output that actually helps you debug:

Case ID Metadata Scores
triage_refund_query IsExactMatch: 1.00
triage_faq_query IsExactMatch: 1.00

📊 This is what separates toys from production systems.

My complete evaluation process

Step 1: Test individual agents first (Baseline performance)

Establish baseline performance for each specialist agent before integrating them into the triage system.

# Test E-commerce Agent individually
print("Testing E-commerce Agent...")
ecommerce_report = await ecommerce_dataset.evaluate(classify_ecommerce_query)
print(ecommerce_report)

# Test Airline Agent individually  
print("Testing Airline Agent...")
airline_report = await airline_dataset.evaluate(classify_airline_query)
print(airline_report)

This tells me if my specialists are working correctly before I add the complexity of routing.

Step 2: Test the full triage pipeline

Evaluate the complete multi-agent system to assess end-to-end performance, including routing accuracy.

# Test the complete triage system with mixed domain queries
print("Testing Complete Triage System...")
combined_report = await combined_dataset.evaluate(run_triage_system)
combined_report.print(include_metadata=True)

This shows me end-to-end performance including routing accuracy.

Step 3: Analyze performance by Domain

Utilize metadata to perform in-depth analysis of performance patterns across different domains and categories.

# Analyze performance distribution by domain
ecommerce_cases = [c for c in combined_cases if c.metadata.get('domain') == 'ecommerce']
airline_cases = [c for c in combined_cases if c.metadata.get('domain') == 'airline']

print(f"E-commerce test cases: {len(ecommerce_cases)}")
print(f"Airline test cases: {len(airline_cases)}")

# You can also analyze by category within each domain
ecommerce_categories = set(c.metadata.get('category') for c in ecommerce_cases)
airline_categories = set(c.metadata.get('category') for c in airline_cases)

print(f"E-commerce categories covered: {ecommerce_categories}")
print(f"Airline categories covered: {airline_categories}")

Conclusion

In this blog post, we demonstrated a robust multi-agent triage system for handling diverse customer queries across different domains. We showcased how to set up specialist agents, implement a triage mechanism, and utilize metadata for in-depth performance analysis. This approach provides a scalable framework that can be easily adapted to different industries or expanded to include additional domains and query types.