Eval cho agent: trace, replay, golden set, regression

Tháng đầu tiên agent lên production, team tôi không có eval. Khi có bug, chúng tôi reproduce bằng cách kể lại câu chuyện: “hình như user gõ đại loại như này, rồi agent làm cái gì đó sai”. Không ai biết chính xác LLM đã call tool gì, với args nào, nhận về response gì. Chúng tôi fix bằng cảm tính, push, và hy vọng.

Sau đó một sprint, con số pass rate hở ra khi chúng tôi mới bắt đầu đo: 71%.

Ba tháng sau, với trace logging, replay, golden set 80 task, và regression suite chạy trong CI: 89%, và chúng tôi biết chính xác cái gì còn thiếu để lên 92%.

Bài này là ghi lại hành trình đó.

Vì sao eval agent khó hơn eval LLM thường

Nếu bạn đã đọc LLM series bài 30 về evaluation, bạn biết eval LLM đơn giản về mặt cơ học: cho model một câu hỏi, so output với ground truth, tính accuracy. Deterministic, reproducible, dễ automate.

Agent khác ở ba điểm:

1. Nondeterministic. Cùng một input, agent có thể đi hai đường khác nhau nhưng cả hai đều ra output đúng. Path A dùng 3 tool calls, Path B dùng 5. Kết quả giống nhau nhưng token dùng khác, latency khác. Eval không thể chỉ so output cuối.

2. Multi-step. Mỗi step LLM có thể đúng nhưng cả chain sai. Tool call 1 trả về response đúng, tool call 2 dùng output của tool call 1 nhưng parse sai, tool call 3 nhận rác vào, output cuối sai. Nếu chỉ nhìn output cuối, bạn không biết lỗi ở đâu.

3. Tool calls có side effect. Eval LLM là read-only: hỏi, đọc answer. Eval agent mà không kiểm soát thì agent sẽ gọi send_email, delete_record, create_issue với test data. Phải mock tool layer trước.

Đây là lý do tại sao eval agent cần infrastructure riêng, không thể dùng lại harness eval LLM đơn giản.

Cũng cần phân biệt rõ: bài này nói về eval, không phải self-reflection. Self-reflection là agent tự nhìn lại bước vừa làm và quyết định có cần retry không, đó là runtime behavior của agent (xem bài 9 về self-reflection). Eval là offline measurement: bạn chạy agent trên tập test, đo quality, quyết định deploy hay không.

Kỹ thuật 1: trace logging

Trace là bản ghi đầy đủ mọi thứ xảy ra trong một lần chạy agent. Không phải log thông thường (chỉ có error), không phải metrics (chỉ có số), mà là cấu trúc dữ liệu có thể replay lại được.

Một trace tối thiểu có cấu trúc sau:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any
import uuid
import json


@dataclass
class ToolCall:
    tool_name: str
    input: dict
    output: Any
    error: str | None
    duration_ms: float


@dataclass
class LLMCall:
    messages_in: list[dict]
    response_out: list[dict]
    stop_reason: str
    input_tokens: int
    output_tokens: int
    duration_ms: float


@dataclass
class AgentTrace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    task_input: str = ""
    final_output: str = ""
    success: bool = False
    steps: list[LLMCall | ToolCall] = field(default_factory=list)
    total_tokens: int = 0
    total_duration_ms: float = 0.0
    created_at: str = field(default_factory=lambda: datetime.now().isoformat())
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(self.__dict__, default=str, indent=2)

Cách tích hợp trace vào agent loop bằng decorator:

import time
from functools import wraps


def traced_agent(func):
    """Decorator bọc agent function để ghi trace."""
    @wraps(func)
    def wrapper(task_input: str, trace_store: list | None = None, **kwargs):
        trace = AgentTrace(task_input=task_input)
        start = time.time()

        # Inject trace vào kwargs để agent loop sử dụng
        result = func(task_input, trace=trace, **kwargs)

        trace.final_output = result or ""
        trace.total_duration_ms = (time.time() - start) * 1000
        trace.total_tokens = sum(
            s.input_tokens + s.output_tokens
            for s in trace.steps
            if isinstance(s, LLMCall)
        )

        if trace_store is not None:
            trace_store.append(trace)

        return result
    return wrapper


def record_llm_call(trace: AgentTrace, messages_in, response, duration_ms):
    """Ghi một LLM call vào trace."""
    trace.steps.append(LLMCall(
        messages_in=messages_in,
        response_out=[b.__dict__ for b in response.content],
        stop_reason=response.stop_reason,
        input_tokens=response.usage.input_tokens,
        output_tokens=response.usage.output_tokens,
        duration_ms=duration_ms,
    ))


def record_tool_call(trace: AgentTrace, tool_name, tool_input, output, error, duration_ms):
    """Ghi một tool call vào trace."""
    trace.steps.append(ToolCall(
        tool_name=tool_name,
        input=tool_input,
        output=output,
        error=error,
        duration_ms=duration_ms,
    ))

Lưu trace ra disk hoặc vào database sau mỗi run. Với volume thấp, JSON file theo ngày là đủ. Với production traffic, Postgres có JSONB column hoặc ClickHouse cho time-series analytics.

Pitfall tôi gặp: trace quá chi tiết. Ban đầu tôi log toàn bộ messages_in mỗi LLM call, bao gồm cả history từ đầu. Một task 10 bước có history nhân dần: step 1 ghi 1KB, step 2 ghi 2KB, …, step 10 ghi 10KB. Một trace có thể lên đến 50KB. Với 1000 trace/ngày là 50MB JSON thuần. Sau một tuần, disk đầy. Fix: chỉ log delta (messages mới thêm vào mỗi bước), reconstruct full history khi cần replay.

Kỹ thuật 2: replay trace với mock LLM

Replay là kỹ thuật chạy lại một trace đã có, nhưng thay thế LLM bằng mock trả về đúng response đã được lưu. Mục đích: kiểm tra xem code xử lý tool calls và logic xung quanh có đúng không, mà không cần gọi LLM thật (tốn tiền, nondeterministic, chậm).

from collections import deque
from anthropic import Anthropic
from unittest.mock import MagicMock


def build_mock_client_from_trace(trace: AgentTrace) -> Anthropic:
    """
    Tạo mock Anthropic client từ trace đã lưu.
    Mỗi lần gọi messages.create sẽ trả về LLM response tiếp theo từ trace.
    """
    llm_calls = deque([s for s in trace.steps if isinstance(s, LLMCall)])

    def mock_create(**kwargs):
        if not llm_calls:
            raise ValueError("Mock exhausted: more LLM calls than expected")
        recorded = llm_calls.popleft()
        response = MagicMock()
        response.content = _reconstruct_content(recorded.response_out)
        response.stop_reason = recorded.stop_reason
        response.usage = MagicMock(
            input_tokens=recorded.input_tokens,
            output_tokens=recorded.output_tokens,
        )
        return response

    mock_client = MagicMock(spec=Anthropic)
    mock_client.messages.create.side_effect = mock_create
    return mock_client


def replay_trace(trace: AgentTrace, agent_func, mock_tools: dict | None = None):
    """
    Replay một trace với mock LLM.
    mock_tools: dict tool_name -> callable để override tool implementations.
    """
    mock_client = build_mock_client_from_trace(trace)
    replay_trace_out = AgentTrace(task_input=trace.task_input)

    result = agent_func(
        task_input=trace.task_input,
        client=mock_client,
        mock_tools=mock_tools,
        trace=replay_trace_out,
    )

    return result, replay_trace_out

Replay dùng cho hai việc:

Bug investigation: khi user báo cáo lỗi, lấy trace của họ, replay lại locally, đặt breakpoint, debug như normal code. Không cần reproduce môi trường production.
Regression test cho code path: khi bạn sửa tool handler hoặc agent loop logic, replay trace cũ để verify code mới xử lý sequence tool calls giống như code cũ (hoặc tốt hơn một cách có kiểm soát).

Replay không thể test thay đổi hành vi LLM. Nếu bạn thay model hoặc system prompt, trace cũ không còn valid vì LLM sẽ ra response khác. Đó là lúc cần golden set.

Kỹ thuật 3: golden set

Golden set là tập 50-100 task có ground truth rõ ràng: input đã biết, expected output đã định nghĩa, kết quả chấp nhận được đã ghi lại.

Cấu trúc một golden case:

@dataclass
class GoldenCase:
    case_id: str
    description: str
    input: str
    expected_tool_calls: list[dict]   # Tool nào được gọi, args xấp xỉ gì
    expected_output_contains: list[str]  # Strings phải có trong output
    expected_output_excludes: list[str]  # Strings không được xuất hiện
    max_steps: int = 15               # Vượt ngưỡng này là fail
    max_tokens: int = 8000
    tags: list[str] = field(default_factory=list)

Ví dụ một golden case cho agent quản lý task:

GoldenCase(
    case_id="TC-001",
    description="Tạo task mới và assign cho user",
    input="Tạo task 'Review PR #234' và assign cho alice@company.com",
    expected_tool_calls=[
        {"tool": "create_task", "args_contains": {"title": "Review PR #234"}},
        {"tool": "assign_task", "args_contains": {"assignee": "alice@company.com"}},
    ],
    expected_output_contains=["task", "alice"],
    expected_output_excludes=["error", "failed", "không thể"],
    max_steps=6,
    tags=["create", "assign", "happy-path"],
)

Viết evaluator chạy golden set:

import re
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    score: float          # 0.0 đến 1.0 cho partial credit
    failures: list[str]
    trace: AgentTrace


def evaluate_case(case: GoldenCase, agent_func, client) -> EvalResult:
    """Chạy một golden case và tính score."""
    trace_store = []
    output = agent_func(
        task_input=case.input,
        client=client,
        trace_store=trace_store,
    )
    trace = trace_store[0] if trace_store else AgentTrace()
    failures = []
    scores = []

    # Check output contains
    for expected in case.expected_output_contains:
        if expected.lower() in output.lower():
            scores.append(1.0)
        else:
            scores.append(0.0)
            failures.append(f"Output thiếu: '{expected}'")

    # Check output excludes
    for excluded in case.expected_output_excludes:
        if excluded.lower() not in output.lower():
            scores.append(1.0)
        else:
            scores.append(0.0)
            failures.append(f"Output chứa chuỗi bị cấm: '{excluded}'")

    # Check tool calls sequence
    actual_tool_calls = [s for s in trace.steps if isinstance(s, ToolCall)]
    for expected_call in case.expected_tool_calls:
        matched = _check_tool_call_match(actual_tool_calls, expected_call)
        scores.append(1.0 if matched else 0.0)
        if not matched:
            failures.append(f"Tool call không khớp: {expected_call}")

    # Check step count
    step_count = len([s for s in trace.steps if isinstance(s, LLMCall)])
    if step_count <= case.max_steps:
        scores.append(1.0)
    else:
        scores.append(0.0)
        failures.append(f"Quá nhiều steps: {step_count} > {case.max_steps}")

    score = sum(scores) / len(scores) if scores else 0.0
    passed = len(failures) == 0

    return EvalResult(
        case_id=case.case_id,
        passed=passed,
        score=score,
        failures=failures,
        trace=trace,
    )


def _check_tool_call_match(actual_calls: list[ToolCall], expected: dict) -> bool:
    """Kiểm tra xem expected tool call có xuất hiện trong actual không."""
    for call in actual_calls:
        if call.tool_name != expected["tool"]:
            continue
        # Partial match: chỉ check các key trong args_contains
        args_ok = all(
            str(v).lower() in str(call.input.get(k, "")).lower()
            for k, v in expected.get("args_contains", {}).items()
        )
        if args_ok:
            return True
    return False

Partial credit quan trọng hơn pass/fail binary. Một case có 5 assertion, agent làm đúng 4: score là 0.8, không phải 0. Với 80 case, pass rate và average score cho hai góc nhìn khác nhau. Pass rate 85% nhưng average score 0.93 nghĩa là các case fail chỉ thiếu một assertion nhỏ, không phải fail hoàn toàn. Pass rate 85% nhưng average score 0.70 là vấn đề khác.

Pitfall: golden set drift. Sáu tháng sau khi build golden set, chúng tôi nhận ra 15 case đã lỗi thời: product thay đổi flow, tool có tên mới, một số expected output không còn hợp lệ. Pass rate tụt xuống 76%, nhưng một phần vì golden set sai, không phải agent kém đi. Phải review và update golden set mỗi sprint hoặc khi có major product change. Treat golden set như test code: nó cần được maintain.

Kỹ thuật 4: regression suite trong CI

Regression suite là tập hợp golden case chạy tự động trước mỗi lần deploy. Không phải chạy toàn bộ 80-100 case (chậm và tốn tiền), mà chọn lọc 20-30 case critical path chạy trong CI.

Cấu trúc một regression run:

import sys


def run_regression_suite(
    suite: list[GoldenCase],
    agent_func,
    client,
    pass_threshold: float = 0.85,
) -> bool:
    """
    Chạy regression suite, trả về True nếu đủ điều kiện deploy.
    pass_threshold: tỉ lệ case pass tối thiểu.
    """
    results = []
    for case in suite:
        result = evaluate_case(case, agent_func, client)
        results.append(result)
        status = "PASS" if result.passed else "FAIL"
        print(f"[{status}] {case.case_id}: {case.description}")
        if not result.passed:
            for f in result.failures:
                print(f"  - {f}")

    total = len(results)
    passed = sum(1 for r in results if r.passed)
    pass_rate = passed / total
    avg_score = sum(r.score for r in results) / total

    print(f"\nResults: {passed}/{total} passed ({pass_rate:.1%})")
    print(f"Average score: {avg_score:.3f}")

    if pass_rate < pass_threshold:
        print(f"FAIL: pass rate {pass_rate:.1%} < threshold {pass_threshold:.1%}")
        return False

    print(f"PASS: meets threshold {pass_threshold:.1%}")
    return True


# Trong CI script:
if __name__ == "__main__":
    from your_agent import agent_func, build_client
    from your_golden_set import REGRESSION_SUITE

    ok = run_regression_suite(
        suite=REGRESSION_SUITE,
        agent_func=agent_func,
        client=build_client(),
        pass_threshold=0.85,
    )
    sys.exit(0 if ok else 1)

Tích hợp vào GitHub Actions:

# .github/workflows/agent-regression.yml
name: Agent Regression

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python tests/regression_suite.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Lưu ý quan trọng: regression suite trong CI phải gọi LLM thật. Không phải replay từ trace. Mục đích là phát hiện regression từ model update, prompt change, hoặc tool change. Dùng model nhỏ hơn cho CI để tiết kiệm: Haiku thay vì Sonnet nếu task đủ đơn giản, hoặc budget token nhỏ hơn.

Với 25 case CI, mỗi case trung bình 500 output tokens, Haiku giá $0.80/MTok output: 25 x 500 = 12,500 tokens = $0.01 mỗi lần chạy CI. Rất rẻ để có safety net.

Những metric nên theo dõi

Bốn số cần dashboard:

Metric	Tính như thế nào	Threshold ví dụ
Pass rate	Số case pass / tổng case	>= 85%
Average score	Trung bình partial credit score	>= 0.90
P95 step count	95th percentile số LLM calls per task	<= 12
P95 token per task	95th percentile tổng token per task	<= 10,000

Pass rate và average score đo quality. Step count và token đo efficiency. Cả bốn cần xem cùng nhau: pass rate cao nhưng P95 token tăng gấp đôi sau một sprint là dấu hiệu agent đang “overthink” vì prompt change nào đó.

Tooling thực tế

Ba công cụ phổ biến cho agent observability và eval:

Tool	Điểm mạnh	Điểm yếu	Giá
Langfuse	Open source, self-host được, trace UI đẹp, dataset/eval built-in	Cần setup nếu self-host	Free self-host; cloud từ $0
Phoenix (Arize)	Trace + eval trong một, OpenTelemetry native, local dev dễ	Ít dùng trong community VN	Free local; cloud có phí
Braintrust	Dataset management tốt nhất, CI/CD integration sẵn, human eval UI	Không tự host được	Từ $0 (có free tier)

Khuyến nghị thực tế: nếu team nhỏ hoặc dữ liệu sensitive, dùng Langfuse self-hosted trên một VPS nhỏ. Nếu startup cần move fast và không muốn ops, Braintrust cloud là lựa chọn tốt. Phoenix phù hợp nếu team đã dùng OpenTelemetry stack.

Trước khi dùng bất kỳ tool nào, đảm bảo bạn đã có trace structure rõ ràng (Kỹ thuật 1). Tool chỉ giúp visualize và query, không thể thay thế việc thiết kế trace đúng.

Cách tôi sẽ triển khai theo giai đoạn

Không cần làm cả bốn kỹ thuật cùng lúc. Thứ tự hợp lý:

Tuần 1-2: Thêm trace logging vào agent đang chạy. Chưa cần làm gì với trace, chỉ cần data. Sau một tuần, bạn có đủ trace thật để hiểu agent đang làm gì.

Tuần 3-4: Từ trace thật, pick 20 case đại diện, viết thành golden case. Chạy manual, check pass rate. Đây là baseline.

Tháng 2: Thêm replay để debug bug nhanh hơn. Thêm CI chạy golden set khi open PR.

Tháng 3+: Mở rộng golden set lên 80-100 case. Thêm partial credit scoring. Setup Langfuse hoặc tool tương đương để team có thể browse trace.

Bốn kỹ thuật này không phải lý thuyết. Chúng là minimum viable eval infrastructure mà một team 2-3 người có thể build trong 4-6 tuần trong khi vẫn ship feature.

Bảng so sánh nhanh eval tools

Tính năng	Tự build (code trên)	Langfuse	Phoenix	Braintrust
Trace logging	Code thủ công	SDK tích hợp	SDK tích hợp	SDK tích hợp
Trace UI	Không có	Rất tốt	Tốt	Tốt
Dataset management	JSON files	Built-in	Built-in	Rất tốt
CI/CD integration	Script Python	Script + API	Script + API	Native
Self-host	Có (là code của bạn)	Có	Có	Không
Human eval UI	Không có	Cơ bản	Không	Rất tốt
Effort setup	Cao	Trung bình	Thấp	Thấp

Chốt lại: đừng deploy mù

Eval agent là phần bị bỏ qua nhiều nhất khi team build agent. Không phải vì không biết cần, mà vì nó không sexy bằng build thêm feature. Nhưng đây là phần quyết định bạn có thể ship với confidence hay không.

Bốn kỹ thuật: trace logging để hiểu agent đang làm gì, replay để debug nhanh, golden set để đo quality, regression suite để không bao giờ deploy mù. Không cần phải hoàn hảo từ đầu. Bắt đầu từ trace logging, phần còn lại sẽ xây dần lên.

Nếu chưa có trace, mọi tranh luận về quality đều là cảm giác. Khi đã đo được agent làm đúng hay sai, bước kế tiếp là đo nó tốn bao nhiêu tiền và chậm ở đâu: Cost và latency: token budget, streaming, prompt caching.