On-call for agents: monitoring, alerts, rollback, A/B testing

The first time I was on call for a production agent, an alert reached me at 2 a.m.: “cost anomaly detected, spending 40x baseline”. I opened my laptop, pulled up the dashboard, and found an agent stuck in a near-infinite loop because a tool was returning malformed JSON that the code did not handle. Each iteration cost 8,000 tokens. In 45 minutes it had run 340 iterations across 200 concurrent requests.

It was not an agent bug. There was no circuit breaker, no cost anomaly alert sensitive enough, and no rollback plan. Together those three gaps added up to a $1,400 incident in a single night.

This is the last post in the series. Once the agent is built (post 1), evaluated (post 21), cost-optimized (post 22), and its failure modes are understood (post 23), the remaining question is a very practical one: when it breaks at 2 a.m., where do you look first?

The metrics I want to see first

An agent is not an ordinary web service. Web service metrics (request rate, error rate, latency) are still needed, but they are not enough. Agent monitoring works with four metric layers of its own.

Task-level metrics

These are the most important metrics, and the ones the fewest people measure.

Task success rate: The share of tasks that actually achieve their goal. This is not the same as HTTP 200. An agent that returns a response without crashing counts as a 200, but the task can still fail because the LLM misread the intent, called a tool with the wrong arguments, or used up its iterations and timed out. You usually have to define “success” yourself from the business context:

def is_task_success(task_result: TaskResult) -> bool:
    # TaskResult is assumed to be the application's own result type.
    # Success criteria are business-specific, so branch on the goal type.
    if task_result.goal_type == "code_generation":
        return (task_result.output.contains_runnable_code
                and task_result.test_passed)
    if task_result.goal_type == "data_extraction":
        return task_result.output.fields_extracted == task_result.expected_fields
    return task_result.final_state == "completed"

Task completion rate vs abandon rate: How many tasks run to the finish, and how many are abandoned midway (user cancel, timeout, max_iterations exceeded). A sudden jump in abandon rate is a sign the agent is getting stuck on a new kind of input.

Goal alignment score: If you use LLM-as-judge for eval (post 21), stream this metric to the dashboard continuously rather than computing it only during offline eval.

Cost metrics

Post 22 covered these in detail, but in a monitoring context you need a few more:

Cost per task: Total input + output tokens per task, converted to USD. This metric must be broken down by task type. It is normal for a code-debugging agent to cost 5x more than an email-summarizing agent. But if both rise 3x over last week's baseline, something is wrong.

Iterations per task: The number of loop iterations per task, tracked as an average and at P50/P99. When iterations go up and task success does not, the LLM is usually struggling with new input the prompt does not cover.

Token waste ratio: The share of tokens spent on retries and error recovery out of the total. If 30% of tokens go to retries, the prompt or the tool schema has a problem.
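The cost metrics above can be computed from per-task usage records. The following is a minimal sketch, not the series' actual pipeline: the `TaskUsage` record, its field names, and the per-million-token prices are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean

# Assumed example prices in USD per million tokens; use your provider's real rates.
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0


@dataclass
class TaskUsage:
    """Hypothetical per-task usage record emitted by the agent loop."""
    task_type: str
    input_tokens: int
    output_tokens: int
    retry_tokens: int  # tokens spent on retries / error recovery (part of the total)


def cost_usd(usage: TaskUsage) -> float:
    """Convert one task's token usage to USD."""
    return (usage.input_tokens * INPUT_USD_PER_MTOK
            + usage.output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


def cost_per_task_by_type(usages: list[TaskUsage]) -> dict[str, float]:
    """Mean cost per task, broken down by task type."""
    costs: dict[str, list[float]] = {}
    for usage in usages:
        costs.setdefault(usage.task_type, []).append(cost_usd(usage))
    return {task_type: mean(values) for task_type, values in costs.items()}


def token_waste_ratio(usages: list[TaskUsage]) -> float:
    """Retry tokens divided by total tokens across all tasks."""
    total = sum(u.input_tokens + u.output_tokens for u in usages)
    return sum(u.retry_tokens for u in usages) / total if total else 0.0


usages = [
    TaskUsage("debug_code", 100_000, 10_000, 33_000),
    TaskUsage("summarize_email", 20_000, 2_000, 0),
]
by_type = cost_per_task_by_type(usages)
waste = token_waste_ratio(usages)
```

Comparing `by_type` against the same breakdown from the previous week is what separates “debugging is expensive” (expected) from “everything got 3x more expensive” (a problem).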

Latency metrics

Latency P50/P99 end-to-end: From the moment the user sends a request until the final response arrives. A complex agent can take 30-60 seconds. P99 matters more than P50 because the outliers are usually the cases where the agent got stuck for many iterations.

Time-to-first-token: With streaming, this metric reflects how responsive the agent feels to the user, even when total latency is still high.

Tool execution latency: Latency broken down per tool. One tool calling a slow external API slows the whole chain. Monitoring tool latency separately helps you find the bottleneck in the right place.

Hallucination-adjacent metrics

Confidence distribution: If the agent scores its own confidence (many agents have a verify step before committing an action), track the distribution of that score. A shift toward low confidence suggests the LLM is facing out-of-distribution input.

Tool call validity rate: The share of tool calls that include all required fields with the correct types. An LLM hallucinating the wrong schema is usually a sign of prompt drift or a model version change.

Unexpected refusal rate: The LLM refuses a request for no clear reason. This metric rises when the model provider's safety filter changes (which often happens after a model update).
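Tool call validity rate is the easiest of these to measure mechanically. Below is a rough sketch that checks required fields and basic types against a JSON-Schema-style tool definition. The `search` schema and the helper names are hypothetical, and the checker covers only a small subset of JSON Schema; a full validator such as the `jsonschema` package would be the more complete choice.

```python
# Maps JSON Schema type names to Python types (only the basic ones).
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def is_valid_tool_call(tool_input: object, schema: dict) -> bool:
    """True if all required fields are present and typed fields match."""
    if not isinstance(tool_input, dict):
        return False
    if any(name not in tool_input for name in schema.get("required", [])):
        return False
    properties = schema.get("properties", {})
    for name, value in tool_input.items():
        expected = properties.get(name, {}).get("type")
        py_type = _JSON_TYPES.get(expected)
        if py_type is None:
            continue  # unknown or untyped field: not checked in this sketch
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected in ("integer", "number"):
            return False
        if not isinstance(value, py_type):
            return False
    return True


def validity_rate(calls: list[tuple[str, object]], schemas: dict[str, dict]) -> float:
    """Share of (tool_name, tool_input) calls that pass validation."""
    if not calls:
        return 1.0
    valid = sum(
        1 for name, tool_input in calls
        if name in schemas and is_valid_tool_call(tool_input, schemas[name])
    )
    return valid / len(calls)


# Hypothetical tool schema and a batch of observed calls.
SCHEMAS = {
    "search": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["query"],
    }
}
calls = [
    ("search", {"query": "otel"}),                # valid
    ("search", {"limit": 5}),                     # missing required field
    ("search", {"query": "otel", "limit": "5"}),  # wrong type
    ("serach", {"query": "otel"}),                # tool name does not exist
]
rate = validity_rate(calls, SCHEMAS)
```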

Alert setup: don't let the dashboard be just for show

The alerting principles I use

Too many alerts and nobody reads them. Too few and incidents happen without anyone knowing. For agents, three kinds of alerts are needed, by priority:

Priority 1 (PagerDuty / page on-call):

Cost anomaly: spending rate > 10x baseline over 5 minutes. 10x sounds high, but it avoids false positives from legitimate traffic spikes. Combine it with request rate to tell the two apart: if spending rises 10x while requests rise only 2x, each request is costing 5x more, and that is a real anomaly.
Error rate > 20% over 10 minutes. The threshold is higher than for a web service because agents inherently fail more often.
P99 latency > 5 minutes. Past this threshold, the user experience is very poor.
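The cost anomaly rule, spending rate combined with request rate, can be sketched as a small classifier. This is an illustration under assumptions: the function name and return labels are made up, and the 3x per-request threshold is borrowed from the cost-per-task threshold used elsewhere in this post rather than being a rule stated for this alert.

```python
def classify_spend(
    spend_usd: float,
    baseline_spend_usd: float,
    requests: int,
    baseline_requests: int,
    spend_multiple: float = 10.0,       # page when spend exceeds 10x baseline
    per_request_multiple: float = 3.0,  # assumed cutoff for per-request cost growth
) -> str:
    """Classify one 5-minute window as 'ok', 'traffic_spike' or 'cost_anomaly'."""
    if baseline_spend_usd <= 0 or baseline_requests <= 0:
        raise ValueError("baselines must be positive")

    spend_ratio = spend_usd / baseline_spend_usd
    if spend_ratio <= spend_multiple:
        return "ok"

    # Spend is above the paging threshold: check whether traffic explains it.
    request_ratio = requests / baseline_requests
    per_request_ratio = (
        spend_ratio / request_ratio if request_ratio > 0 else float("inf")
    )
    if per_request_ratio >= per_request_multiple:
        return "cost_anomaly"
    return "traffic_spike"


# Spend 12x with requests 2x: each request costs 6x more.
anomaly = classify_spend(12.0, 1.0, 200, 100)
# Spend 12x with requests 11x: per-request cost is roughly flat.
spike = classify_spend(12.0, 1.0, 1100, 100)
```

Only `cost_anomaly` needs to page someone. A `traffic_spike` still deserves a look, but it is a capacity question rather than a runaway loop.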

Priority 2 (Slack / non-urgent):

Task success rate drops > 15% against the 7-day moving average.
Iterations per task rise > 50% over baseline.
Hallucination-adjacent metric drift: tool call validity rate falls below 90%.

Priority 3 (Dashboard / next business day):

Cost per task rising slightly but steadily (possibly prompt drift, or production data being harder than staging).
Abandon rate creeping up.
Confidence distribution shift.

OpenTelemetry trace for agents

This is the code pattern I use to instrument the agent loop with OpenTelemetry:

import time

import anthropic
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export spans in batches to an OTLP collector over gRPC.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.main")

client = anthropic.Anthropic()

# TOOLS (the tool definitions) and execute_tool(name, input) are assumed to be
# defined elsewhere in the application.

def run_agent_with_tracing(task_id: str, user_input: str, max_iter: int = 10):
    with tracer.start_as_current_span("agent.task") as task_span:
        task_span.set_attribute("task.id", task_id)
        task_span.set_attribute("task.input_len", len(user_input))
        task_span.set_attribute("agent.max_iter", max_iter)

        messages = [{"role": "user", "content": user_input}]
        total_input_tokens = 0
        total_output_tokens = 0
        iteration = 0
        # Stays "max_iter_exceeded" unless the loop exits early below.
        status = "max_iter_exceeded"
        final_content = None

        for i in range(max_iter):
            iteration = i + 1
            # One span per loop iteration, nested under the task span.
            with tracer.start_as_current_span("agent.iteration") as iter_span:
                iter_span.set_attribute("iteration.number", iteration)
                t0 = time.perf_counter()

                resp = client.messages.create(
                    model="claude-sonnet-4-6",
                    max_tokens=4096,
                    tools=TOOLS,
                    messages=messages,
                )

                iter_latency = time.perf_counter() - t0
                iter_span.set_attribute("iteration.latency_ms", iter_latency * 1000)
                iter_span.set_attribute("iteration.input_tokens", resp.usage.input_tokens)
                iter_span.set_attribute("iteration.output_tokens", resp.usage.output_tokens)
                iter_span.set_attribute("iteration.stop_reason", str(resp.stop_reason))

                total_input_tokens += resp.usage.input_tokens
                total_output_tokens += resp.usage.output_tokens
                messages.append({"role": "assistant", "content": resp.content})

                if resp.stop_reason == "end_turn":
                    iter_span.set_attribute("iteration.final", True)
                    status = "completed"
                    final_content = resp.content
                    break

                if resp.stop_reason != "tool_use":
                    # For example "max_tokens": there is no tool result to send
                    # back, so stop instead of sending two assistant turns in a row.
                    status = f"stopped_{resp.stop_reason}"
                    final_content = resp.content
                    break

                tool_results = []
                for block in resp.content:
                    if block.type == "tool_use":
                        # One span per tool call, so latency can be broken down by tool.
                        with tracer.start_as_current_span("tool.execute") as tool_span:
                            tool_span.set_attribute("tool.name", block.name)
                            t_tool = time.perf_counter()
                            result = execute_tool(block.name, block.input)
                            tool_span.set_attribute(
                                "tool.latency_ms", (time.perf_counter() - t_tool) * 1000
                            )
                            tool_results.append({
                                "type": "tool_result",
                                "tool_use_id": block.id,
                                "content": str(result),
                            })
                messages.append({"role": "user", "content": tool_results})

        # Prices are assumed here: USD 3 / 15 per million input / output tokens.
        cost_usd = (total_input_tokens * 3.0 + total_output_tokens * 15.0) / 1_000_000
        task_span.set_attribute("task.total_iterations", iteration)
        task_span.set_attribute("task.total_input_tokens", total_input_tokens)
        task_span.set_attribute("task.total_output_tokens", total_output_tokens)
        task_span.set_attribute("task.cost_usd", cost_usd)
        # A task that finishes on its last allowed iteration still counts as completed.
        task_span.set_attribute("task.status", status)

        # None when the iteration budget ran out before the model finished.
        return final_content

With this trace, Grafana (or Datadog, Honeycomb) can chart latency per iteration, cost per task, tool breakdown, and total iterations. That is enough to build alerts and dashboards.

Dashboard layout

An agent monitoring dashboard is not a Grafana clone of a web service dashboard. It has to be redesigned around the agent mental model.

Row 1: Health at a glance

Task success rate (7-day sparkline + current)
Cost per task (current vs 7-day baseline)
P50/P99 latency end-to-end
Error rate

Row 2: Agent behavior

Iterations per task distribution (histogram)
Tool call breakdown (pie chart: which tools called most, which fail most)
Abandon rate trend

Row 3: Cost breakdown

Total cost (hourly, daily)
Cost by task type
Token waste ratio (retry tokens / total tokens)

Row 4: Trace explorer

A link to Jaeger/Tempo for drilling down into a specific trace during an incident. This row is a link, not a number, but it belongs on the dashboard.

Roll out an agent like you would a risky system

Why a full rollout hurts

For a web service, a rollout means changing code, waiting for CI, then a blue-green or canary traffic shift. For an agent, a “new version” can be any of:

Model version change (GPT-4o-mini replacing GPT-4o)
Prompt change (rewritten system prompt)
Tool schema change (a new tool, a renamed field)
Temperature change
Max iterations change

Each of these changes affects behavior differently. A new model can score better on benchmarks and worse on your product's specific tasks. A prompt rewrite can be better for 90% of cases and break the 10% of edge cases that matter.

Unit tests cannot catch every regression because agent output is not deterministic. The only option is to roll out slowly and measure.

Canary pattern for agents

import hashlib
from typing import Optional

def get_agent_version(user_id: str, canary_percentage: float = 0.1) -> str:
    """
    Deterministic user bucketing for the canary.
    The same user_id always gets the same version for the whole experiment.
    """
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    if bucket < (canary_percentage * 100):
        return "v2_canary"
    return "v1_stable"

class AgentRouter:
    def __init__(self, canary_pct: float = 0.0):
        self.canary_pct = canary_pct

    def route(self, user_id: str, task: str) -> str:
        version = get_agent_version(user_id, self.canary_pct)
        if version == "v2_canary":
            return run_agent_v2(task)
        return run_agent_v1(task)

# Rollout stages: 0% → 5% → 10% → 25% → 50% → 100%
# Measure the metrics at each stage before going further.
# run_agent_v1 / run_agent_v2 are assumed to be defined elsewhere.
router = AgentRouter(canary_pct=0.05)

Rollout stages: start at 5%, hold for 24-48 hours, and compare the canary bucket's task success rate and cost per task against control. If the canary is no more than 5% worse on both metrics, go to 10%. Continue up to 100%.

24 hours per stage is the minimum because agent traffic is seasonal (peak during working hours, off-peak at night). You need at least one full 24-hour cycle for the result to mean anything.
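The promotion rule can be written down as a small gate so that “should we go to the next stage?” is not decided by feel. This is a sketch under assumptions: `BucketStats`, `canary_passes`, and `next_canary_pct` are hypothetical names, and “no more than 5% worse” is read here as a relative tolerance on each metric.

```python
from dataclasses import dataclass

# Same stages as the rollout plan: 0% → 5% → 10% → 25% → 50% → 100%.
STAGES = [0.0, 0.05, 0.10, 0.25, 0.50, 1.0]


@dataclass(frozen=True)
class BucketStats:
    """Hypothetical aggregate for one bucket over the hold period."""
    success_rate: float   # 0.0 - 1.0
    cost_per_task: float  # USD


def canary_passes(control: BucketStats, canary: BucketStats,
                  tolerance: float = 0.05) -> bool:
    """Canary must be within the relative tolerance on BOTH metrics."""
    success_ok = canary.success_rate >= control.success_rate * (1 - tolerance)
    cost_ok = canary.cost_per_task <= control.cost_per_task * (1 + tolerance)
    return success_ok and cost_ok


def next_canary_pct(current: float, control: BucketStats,
                    canary: BucketStats) -> float:
    """Next stage if the canary passes, otherwise 0.0 (back to stable)."""
    if not canary_passes(control, canary):
        return 0.0
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else current


control = BucketStats(success_rate=0.80, cost_per_task=0.10)
promote = next_canary_pct(0.05, control, BucketStats(0.79, 0.102))
rollback = next_canary_pct(0.05, control, BucketStats(0.70, 0.10))
```

On a failed gate the sketch drops straight back to 0% rather than one stage down. That is a design choice for illustration; stepping down one stage is also defensible.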

A/B test framework

The pitfall I see most: splitting traffic wrong

This is the mistake I have made myself and seen other teams make most often: splitting traffic by request instead of by user.

# WRONG: split by request
import random

def get_variant_wrong(request_id: str) -> str:
    # Every request lands in A or B at random, regardless of who sent it
    return "A" if random.random() < 0.5 else "B"

The problem: the same user in the same session can get A, then B, then A. Agent B may behave completely differently from A, so the user sees inconsistency and has a bad experience. Worse, the metrics become unreliable because the two groups contaminate each other.

# RIGHT: split by user, deterministic (needs `import hashlib`)
def get_variant_correct(user_id: str, experiment_id: str) -> str:
    seed = f"{experiment_id}:{user_id}"
    bucket = int(hashlib.md5(seed.encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

With a split by user, the same user always gets the same variant for the whole experiment. The experience is consistent and the metrics are clean.

Structure of a more correct A/B test

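import hashlib
import time
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ExperimentResult:
    experiment_id: str
    user_id: str
    variant: str
    task_id: str
    timestamp: datetime
    task_success: bool
    cost_usd: float
    iterations: int
    latency_ms: float

class AgentExperiment:
    # control_fn and treatment_fn take a task and return (result, metadata),
    # where metadata has "success", "cost_usd" and "iterations" keys.
    def __init__(self, experiment_id: str, control_fn, treatment_fn,
                 treatment_pct: float = 0.5):
        self.experiment_id = experiment_id
        self.control = control_fn
        self.treatment = treatment_fn
        self.treatment_pct = treatment_pct

    def _variant(self, user_id: str) -> str:
        # Deterministic per-user bucketing that honors treatment_pct;
        # at 0.5 this matches the 50/50 split shown earlier.
        seed = f"{self.experiment_id}:{user_id}"
        bucket = int(hashlib.md5(seed.encode()).hexdigest(), 16) % 100
        return "A" if bucket < (1 - self.treatment_pct) * 100 else "B"

    def run(self, user_id: str, task: str) -> tuple[str, ExperimentResult]:
        variant = self._variant(user_id)
        task_id = str(uuid.uuid4())

        t0 = time.perf_counter()
        if variant == "B":
            result, metadata = self.treatment(task)
        else:
            result, metadata = self.control(task)
        latency_ms = (time.perf_counter() - t0) * 1000

        exp_result = ExperimentResult(
            experiment_id=self.experiment_id,
            user_id=user_id,
            variant=variant,
            task_id=task_id,
            timestamp=datetime.now(timezone.utc),
            task_success=metadata["success"],
            cost_usd=metadata["cost_usd"],
            iterations=metadata["iterations"],
            latency_ms=latency_ms,
        )
        # log_experiment_result is assumed to be defined elsewhere (e.g. a DB write).
        log_experiment_result(exp_result)
        return result, exp_result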

Minimum sample size

Agent task success rates are typically 70-85%. To detect a 5% (absolute) difference with 80% power, you need roughly 800-1,000 samples per variant. At low traffic (100 tasks/day split evenly across two variants), that takes 16-20 days. Do not conclude early because the p-value looks good after 2 days when the sample is still too small.
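For readers who want to check the sample-size figure for their own baseline, here is a sketch using the standard normal-approximation formula for comparing two proportions. The two-sided 5% significance level is an assumption (the text only fixes the power at 80%), and the function names are made up for this example.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_control: float, p_treatment: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed in EACH variant to detect p_control vs p_treatment.

    Normal approximation for a two-sided two-proportion test:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


def days_needed(n_per_variant: int, tasks_per_day: int, variants: int = 2) -> int:
    """Days of traffic needed when tasks_per_day is shared by all variants."""
    return ceil(n_per_variant * variants / tasks_per_day)


n = sample_size_per_variant(0.80, 0.85)   # about 900 per variant
days = days_needed(n, tasks_per_day=100)  # total traffic, split two ways
```

The required size is per variant, so the total traffic needed is twice that number. With a per-user split, tasks from the same user are also not fully independent, so treat the result as a lower bound.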

Have a rollback plan before you deploy

Pin model version

Providers release new models constantly. When Anthropic ships Claude Sonnet 5, Claude Sonnet 4.6 will stay available for a while, but not forever. The strategy: pin a specific version in config and do not use a “latest” alias.

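import os

# Don't do this
model = "claude-sonnet-latest"

# Do this: pin a dated version, overridable via env var
# (the ID below is illustrative; check your provider's model list for exact IDs)
MODEL_VERSION = os.getenv("AGENT_MODEL_VERSION", "claude-sonnet-4-6-20250514")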

When you want to upgrade the model, set up a separate experiment, measure the metrics, and only then update the config. Do not upgrade production directly.

Pin prompt version

Production prompts must be version-controlled like code. Do not hardcode the prompt inline where it is used. Load it from config or a prompt registry:

import os
from typing import Optional

def _version_key(version: str) -> tuple[int, ...]:
    # "v1.10" -> (1, 10), so v1.10 sorts after v1.9 (plain string sorting would not)
    return tuple(int(part) for part in version.lstrip("v").split("."))

class PromptRegistry:
    def __init__(self):
        self._prompts = {}

    def register(self, name: str, version: str, text: str):
        key = f"{name}:{version}"
        self._prompts[key] = text

    def get(self, name: str, version: Optional[str] = None) -> str:
        if version:
            return self._prompts[f"{name}:{version}"]
        # No version given: return the highest registered version
        versions = [k.split(":", 1)[1] for k in self._prompts
                    if k.startswith(f"{name}:")]
        if not versions:
            raise KeyError(f"No prompt registered for {name}")
        latest = max(versions, key=_version_key)
        return self._prompts[f"{name}:{latest}"]

registry = PromptRegistry()
registry.register("code_review_agent", "v1.0", "You are a code reviewer...")
registry.register("code_review_agent", "v1.1", "You are a careful code reviewer...")

# The active version is selected by env var, so a rollback is a config change
ACTIVE_PROMPT_VERSION = os.getenv("CODE_REVIEW_PROMPT_VERSION", "v1.1")
system_prompt = registry.get("code_review_agent", ACTIVE_PROMPT_VERSION)

When a prompt change causes an incident, rolling back only takes changing an env var and restarting the process, with no code change or new build.

Rollback checklist

Trigger	Rollback action	Time to rollback
Task success rate drops > 20%	Revert prompt version via env var	< 5 min
Cost per task rises > 3x	Revert model version config	< 5 min
Error rate > 30%	Turn the agent off via feature flag, fall back to rule-based	< 2 min
Provider outage	Switch to backup provider	10-15 min
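The checklist can also live in code, so the on-call person gets a suggested action instead of having to remember the table at 2 a.m. This is a sketch only: the `Snapshot` type and the action names are hypothetical, the success-rate drop is read as relative to baseline, and checking the most severe trigger first is an assumption, not something the checklist specifies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical metric snapshot for the current window or the baseline."""
    success_rate: float   # 0.0 - 1.0
    cost_per_task: float  # USD
    error_rate: float     # 0.0 - 1.0
    provider_up: bool = True


def pick_rollback_action(current: Snapshot, baseline: Snapshot) -> Optional[str]:
    """Return the first matching rollback action, most severe trigger first."""
    if not current.provider_up:
        return "switch_to_backup_provider"
    if current.error_rate > 0.30:
        return "disable_agent_and_fall_back_to_rules"
    if current.cost_per_task > 3 * baseline.cost_per_task:
        return "revert_model_version"
    if current.success_rate < baseline.success_rate * (1 - 0.20):
        return "revert_prompt_version"
    return None  # nothing in the checklist has triggered


baseline = Snapshot(success_rate=0.80, cost_per_task=0.10, error_rate=0.05)
action = pick_rollback_action(
    Snapshot(success_rate=0.60, cost_per_task=0.11, error_rate=0.08), baseline
)
```

The function only suggests an action. Whether a human confirms it or automation applies it directly depends on how much the team trusts each trigger.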

Tables I want in the runbook

Metrics table

Metric	Type	Collect	Alert threshold
Task success rate	Task-level	Per task	< 70% (7-day avg)
Cost per task	Cost	Per task	> 3x baseline
Iterations per task (P99)	Cost	Per task	> 2x baseline
Token waste ratio	Cost	Per task	> 30%
Latency P50	Latency	Per request	> 30s
Latency P99	Latency	Per request	> 5 min
Tool call validity rate	Hallucination	Per tool call	< 90%
Abandon rate	Task-level	Per task	> 15% (7-day avg)
Error rate	Reliability	Per request	> 20% in 10 min
Spending rate	Cost	Rolling 5 min	> 10x baseline

Alert routing table

Alert	Severity	Route	Action
Cost anomaly (10x, 5 min)	P1	PagerDuty	Circuit breaker, investigate trace
Error rate > 20% (10 min)	P1	PagerDuty	Rollback or feature flag off
P99 latency > 5 min	P1	PagerDuty	Check tool latency, model status
Success rate drop > 15%	P2	Slack	Investigate, canary rollback
Iterations spike > 50%	P2	Slack	Check prompt, new input patterns
Tool validity < 90%	P2	Slack	Check model version, prompt
Cost trend up (daily)	P3	Dashboard	Review, no immediate action
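The P1 action for a cost anomaly is “circuit breaker”, which was exactly what was missing in the incident at the start of this post. A minimal per-task sketch follows, assuming the agent loop calls it after every model response and every tool call. The class name, exception, and limits are illustrative, not taken from any library.

```python
from dataclasses import dataclass


class CircuitOpen(RuntimeError):
    """Raised when the task must stop instead of looping further."""


@dataclass
class TaskCircuitBreaker:
    max_tokens: int = 200_000        # assumed per-task token budget
    max_consecutive_errors: int = 3  # assumed limit on back-to-back tool failures
    tokens_used: int = 0
    consecutive_errors: int = 0

    def record_usage(self, input_tokens: int, output_tokens: int) -> None:
        """Call after each model response; trips when the budget is exceeded."""
        self.tokens_used += input_tokens + output_tokens
        if self.tokens_used > self.max_tokens:
            raise CircuitOpen(f"token budget exceeded: {self.tokens_used}")

    def record_tool_result(self, ok: bool) -> None:
        """Call after each tool call; trips on repeated failures in a row."""
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            raise CircuitOpen("too many consecutive tool errors")


# A tool that keeps failing (e.g. returning malformed JSON) trips the breaker
# on the third failure instead of burning tokens for hundreds of iterations.
breaker = TaskCircuitBreaker(max_tokens=20_000, max_consecutive_errors=3)
tripped_on = None
for attempt in range(1, 11):
    try:
        breaker.record_usage(input_tokens=5_000, output_tokens=500)
        breaker.record_tool_result(ok=False)
    except CircuitOpen:
        tripped_on = attempt
        break
```

The breaker is deliberately per task. A fleet-wide spending cap is a separate control and belongs next to the cost anomaly alert.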

Closing the loop

Post 1 of this series started with a simple definition: an agent is an LLM plus tools plus memory plus a loop. Post 25 ends with a different realization: that loop does not end when the agent returns a response. The real loop also includes measuring what the agent is doing, alerting when something goes wrong, rolling back when it breaks, and improving through each experiment.

That is why this post is called “on-call for agents”. Not because agents need someone on duty 24/7 (the goal is the opposite), but because once an agent is in production it has to be treated like a real service: with SLOs, monitoring, a runbook, and an on-call rotation.

Looking back at the 5 parts of the series

Part 1: Foundation (posts 1-5). From “what is an agent” to a working agent in 100 lines of Python. The most important mental model: the LLM has no state, the control loop is where things most easily go wrong, and max_iterations is a budget, not a safety net.

Part 2: Planning and Reasoning (posts 6-10). From ReAct to Plan-and-Execute, Tree of Thoughts, and self-reflection. An agent does not just react. It can plan, verify, and retry strategically.

Part 3: Tools and Environment (posts 11-15). Tool design, code sandbox, browser automation, RAG in the loop, MCP. This part matters for production because an agent is only as strong as its tool set, and a poorly designed tool set is the root of most incidents.

Part 4: Multi-agent (posts 16-20). Supervisor, handoff, specialized roles, framework comparison. Multi-agent is not the answer to everything. Knowing when not to use it matters more than knowing how to use it.

Part 5: Production (posts 21-25). Eval, cost optimization, failure modes, security, and monitoring. This is the part most often skipped when building an agent for the first time, and the one that causes the most headaches in production.

If you want to go further

The series stops at post 25, but the agent field does not. Three directions worth watching:

Reinforcement learning from human feedback for agents (RLHF-agent): Instead of evaluating offline and tuning prompts by hand, train the agent to improve from user signals. RL itself is not new, but applying it to an agent loop with long-horizon rewards is an open problem.

Constitutional AI and value alignment for agents: As agents get more capable, alignment risk grows. Constitutional AI (Anthropic), principle-based guardrails, and scalable oversight are fast-moving research directions. The more practical question for developers is how to build value constraints into an agent without reducing its capability.

On-device agents: Agents that run locally on a phone or laptop without calling a cloud API. Llama 3 8B, Qwen 2.5 7B, and Phi-4 run on Apple Silicon. Privacy-first, low latency, offline capable. The trade-offs: lower capability, a smaller context window, and the need for quantization. Still, this trend will shape how agents are designed over the next 2-3 years.

Hands-on projects worth doing

If you have finished the series and want to practice, here are five small projects that together consolidate all the concepts:

Code review agent: Takes a diff, comments line by line, suggests fixes. Test it against a golden set of real PRs. Measure hallucination rate on large files. Implement a cost cap per review.
Research agent: Takes a question, searches on its own, synthesizes sources, cites them. Use the RAG pattern from post 14. Measure citation accuracy (LLM-as-judge).
Incident response agent: Takes a PagerDuty alert, queries logs, suggests a root cause. Multi-step, multi-tool. This agent has the largest blast radius on the list, so use the tool sandboxing from post 12.
Multi-agent debate: Two agents argue over a decision and a moderator agent concludes. Use the pattern from post 16. Measure consensus rate and decision quality against a single agent.
Production monitoring agent: An agent that monitors its own metrics (meta), detects anomalies, and writes the incident summary. This one ties together all of Part 5.

There is no next post. This is the last one. But the agent loop does not really end at end_turn. It ends when you can measure that the agent is doing the right thing, and when it does the wrong thing, the person on call knows where to roll back, what to turn off, and which trace to read first.