Deduplication and throttling in Kibana: avoiding alert fatigue

One SRE team's #alerts channel once hit 14,000 messages in a single week. Nobody read it anymore. When a real incident happened, the triggering message was buried in the noise, and it took 40 minutes before anyone noticed. That is alert fatigue: the alerting system stops working because it is saturated.

This post closes the alerting part of the series. With rules, connectors and SLOs in place, the last piece is making sure every alert that reaches on-call is worth reading.

After reading, you should understand:

The difference between alert state, notification and action.
Dedup at the rule level and at the connector level.
Throttling actions to cut spam without missing critical alerts.
Routing by severity.
Auditing and tuning rules based on acknowledge rate.

Mental model of the notification flow

Before tuning anything, you need to know exactly where the spam comes from.

The sequence an alert goes through:

1. Rule runs on its schedule (every 1m)
2. Evaluate condition -> match document/metric
3. Alert instance state changes (recovered -> active, active -> recovered)
4. Notification triggered (according to the notifyWhen setting)
5. Action runs (once per connector)
6. Connector dispatches (HTTP/SMTP/PagerDuty)
7. External service deduplicates (if supported)

Spam can originate at steps 4, 5, 6 or 7. Each step has its own fix pattern.

notifyWhen

The most important rule-level setting. There are three values; the fourth row below is a common combination rather than a separate value:

notifyWhen	Behavior	When to use
`onActionGroupChange`	Notify when the state changes (active->recovered or vice versa)	Good default for most cases
`onActiveAlert`	Notify on every rule run while the alert is still active	When continuous updates are needed (rare)
`onThrottleInterval`	Notify every N minutes while the alert is still active	Long-running incidents where you want reminder pings
`onActionGroupChange` (new alert) + Recovery action	Trigger + recover, no spam in between	Standard

Setup via the UI: in the rule editor, pick a value from the dropdown in the Notify section.

Setup via the API:

{
  "notify_when": "onActionGroupChange",
  "throttle": null
}

Or with a throttle interval:

{
  "notify_when": "onThrottleInterval",
  "throttle": "1h"
}

Putting it together: if the rule checks every minute and the alert stays active for 6 hours straight, then with throttle: 1h on-call receives exactly 6 pings (one per hour) instead of 360.
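To make the 6-versus-360 arithmetic concrete, here is a minimal simulation of the three notifyWhen modes. It is a simplified model, not Kibana's implementation: the function name `count_notifications` and its parameters are made up for illustration, the alert is assumed active on every check, and recovery notifications are not counted.

```python
def count_notifications(notify_when, active_minutes, check_every=1, throttle_minutes=None):
    """Count notifications sent while a single alert stays active.

    Simplified model: the rule runs every `check_every` minutes and the
    alert is active on every run. Recovery notifications are ignored.
    """
    sent = 0
    last_sent = None  # minute of the most recent notification
    for minute in range(0, active_minutes, check_every):
        if notify_when == "onActiveAlert":
            notify = True  # every run notifies
        elif notify_when == "onActionGroupChange":
            notify = last_sent is None  # only the transition into active
        elif notify_when == "onThrottleInterval":
            notify = last_sent is None or minute - last_sent >= throttle_minutes
        else:
            raise ValueError(f"unknown notify_when: {notify_when}")
        if notify:
            sent += 1
            last_sent = minute
    return sent


six_hours = 6 * 60
print(count_notifications("onActiveAlert", six_hours))                            # 360
print(count_notifications("onThrottleInterval", six_hours, throttle_minutes=60))  # 6
print(count_notifications("onActionGroupChange", six_hours))                      # 1
```

The throttled run notifies at minutes 0, 60, 120, 180, 240 and 300, which is where the 6 pings come from.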

Throttling at the action level

Kibana 8.5+ lets you throttle each action separately, not just at the rule level. This matters when a single rule sends to both Slack (high frequency is fine) and PagerDuty (needs strict dedup).

In the UI: the action editor has a dropdown under Run when:

onActionGroupChange: standard
onActiveAlert: fires on every check (spammy)
Custom: pick a frequency

A typical pattern:

Action	Throttle	Reason
Slack #alerts	10 minutes	The channel can tolerate reminders
Email backup	1 hour	Too many emails become an annoyance
PagerDuty	1 minute (dedupKey)	Pages on-call, deduplicated at PagerDuty
Webhook log	onActionGroupChange	Only log state changes

Grouping alerts at the rule level

Instead of letting a rule fire 100 alerts in parallel (one per entity), roll them up into a single aggregate alert.

Use case: top N services failing

Take a rule “Error rate > 5%” with group by service.name. If a bad deploy goes out, 30 services fail at once. The rule produces 30 parallel alerts → 30 Slack messages → a spammed channel.

Fix 1: instead of group by, use an aggregate filter:

"params": {
  "searchType": "esQuery",
  "esQuery": "...",
  "size": 0,
  "threshold": [100],
  "thresholdComparator": ">",
  "aggType": "count",
  "groupBy": "all"
}

With groupBy: all instead of groupBy: top, the rule produces 1 alert “total > 100” instead of 30 alerts.

Fix 2: use the context.groups metadata in the message template to list the affected services:

Alert: high error rate across {{context.groups.length}} services
Services: {{#context.groups}}{{.}}, {{/context.groups}}
View: {{context.viewInAppUrl}}

One message with full context, not 30 separate messages.
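As a rough picture of what the channel receives, the sketch below builds the same three-line message in plain Python. It does not use Kibana's template engine: `render_grouped_message` is a hypothetical helper, and the service names and URL are placeholders.

```python
def render_grouped_message(groups, view_url):
    """Build one summary message for all failing groups.

    Mirrors the three-line template above. Unlike the template loop,
    join() does not leave a trailing comma after the last service.
    """
    services = ", ".join(groups)
    return (
        f"Alert: high error rate across {len(groups)} services\n"
        f"Services: {services}\n"
        f"View: {view_url}"
    )


# Placeholder data: 30 failing services, one hypothetical rule URL.
failing = [f"service-{n:02d}" for n in range(1, 31)]
message = render_grouped_message(failing, "https://kibana.example.com/rule/abc")
print(message.splitlines()[0])
```

However many services fail, the channel gets a single message whose first line carries the count.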

Dedup at the connector level

Each connector type has its own dedup mechanism.

PagerDuty dedupKey

{
  "dedupKey": "{{rule.id}}-{{context.group}}",
  "eventAction": "trigger"
}

PagerDuty merges all events with the same dedupKey into 1 incident. When the alert recovers, send an event with the same dedupKey and eventAction: "resolve" to auto-close it.

Pitfall: if the dedupKey changes between triggers (because a template variable changes), each event becomes a separate incident. Test the template by triggering twice in a row and checking whether the events dedup.
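The trigger-twice test can be rehearsed offline with a toy model of the dedup behavior. `IncidentStore` below is a stand-in written for this sketch and not the PagerDuty client or API; it only assumes what the text states: events sharing a key merge into one incident, and a resolve closes the incident with the matching key.

```python
def dedup_key(rule_id, group):
    """Build the key only from identifiers that stay fixed between triggers."""
    return f"{rule_id}-{group}"


class IncidentStore:
    """Toy stand-in for key-based dedup: at most one open incident per key."""

    def __init__(self):
        self.open = {}  # dedup key -> number of trigger events merged into it

    def send(self, key, event_action):
        if event_action == "trigger":
            self.open[key] = self.open.get(key, 0) + 1
        elif event_action == "resolve":
            self.open.pop(key, None)  # a non-matching key closes nothing
        else:
            raise ValueError(f"unknown event_action: {event_action}")
        return len(self.open)  # number of incidents currently open


store = IncidentStore()
key = dedup_key("rule-123", "checkout")
store.send(key, "trigger")
print(store.send(key, "trigger"))                    # 1: second trigger merged
print(store.send("Error rate-checkout", "resolve"))  # 1: wrong key, still open
print(store.send(key, "resolve"))                    # 0: matching key closes it
```

The middle call shows why a resolve event built from a different template leaves the incident open.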

Slack thread

There is no native dedup, but there is a pattern: send the first message as the parent, then post subsequent updates into its thread.

The Slack API supports a thread_ts field, but the default Kibana connector does not expose it. Workaround: use a webhook connector instead of the native Slack connector, with a custom payload:

{
  "channel": "alerts-prod",
  "text": "{{rule.name}}",
  "thread_ts": "{{rule.id}}"
}

Note: thread_ts must be the timestamp of the original message, not the rule id, so the value in the payload above is only a placeholder. You need custom code on the receiver side to track it.
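A minimal sketch of that receiver-side tracking, under a few assumptions: the receiver gets the rule id from the webhook, `post_message` is a hypothetical callable that posts a payload to Slack and returns the `ts` of the created message, and state is kept in memory (a real receiver would persist it).

```python
class ThreadTracker:
    """Remember the parent message ts per rule so updates land in one thread."""

    def __init__(self, post_message):
        self._post = post_message  # hypothetical: callable(payload) -> ts string
        self._parent_ts = {}       # rule id -> ts of the parent message

    def notify(self, rule_id, channel, text):
        payload = {"channel": channel, "text": text}
        parent = self._parent_ts.get(rule_id)
        if parent is not None:
            payload["thread_ts"] = parent  # reply inside the existing thread
        ts = self._post(payload)
        if parent is None:
            self._parent_ts[rule_id] = ts  # first message becomes the parent
        return payload

    def resolve(self, rule_id):
        """Forget the thread so the next trigger starts a new parent message."""
        self._parent_ts.pop(rule_id, None)


# Fake poster for local testing: hands out increasing timestamps.
def make_fake_poster():
    counter = {"n": 0}

    def post(payload):
        counter["n"] += 1
        return f"1700000000.{counter['n']:06d}"

    return post


tracker = ThreadTracker(make_fake_poster())
first = tracker.notify("rule-123", "alerts-prod", "Error rate high")
second = tracker.notify("rule-123", "alerts-prod", "Still failing")
print("thread_ts" in first, second["thread_ts"])
```

Calling `resolve` when the alert recovers keeps a later, unrelated incident from being appended to an old thread.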

Email aggregation

Email is hard to dedup. Pattern: send an hourly email digest rather than one email per alert.

Setup: a rule with notifyWhen: onThrottleInterval and a 1-hour throttle. Every hour, the action sends 1 email covering all alerts active during the past hour (via context.results, if the rule type supports it).

Webhook idempotency

A custom endpoint has to dedup on its own. Pattern: use rule.id + alert.id + state as the idempotency key in the DB.

INSERT INTO alerts (key, ...) VALUES (?, ...) ON CONFLICT (key) DO UPDATE SET ...

Or use Redis SET with the NX option and a TTL (plain SETNX does not accept an expiry argument):

SET alert:rule_xxx:alert_yyy:active 1 NX EX 3600

If the key already exists → skip the notification.
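The same check-then-skip logic can be sketched with Python's built-in sqlite3 module. This is an illustration, not a production receiver: the table name `seen_alerts` and the helper names are made up, and unlike the Redis variant there is no TTL, so a key blocks repeats until the row is deleted.

```python
import sqlite3


def make_store():
    """Create an in-memory table keyed by the idempotency key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen_alerts (key TEXT PRIMARY KEY)")
    return conn


def should_notify(conn, rule_id, alert_id, state):
    """Return True only the first time this rule/alert/state combination is seen."""
    key = f"{rule_id}:{alert_id}:{state}"
    # INSERT OR IGNORE leaves an existing row untouched; rowcount tells us
    # whether a new row was actually written.
    cur = conn.execute("INSERT OR IGNORE INTO seen_alerts (key) VALUES (?)", (key,))
    conn.commit()
    return cur.rowcount == 1


conn = make_store()
print(should_notify(conn, "rule_xxx", "alert_yyy", "active"))     # True: first delivery
print(should_notify(conn, "rule_xxx", "alert_yyy", "active"))     # False: duplicate
print(should_notify(conn, "rule_xxx", "alert_yyy", "recovered"))  # True: new state
```

Because the state is part of the key, the recovery notification still goes through after the active one has been recorded.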

Severity routing

Not every alert needs to wake up on-call. A routing pattern:

Severity	Channel	Latency
Critical (SLO burn 14.4x, prod down)	PagerDuty	Page immediately
High (SLO burn 6x, partial outage)	Slack mention @sre	< 15 minutes
Medium (SLO burn 1x, anomaly)	Slack channel	< 1 hour
Low (warning, capacity)	Email digest	Daily

Implementation: create a separate connector for each channel. In the rule's action config, attach multiple actions with different groups:

"actions": [
  {
    "group": "default",
    "id": "<SLACK_INFO>",
    "params": {"message": "{{rule.name}}: anomaly"},
    "frequency": {"notify_when": "onThrottleInterval", "throttle": "1h"}
  },
  {
    "group": "critical",
    "id": "<PAGERDUTY_ID>",
    "params": {"severity": "critical"},
    "frequency": {"notify_when": "onActionGroupChange"}
  }
]

The rule type determines which action groups are available (default, recovered, critical, etc.). Burn rate rules have multiple action groups, one per window threshold.
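If routing is done in your own webhook receiver instead of through action groups, the severity table in this section reduces to a small lookup. The sketch assumes the burn rates in the table are lower bounds (>=), and the channel names are placeholders, not connector ids.

```python
# (minimum burn rate, severity, channel), checked from most to least severe.
# Treating the table's burn rates as ">=" thresholds is an assumption.
ROUTES = [
    (14.4, "critical", "pagerduty"),
    (6.0, "high", "slack-mention-sre"),
    (1.0, "medium", "slack-channel"),
]


def route(burn_rate):
    """Map an SLO burn rate to a (severity, channel) pair."""
    for threshold, severity, channel in ROUTES:
        if burn_rate >= threshold:
            return severity, channel
    return "low", "email-digest"  # everything below 1x goes to the daily digest


for rate in (20.0, 6.0, 2.5, 0.3):
    print(rate, route(rate))
```

Keeping the thresholds in one ordered list makes it easy to review them during the monthly audit.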

Audit and tune

After 2-4 weeks in production, audit your rules and remove the ones that add no value.

Metrics to measure

Trigger count: how many times the rule fired during the week
Acknowledge rate: in PagerDuty, the % of alerts acknowledged vs auto-resolved
Resolution time: how long the alert stayed active before recovering

Query the rule definitions through the Kibana saved object index:

GET .kibana_alerting_cases/_search
{
  "query": {
    "term": { "type": "alert" }
  },
  "_source": ["alert.name", "alert.params", "alert.actions"]
}

Combine this with the event log index .kibana-event-log-* to get the execution history.

Audit rules

Rule triggers > 100 times/week with ack < 10%: the alert is either too sensitive or not actionable. Consider:

Raising the threshold (a more sensitive rule means more noise)
Adding a time window (1m -> 5m sustained)
Lowering the severity (Critical -> Medium)
Disabling it if it is not actionable

Rule triggers > 50 times/week with auto-resolve > 80%: the alert is flapping. Fix with:

for N consecutive checks (Kibana 8.6+)
A longer time window
A wider threshold band

Rule triggers 0 times in 30 days: either it adds no value or its condition never matches reality. Disable it or revisit the query.
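The three audit rules above are mechanical enough to script once the numbers have been exported. The sketch below assumes you have already computed per-rule statistics yourself; the `RuleStats` fields and the finding labels are invented for this example and do not correspond to any Kibana or PagerDuty schema.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    name: str
    triggers_per_week: int
    ack_rate: float            # fraction of alerts acknowledged, 0.0 to 1.0
    auto_resolve_rate: float   # fraction of alerts that resolved on their own
    triggers_last_30d: int


def audit(stats):
    """Return the audit findings that apply to one rule."""
    findings = []
    if stats.triggers_last_30d == 0:
        findings.append("silent: disable or revisit the query")
    if stats.triggers_per_week > 100 and stats.ack_rate < 0.10:
        findings.append("noisy: too sensitive or not actionable")
    if stats.triggers_per_week > 50 and stats.auto_resolve_rate > 0.80:
        findings.append("flapping: add consecutive checks or widen the window")
    return findings


sample = RuleStats("cpu-high", triggers_per_week=240, ack_rate=0.04,
                   auto_resolve_rate=0.92, triggers_last_30d=960)
for finding in audit(sample):
    print(f"{sample.name}: {finding}")
```

A rule can match more than one finding, as the sample does: it is both noisy and flapping.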

Common mistakes

Case 1: PagerDuty incident “won't close”

The rule recovers after 30 minutes, but the PagerDuty incident stays open. Cause: the recovery action uses dedupKey {{rule.id}}-{{context.group}} while the trigger action uses {{rule.name}}-{{context.group}} (someone copy-pasted it wrong). PagerDuty sees two events with different keys → the resolve event never matches the open incident.

Fix: standardize the dedupKey format across both trigger and recovery; where possible, use a fixed string instead of a template that can drift.

Case 2: Throttle gets bypassed

A rule has throttle: 1h. A Kibana node runs out of memory and restarts. After the restart the throttle counter is reset, so the active alert sends a notification immediately even though one went out 5 minutes earlier.

Fix: accept this edge case, or use PagerDuty dedup as a fallback (PagerDuty does not reset its state when Kibana restarts).

Case 3: Action runs 4 times for 1 alert

The rule fires 1 alert but the channel receives 4 Slack messages. Debug:

Check whether the cluster runs multiple Kibana nodes.
One node may believe the rule has not run yet (clock skew) → it runs the rule again.
Or the rule has multiple actions attached to the same action group.

Fix: sync NTP across the cluster, or disable the extra Kibana nodes if you do not need HA.

Case 4: Email digest drops alerts

Setup: throttle: 1h for email. During hour X there are 10 active alerts, and you expect the digest to list all 10. In practice the email shows only a single alert.

Cause: the Kibana action context only holds a snapshot of the current alert (the most recent one), not an accumulated list. For a real digest, use webhook → backend aggregates → backend sends the digest by email.
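The aggregation step in that backend can be quite small. Here is a minimal in-memory sketch: `DigestBuffer` and its methods are hypothetical names, the hourly scheduler and the actual email sending are left out, and a real backend would persist the buffer so a restart does not lose alerts.

```python
class DigestBuffer:
    """Collect webhook alerts and emit one digest body per window."""

    def __init__(self):
        self._alerts = {}  # (rule id, alert id) -> latest message

    def add(self, rule_id, alert_id, message):
        # Repeated webhooks for the same alert overwrite the earlier entry,
        # so each alert appears once in the digest.
        self._alerts[(rule_id, alert_id)] = message

    def flush(self):
        """Return the digest text and clear the buffer; None if nothing arrived."""
        if not self._alerts:
            return None
        lines = [
            f"- {rule_id}/{alert_id}: {message}"
            for (rule_id, alert_id), message in sorted(self._alerts.items())
        ]
        self._alerts = {}
        return f"Alerts in this window: {len(lines)}\n" + "\n".join(lines)


buffer = DigestBuffer()
for n in range(10):
    buffer.add("error-rate", f"service-{n}", "error rate > 5%")
buffer.add("error-rate", "service-0", "error rate > 9%")  # update, not a new entry
print(buffer.flush().splitlines()[0])
```

Calling `flush` once per hour from a scheduler and mailing the returned text gives the 10-alert digest that a throttled email action alone does not produce.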

How I usually finalize a rule

Practice	How it helps
Set `notifyWhen: onActionGroupChange` by default	Avoids a notification on every check
Throttle per channel (Slack 10m, Email 1h)	Matches the frequency each channel can take
Stable dedupKey for PagerDuty	Ensures incidents are merged
Group alerts at the rule level	Avoids 30 parallel alerts
Severity routing	Critical -> page, Low -> digest
Audit trigger count + ack rate every month	Weeds out rules that add no value
`for N consecutive checks`	Avoids flapping
Maintenance window for planned changes	Suppresses alerts at the right time

Quick reference

Setting	Default	Recommended
`notifyWhen`	onActionGroupChange	onActionGroupChange
`throttle` at rule level	none	none (use action level)
Slack action throttle	onActionGroupChange	10m
PagerDuty action throttle	onActionGroupChange	1m
Email action throttle	onActionGroupChange	1h
Schedule	1m	1m (5m for slow rules)
Time window	5m	2 * ingest_lag + schedule
Consecutive checks	1	2-3 for flappy rules

Wrapping up

Alert fatigue is not a technical problem; it is a design problem. Kibana has the tools: notifyWhen, throttle, dedupKey, severity routing. What is usually missing is a disciplined monthly audit to weed out noisy rules. An #alerts channel with 50 messages/week where 80% get read is worth more than a channel with 5000 messages where 5% get read.

This is the last post of Part 3 of the series Kibana từ A đến Z (Kibana from A to Z). Next up is Part 4 on Security and Access Control: letting multiple teams share one cluster without stepping on each other, and configuring spaces, RBAC and audit logs for SOC2/ISO.