Slow Elasticsearch queries: profiler, slow log, and shard distribution

The “Error Overview” dashboard takes 30 seconds to open. A Discover query over 5 days of data takes 2 minutes. You check cluster health: green. Node CPU: 20%. RAM: plenty of headroom. So what is slow?

The answer is almost always in one of three places: an unoptimized query (a full scan instead of a filter), skewed shard distribution, or a segment merge backlog. This article shows how to use three core Elasticsearch tools to pinpoint the root cause: the profile API, the slow log, and cat shards.

Goals:

Enable the slow log for indexing and search
Read profile API output without getting overwhelmed
Understand the impact of shard count and size
Fix patterns for the top 5 slow queries
When to reindex, when to tune the query, when to scale

Part 1: Mental model

An ES query passes through 3 stages:

[Client]
    ↓ HTTP request
[Coordinating Node]   broadcasts the query to shards
    ↓
[Shards]              each shard searches locally, returns top N
    ↓
[Coordinating Node]   merges results, returns to client

Any of these stages can be the slow one:

Stage	Symptom	Cause
Coordinating	high total time, but each shard is fast	Merging many buckets, fetching large source
Shard search	some shards slow, others fast	Skewed shard size, hot shard, unmerged segments
Shard fetch	search fast, fetch slow	Large `_source`, many highlights

Knowing which stage is slow is 80% of the debugging work.

Part 2: Slow log

The slow log records queries that exceed a time threshold. Enable it first, debug later.

Enable the slow log per index

curl -X PUT "http://es:9200/app-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "index.search.slowlog.threshold.query.warn":  "10s",
    "index.search.slowlog.threshold.query.info":  "5s",
    "index.search.slowlog.threshold.query.debug": "2s",
    "index.search.slowlog.threshold.query.trace": "500ms",
    "index.search.slowlog.threshold.fetch.warn":  "1s",
    "index.search.slowlog.threshold.fetch.info":  "500ms",
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.threshold.index.info": "5s"
  }'

query is the search phase; fetch is the phase that loads the source documents. Separate thresholds matter because the root causes differ.

Reading the slow log

The slow log lives at:

/var/log/elasticsearch/<cluster-name>_index_search_slowlog.log
/var/log/elasticsearch/<cluster-name>_index_indexing_slowlog.log

A typical entry (a single line in the file, wrapped here):

[2026-05-17T03:21:45,123][INFO ][i.s.s.query] [es-node-1]
[app-logs-2026.05.17][2] took[3.2s], took_millis[3200],
total_hits[245013 hits], stats[], search_type[QUERY_THEN_FETCH],
total_shards[5], source[{"query":{"bool":...}}]

Key fields:

took: time spent in the query phase on that shard
total_hits: matched docs
source: the query body (truncated to the first 2000 characters)
[index][shard_id]: which shard ran the query

A common pattern: if the same query shows up on several shards with very different times, that is a sign of skewed shard sizes.
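The per-shard comparison can be sketched in Python. This is a minimal sketch, assuming entries follow the layout of the sample line shown earlier; the regular expression, the function name `slowest_by_shard`, and the two sample entries are illustrative and not part of Elasticsearch.

```python
import re
from collections import defaultdict

# Captures [index][shard] and took_millis[...] from a search slow log entry.
SLOWLOG_RE = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\]\s+took\[[^\]]*\],\s+"
    r"took_millis\[(?P<millis>\d+)\]"
)

def slowest_by_shard(lines):
    """Return {(index, shard_id): slowest took_millis} for parseable lines."""
    worst = defaultdict(int)
    for line in lines:
        match = SLOWLOG_RE.search(line)
        if match is None:
            continue  # not a search slow log entry, or a different layout
        key = (match.group("index"), int(match.group("shard")))
        worst[key] = max(worst[key], int(match.group("millis")))
    return dict(worst)

# Two made-up entries for the same query on different shards.
sample = [
    "[2026-05-17T03:21:45,123][INFO ][i.s.s.query] [es-node-1] "
    "[app-logs-2026.05.17][2] took[3.2s], took_millis[3200], total_hits[245013 hits]",
    "[2026-05-17T03:21:45,130][TRACE][i.s.s.query] [es-node-2] "
    "[app-logs-2026.05.17][0] took[600ms], took_millis[600], total_hits[51200 hits]",
]
worst = slowest_by_shard(sample)
```

A large gap between the slowest and fastest shard of the same index is the cue to look at shard sizes next.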

Part 3: Profile API

The slow log tells you “this query is slow”. The profile API tells you why.

How to enable it

Add "profile": true to the query body:

curl -X POST "http://es:9200/app-logs-*/_search" \
  -H 'Content-Type: application/json' \
  -d '{
    "profile": true,
    "query": {
      "bool": {
        "must": [
          {"match": {"Level": "Error"}},
          {"range": {"@timestamp": {"gte": "now-1d"}}}
        ]
      }
    },
    "aggs": {
      "by_app": {
        "terms": {"field": "Properties.ApplicationName.keyword"}
      }
    }
  }'

Reading the output

The output has this structure (abridged):

{
  "profile": {
    "shards": [
      {
        "id": "[node-1][app-logs-2026.05.17][0]",
        "searches": [{
          "query": [
            {
              "type": "BooleanQuery",
              "description": "+Level:Error +@timestamp:[...]",
              "time_in_nanos": 3210000000,
              "breakdown": {
                "score": 12000,
                "build_scorer": 800000000,
                "next_doc": 2400000000,
                "advance": 5000
              },
              "children": [...]
            }
          ],
          "aggregations": [
            {
              "type": "GlobalOrdinalsStringTermsAggregator",
              "description": "by_app",
              "time_in_nanos": 1500000000,
              "breakdown": {...}
            }
          ]
        }]
      }
    ]
  }
}

How to read it:

time_in_nanos of the top-level query: total time
high breakdown.next_doc: many documents scanned (ineffective filter)
high breakdown.build_scorer: complex query (wildcard, script)
high aggregations[].time_in_nanos: heavy aggregation (cardinality, scripted_metric)
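Those reading rules can be turned into a small helper. The sketch below assumes one query node taken from the profile output; the name `dominant_cost` is hypothetical, and the numbers are copied from the abridged output above.

```python
def dominant_cost(query_node):
    """Return (breakdown_key, share_of_time) for the largest timing entry."""
    breakdown = query_node["breakdown"]
    # Real output also carries *_count counters next to the timings; skip them.
    timings = {k: v for k, v in breakdown.items() if not k.endswith("_count")}
    key = max(timings, key=timings.get)
    total = sum(timings.values())
    share = timings[key] / total if total else 0.0
    return key, share

node = {
    "type": "BooleanQuery",
    "time_in_nanos": 3210000000,
    "breakdown": {
        "score": 12000,
        "build_scorer": 800000000,
        "next_doc": 2400000000,
        "advance": 5000,
    },
}
key, share = dominant_cost(node)
```

Here next_doc takes roughly three quarters of the time, which points at a filter that lets too many documents through.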

Practical tip

Profile API output is huge. A useful pattern:

curl ... > profile.json
jq '.profile.shards[] | {id, took: .searches[0].query[0].time_in_nanos}' profile.json

Compare took across shards. If one shard takes 5s while four take 200ms, you have a hot shard.
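The same comparison can be done in Python on the saved profile response. This is a sketch under assumptions: the helper name `hot_shards` and the "5x the median" cutoff are illustrative choices, not an Elasticsearch rule, and the sample response is made up to match the 5s versus 200ms case.

```python
from statistics import median

def hot_shards(profile_response, factor=5.0):
    """Return ids of shards whose query time exceeds factor x the median."""
    times = {}
    for shard in profile_response["profile"]["shards"]:
        queries = shard["searches"][0]["query"]
        times[shard["id"]] = sum(q["time_in_nanos"] for q in queries)
    if not times:
        return []
    typical = median(times.values())
    return sorted(sid for sid, t in times.items() if t > factor * typical)

def _shard(shard_id, nanos):
    # Build the minimal slice of a profile entry that hot_shards reads.
    return {"id": shard_id, "searches": [{"query": [{"time_in_nanos": nanos}]}]}

response = {"profile": {"shards": [
    _shard("[node-1][app-logs-2026.05.17][0]", 5_000_000_000),
    _shard("[node-2][app-logs-2026.05.17][1]", 200_000_000),
    _shard("[node-3][app-logs-2026.05.17][2]", 200_000_000),
    _shard("[node-1][app-logs-2026.05.17][3]", 200_000_000),
    _shard("[node-2][app-logs-2026.05.17][4]", 200_000_000),
]}}
hot = hot_shards(response)
```

Using the median keeps a single outlier from shifting the baseline it is compared against.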

Part 4: Hot shards and distribution

Check shard size

curl -sS "http://es:9200/_cat/shards/app-logs-*?v&s=store:desc" | head -20

index                shard prirep state    docs   store
app-logs-2026.05.17  0     p      STARTED  2.3m   3.2gb
app-logs-2026.05.17  1     p      STARTED  450k   620mb
app-logs-2026.05.17  2     p      STARTED  480k   640mb

Shard 0 is about 5 times larger than the others, which is a severe skew. Common causes:

Non-random routing key: all data for one customer ID hashes to the same shard
Misconfigured custom routing
Hash function collision (rare)

Fix: reindex with a new routing key, or remove custom routing.
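To check skew across many indices at once, the ratio can be computed from the JSON form of the same endpoint (`_cat/shards?format=json&bytes=b`). This is a minimal sketch: the function name `shard_skew` is hypothetical and the sample rows are rounded stand-ins for the sizes listed above.

```python
def shard_skew(rows):
    """Map index -> (largest primary size / smallest primary size)."""
    sizes = {}
    for row in rows:
        # Unassigned shards report no store size; ignore them and replicas.
        if row.get("prirep") == "p" and row.get("store") is not None:
            sizes.setdefault(row["index"], []).append(int(row["store"]))
    return {
        index: max(values) / min(values)
        for index, values in sizes.items()
        if min(values) > 0
    }

rows = [
    {"index": "app-logs-2026.05.17", "shard": "0", "prirep": "p", "store": "3200000000"},
    {"index": "app-logs-2026.05.17", "shard": "1", "prirep": "p", "store": "620000000"},
    {"index": "app-logs-2026.05.17", "shard": "2", "prirep": "p", "store": "640000000"},
    {"index": "app-logs-2026.05.17", "shard": "0", "prirep": "r", "store": "3200000000"},
]
skew = shard_skew(rows)
```

A ratio close to 1 means even distribution; a ratio around 5, as here, matches the skew described above.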

Check shard count

ES best practice: keep primary shards at 10-50 GB. Too small means a lot of overhead. Too large means slow recovery and heavy searches.

curl -sS "http://es:9200/_cat/indices?v&s=store.size:desc" | head -10

health index                pri rep docs.count store.size
green  app-logs-2026.05.17    5   1  3.2m       3.2gb
green  app-logs-2026.05.16    5   1  2.8m       2.9gb

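A 3 GB index split into 5 shards is about 600 MB per shard, and even less per primary because store.size also counts replicas (pri.store.size shows primaries only). Far too small. This is common in newly set up teams: a template with pri: 5 copied from ES 5.x and never recalculated.

Fix: use the _shrink API to merge shards. The source index must be write-blocked (index.blocks.write: true) and a copy of every shard must sit on a single node, so the cluster needs enough capacity: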

curl -X POST "http://es:9200/app-logs-2026.05.17/_shrink/app-logs-2026.05.17-shrunk" \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {
      "index.number_of_shards": 1,
      "index.number_of_replicas": 1
    }
  }'

For indices yet to be created: edit the index template and set number_of_shards: 1 for daily indices.
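The sizing arithmetic can be written down as a small Python sketch. The 30 GB target is an assumption picked from the middle of the 10-50 GB range, and both function names are hypothetical. The second helper reflects the _shrink constraint that the target shard count must be a factor of the source count.

```python
import math

def suggested_primaries(primary_store_gb, target_shard_gb=30.0):
    """Primary shard count that keeps each shard near target_shard_gb."""
    return max(1, math.ceil(primary_store_gb / target_shard_gb))

def valid_shrink_targets(current_primaries):
    """Shard counts _shrink can produce: proper factors of the current count."""
    return [n for n in range(1, current_primaries) if current_primaries % n == 0]

daily_index = suggested_primaries(3.2)      # the ~3 GB daily index above
larger_index = suggested_primaries(120.0)   # a hypothetical 120 GB index
shrink_from_5 = valid_shrink_targets(5)
shrink_from_8 = valid_shrink_targets(8)
```

With 5 primaries the only shrink target is 1, which is why the request above goes straight to a single shard.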

Part 5: Segment merge

ES stores data in segments, which are immutable Lucene files. Each refresh writes the newly indexed documents into a new segment. A background merge process combines small segments into larger ones so queries stay fast.

Check segments

curl -sS "http://es:9200/_cat/segments/app-logs-*?v&s=size:desc" | head -20

index                shard segment size       size.memory committed searchable
app-logs-2026.05.17  0     _4k   524mb      4.2mb       true      true
app-logs-2026.05.17  0     _3a   89mb       1.1mb       true      true
app-logs-2026.05.17  0     _45   2.1mb      0.3mb       true      true

An index with 200+ small segments under 5 MB has not finished merging. Search is slow because every segment has to be searched separately.

Fix: force merge (read-only indices only):

curl -X POST "http://es:9200/app-logs-2026.05.10/_forcemerge?max_num_segments=1"

Do not force merge an index that is still being written to (a hot index). Force merge warm/cold indices once no new data arrives.
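Counting small segments by eye does not scale past a few shards. The sketch below assumes rows from `_cat/segments?format=json&bytes=b`; the helper name `small_segment_counts` is hypothetical, and the 5 MB cutoff comes from the rule of thumb above.

```python
from collections import Counter

def small_segment_counts(rows, threshold_bytes=5 * 1024 * 1024):
    """Count segments below threshold_bytes per (index, shard), primaries only."""
    counts = Counter()
    for row in rows:
        if row.get("prirep", "p") != "p":
            continue  # replicas hold copies of the same data; skip them
        if int(row["size"]) < threshold_bytes:
            counts[(row["index"], row["shard"])] += 1
    return dict(counts)

# Sizes in bytes, rounded from the listing above (524mb, 89mb, 2.1mb).
rows = [
    {"index": "app-logs-2026.05.17", "shard": "0", "prirep": "p", "segment": "_4k", "size": "549453824"},
    {"index": "app-logs-2026.05.17", "shard": "0", "prirep": "p", "segment": "_3a", "size": "93323264"},
    {"index": "app-logs-2026.05.17", "shard": "0", "prirep": "p", "segment": "_45", "size": "2202009"},
]
small = small_segment_counts(rows)
```

A shard that reports hundreds of small segments on an index that no longer receives writes is a candidate for the force merge shown above.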

Refresh interval

A refresh creates a new segment every 1 second (the default). Under heavy ingest this becomes a bottleneck.

Raise it to 30s for ingest-heavy indices:

curl -X PUT "http://es:9200/app-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.refresh_interval": "30s"}'

Trade-off: data shows up in search up to 30s later. Fine for logs, not fine for transactional data.

Part 6: Top 5 slow query patterns

Pattern 1: Leading wildcard (`*foo`)

{"query": {"wildcard": {"message": "*timeout"}}}

Lucene organizes its term dictionary by prefix, so a leading wildcard means scanning the whole term dictionary. Slow on large indices.

Fix: add a message.reverse field (an analyzer with a reverse filter) or use an ngram analyzer.
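The reverse-field trick works because a suffix of a string is a prefix of the reversed string. Below is a minimal Python sketch of the query rewrite, assuming a `message.reverse` sub-field already stores reversed terms; the helper name `suffix_as_prefix_query` is hypothetical.

```python
def suffix_as_prefix_query(suffix, field="message.reverse"):
    """Rewrite a '*suffix' search as a prefix query on a reversed field."""
    return {"query": {"prefix": {field: suffix[::-1]}}}

# "*timeout" on message becomes a prefix lookup on the reversed terms.
body = suffix_as_prefix_query("timeout")

# Sanity check of the idea on a plain string.
term = "read_timeout"
matches = term[::-1].startswith("timeout"[::-1])
```

A prefix lookup can seek straight into the sorted term dictionary instead of walking all of it.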

Pattern 2: High-cardinality aggregation

{
  "aggs": {
    "users": {"terms": {"field": "user_id", "size": 10000}}
  }
}

With a field cardinality of 10 million, size: 10000 means building a large dictionary of buckets, which costs memory and time.

Fix: use a composite aggregation with pagination instead of a large size.
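A sketch of that pagination loop in Python follows. It assumes a `search` callable that takes a request body and returns the parsed response; that callable, the aggregation name `by_field`, and `fake_search` (an in-memory stand-in so the loop can run without a cluster) are all hypothetical.

```python
def composite_pages(search, field, page_size=1000, agg_name="by_field"):
    """Yield composite aggregation buckets, following after_key page by page."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{field: {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        if not result["buckets"]:
            return
        yield from result["buckets"]
        after = result.get("after_key")
        if after is None:
            return

def fake_search(body):
    """Stand-in for a real client call, serving five user ids from memory."""
    composite = body["aggs"]["by_field"]["composite"]
    keys = ["u1", "u2", "u3", "u4", "u5"]
    start = 0
    if "after" in composite:
        start = keys.index(composite["after"]["user_id"]) + 1
    page = keys[start:start + composite["size"]]
    agg = {"buckets": [{"key": {"user_id": k}, "doc_count": 1} for k in page]}
    if page:
        agg["after_key"] = {"user_id": page[-1]}
    return {"aggregations": {"by_field": agg}}

user_ids = [b["key"]["user_id"] for b in composite_pages(fake_search, "user_id", page_size=2)]
```

Each request carries only one page of buckets, so memory stays bounded regardless of the field's cardinality.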

Pattern 3: Range on a field not mapped as date

{"range": {"created_at": {"gte": 1715000000000}}}

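If created_at is mapped as keyword or text instead of date, the range runs over the term dictionary in lexicographic order rather than the BKD tree that date and numeric fields use, and Elasticsearch treats it as an expensive query. A long mapping is still BKD-indexed, but it gives up date math and date formats.

Fix: reindex with the correct mapping, "type": "date".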

Pattern 4: Sort on `_score` with many docs

{"sort": [{"_score": "desc"}], "size": 1000}

ES computes a score for every matching doc and then sorts. Heavy.

Fix: if you do not need relevance, use "sort": [{"@timestamp": "desc"}]. Sorting on a field with doc values is much faster.

Pattern 5: Large `_source`

{"query": {...}, "size": 1000, "_source": true}

At 50 KB of JSON per doc, fetching 1000 docs means 50 MB to transfer and deserialize.

Fix: fetch only the fields you need:

{"_source": ["timestamp", "level", "message"]}

Or use docvalue_fields for simple fields.
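The payload arithmetic and the trimmed request can be sketched in Python. Both helper names are hypothetical, and the estimate uses 1 MB = 1000 KB for simplicity.

```python
import copy

def fetch_payload_mb(avg_doc_kb, size):
    """Rough fetch-phase payload in MB (1 MB = 1000 KB)."""
    return avg_doc_kb * size / 1000

def with_source_fields(body, fields):
    """Return a copy of a search body that fetches only the given fields."""
    trimmed = copy.deepcopy(body)
    trimmed["_source"] = list(fields)
    return trimmed

full = fetch_payload_mb(50, 1000)  # the 50 KB x 1000 docs case above
original = {"size": 1000, "_source": True}
trimmed = with_source_fields(original, ["timestamp", "level", "message"])
```

Copying the body leaves the original request untouched, which makes it easy to time both variants side by side.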

Part 7: Real-world pitfalls

Pitfall 1: ES|QL silently generates a complex query

A developer writes this ES|QL:

FROM app-logs-*
| WHERE Level == "Error" OR Level == "Fatal"
| STATS count() BY Properties.ApplicationName, Properties.Endpoint
| LIMIT 1000

This query builds a two-dimensional aggregation. When Properties.Endpoint has high cardinality (URLs with query strings), the aggregation balloons to millions of buckets and the node runs out of memory (OOM).

Fix: limit cardinality, or filter before STATS.
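The bucket explosion is simple multiplication. The sketch below gives an upper bound in Python; the function name `max_buckets` and the cardinalities (40 applications, 250,000 distinct endpoints) are made-up numbers for illustration.

```python
import math

def max_buckets(cardinalities, matching_docs):
    """Upper bound on buckets when grouping by several fields at once."""
    # There cannot be more distinct combinations than matching documents.
    return min(math.prod(cardinalities), matching_docs)

unfiltered = max_buckets([40, 250_000], matching_docs=50_000_000)
filtered = max_buckets([40, 250_000], matching_docs=5_000_000)
normalized = max_buckets([40, 300], matching_docs=50_000_000)
```

Stripping query strings so that endpoints collapse to a few hundred route templates shrinks the bound far more than filtering documents does.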

Pitfall 2: Painless script in a query

A script evaluated at query time:

{"script": {"source": "doc['size'].value > 1024"}}

Running a script for every document is 10-50x slower than a precomputed field.

Fix: bake the field in at index time with an ingest pipeline instead of computing it at query time.

Pitfall 3: Too much coordinating overhead

A 12-node cluster where every query fans out to 1000 shards. The coordinating node turns a small query into a big job.

Fix:

Reduce the shard count
Use pre_filter_shard_size to skip shards with no data in the time range
Pin queries to specific shards with the routing param
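The fan-out is easy to estimate before touching the cluster. This is a minimal Python sketch; the function name `shard_fanout` and the figure of 200 daily indices are assumptions chosen to reproduce the 1000-shard case.

```python
def shard_fanout(index_count, primaries_per_index):
    """Shard-level requests one search sends when it targets every index."""
    # A search visits one copy (primary or replica) of each shard.
    return index_count * primaries_per_index

before = shard_fanout(200, 5)  # 200 daily indices with 5 primaries each
after = shard_fanout(200, 1)   # the same indices with 1 primary each
```

Moving daily indices to a single primary cuts the fan-out by the same factor as the shard count.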

Part 8: Diagnostic workflow

When you get a “Kibana is slow” complaint:

1. Reproduce in Discover or a Dashboard.
   ↓
2. Note the exact query, time range, and duration.
   ↓
3. Check whether the slow log has a matching entry.
   If not -> the slow log is not enabled; enable it and wait for a reproduction
   ↓
4. Run the profile API with the same query directly against ES.
   ↓
5. Identify the bottleneck:
   - high next_doc -> weak filter, add filters
   - high build_scorer -> complex query, simplify
   - high aggregation time -> heavy agg, reduce cardinality
   ↓
6. Check shard distribution. If skewed -> reindex.
   ↓
7. Check segment count. If high -> force merge (read-only index).
   ↓
8. Apply the fix, measure again, document it in the runbook.

Measuring before and after the fix matters most. Do not “fix” anything without a baseline. Some “fixes” make things slower.

Quick checklist

Task	Endpoint / Command
Enable slow log	`PUT /<idx>/_settings` with `threshold.query.*`
Profile query	`"profile": true` in the body
Cluster health	`GET /_cluster/health?pretty`
Shard size	`GET /_cat/shards?v&s=store:desc`
Segment count	`GET /_cat/segments?v`
Hot threads	`GET /_nodes/hot_threads`
Force merge	`POST /<idx>/_forcemerge?max_num_segments=1`
Allocation explain	`GET /_cluster/allocation/explain`
Index recovery	`GET /_recovery`
Pending tasks	`GET /_cluster/pending_tasks`

Symptom	Likely cause
Query slow on every shard	Unoptimized query
Query slow on 1-2 shards	Hot shard, skewed size
Search fast, fetch slow	Large `_source`, many highlights
Aggregation slow	High cardinality, many buckets
Sort slow	Sort on _score, or on a field without doc values

Wrapping up

A slow Elasticsearch query is never an abstract problem. The slow log tells you which query, the profile API tells you why, and cat shards tells you where the data lives. With all three tools you have what you need.

The next part of the Kibana from A to Z series tackles a tenser situation: disk full and shard imbalance. When the cluster starts rejecting writes, some shards are unassigned, and you have 30 minutes left before the business is affected, what you need most is a clear runbook.