Snapshot & Restore: backing up ES to S3, disaster recovery

Anyone who has run ELK in production has a story about a cluster dying: data corruption, nodes crashing together, an operator deleting the wrong index, or a whole set of EBS volumes lost in an AZ outage. The only question that matters at that moment is “How long will the restore take?”. Without a tested snapshot, the answer is “we don't know, and it may not be restorable at all”, and the ELK cluster turns into a write-only museum.

This is part 17 of the series Kibana from A to Z. The practical goal: don't just have snapshots, know how long they take to restore.

Main sections:

Setting up a production-grade S3 snapshot repository.
Snapshot Lifecycle Management (SLM) running automatically every hour or day.
Restoring a single index and a full cluster.
Running a DR drill and measuring RPO/RTO.
Avoiding version incompatibility and wrong-region buckets.

What a snapshot is, and what it is not

It is: an incremental copy of shards and metadata to external storage. The first snapshot costs the full size; later ones cost only the delta (new segments only).

It is not: a JSON dump. A snapshot is not a re-index. A restore copies the shard files back to the ES nodes and reopens them.
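
The incremental behaviour can be pictured as a set difference over segment files. The sketch below is a deliberately simplified model, not how Elasticsearch implements it internally; the function name and the segment file names are hypothetical.

```python
def files_to_upload(repo_files: set[str], shard_files: set[str]) -> set[str]:
    """Return the segment files present in the shard but not yet in the repository."""
    return shard_files - repo_files


# First snapshot: the repository is empty, so every file is copied.
first = files_to_upload(set(), {"_0.cfs", "_1.cfs"})

# Second snapshot: only the newly written segment is copied.
second = files_to_upload({"_0.cfs", "_1.cfs"}, {"_0.cfs", "_1.cfs", "_2.cfs"})

print(sorted(first), sorted(second))
```

Because segments are immutable once written, comparing file sets is enough to capture the idea in this model.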

Compared with elasticdump (an npm tool that dumps JSON):

	Snapshot	elasticdump
Speed	Fast, copies segments	Slow, one document at a time
Incremental	Yes	No
Versioning	Snapshot ID	File timestamp
Cross-version	Previous major version (N-1)	Possible mapping issues
Production-ready	Yes	Small migrations only

Snapshots are the standard. Use elasticdump only for small migrations between clusters whose versions are too far apart.

Installing the S3 repository plugin

From ES 8.0 on, repository-s3 ships with the distribution as a built-in module; on 7.x it is a separately installed plugin. Verify:

GET _nodes/plugins
# Look for "repository-s3" (under modules on 8.x, under plugins on 7.x)

If it is missing (typically on 7.x), install the plugin on EVERY node:

bin/elasticsearch-plugin install repository-s3
# Restart node

Note: Elastic Cloud and the official Docker image include it. Self-managed Linux installs may lack it, especially minimal builds.

AWS IAM for snapshots

Create an IAM policy for the bucket es-snapshots-prod:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::es-snapshots-prod"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::es-snapshots-prod/*"
    }
  ]
}

Attach the policy to one of:

An IAM role attached to the EC2 instance (if running on EC2), no access key needed.
An IRSA role if running on EKS.
An IAM user with an access key (last resort for on-prem or lab).

Recommendation: turn object versioning ON, and use object lock in COMPLIANCE mode for 30 days on audit snapshots (part 15). Do not use object lock for data snapshots, because they may need to be deleted during cleanup.

If you do use an access key, store it in the ES keystore on each node:

bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key
# Reload across the cluster
POST _nodes/reload_secure_settings

Registering the repository

PUT _snapshot/s3-prod
{
  "type": "s3",
  "settings": {
    "bucket": "es-snapshots-prod",
    "region": "ap-southeast-1",
    "base_path": "cluster-a/",
    "compress": true,
    "server_side_encryption": true,
    "storage_class": "intelligent_tiering"
  }
}

Settings explained:

bucket: the S3 bucket name.
region: the bucket's region. A wrong region produces a 403 error with a confusing message.
base_path: a prefix inside the bucket. It lets one bucket hold snapshots from several clusters.
compress: true: compresses metadata. Shard data is already compressed internally, so compress mainly helps metadata.
server_side_encryption: SSE-S3 (AES256). canned_acl controls object ACLs, not encryption; for SSE-KMS the usual route is default encryption configured on the bucket itself.
storage_class: use intelligent_tiering for large clusters. The default is standard.

Verify:

POST _snapshot/s3-prod/_verify

The response lists which nodes can access the repository. An AccessDenied error usually means the IAM policy is missing s3:ListBucket.

First manual snapshot

PUT _snapshot/s3-prod/snapshot-2026-05-17
{
  "indices": "app-logs-*,kibana-*",
  "ignore_unavailable": true,
  "include_global_state": true
}

The include_global_state field:

true: also snapshots cluster settings, ILM policies, index templates, users, and roles.
false: snapshots index data only.

For disaster recovery: turn it on. For test or migration snapshots: turn it off.

Track progress:

GET _snapshot/s3-prod/snapshot-2026-05-17/_status

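The stats.total.size_in_bytes field gives the total size of the files the snapshot references; stats.incremental.size_in_bytes is the part actually copied in this run. The first run is expensive; later runs drop sharply because they are incremental.

List the snapshots in the repo: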
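
To see how incremental a given run was, the two size fields can be compared. This is a minimal sketch that assumes the _status response has already been parsed into a dict and carries stats.incremental next to stats.total; the function name and the sample numbers are made up for illustration.

```python
def incremental_ratio(status: dict) -> float:
    """Fraction of the snapshot's total bytes that this run actually had to copy."""
    stats = status["snapshots"][0]["stats"]
    total = stats["total"]["size_in_bytes"]
    incremental = stats["incremental"]["size_in_bytes"]
    return incremental / total if total else 0.0


# Hand-written sample shaped like a _status response (illustrative numbers only).
sample_status = {
    "snapshots": [
        {
            "snapshot": "snapshot-2026-05-17",
            "stats": {
                "incremental": {"file_count": 12, "size_in_bytes": 50},
                "total": {"file_count": 240, "size_in_bytes": 1000},
            },
        }
    ]
}

print(f"{incremental_ratio(sample_status):.0%} of the data was copied in this run")
```

A ratio close to 100% on anything but the first snapshot is worth investigating, for example heavy merging or reindexing between runs.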

GET _snapshot/s3-prod/_all

Snapshot Lifecycle Management (SLM)

Don't use cron. Use SLM: it is native to ES and comes with retries and retention.

PUT _slm/policy/daily-app-logs
{
  "name": "<daily-app-logs-{now/d}>",
  "schedule": "0 30 1 * * ?",
  "repository": "s3-prod",
  "config": {
    "indices": ["app-logs-*"],
    "ignore_unavailable": false,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 7,
    "max_count": 60
  }
}

Explained:

name: a template with date math. {now/d} resolves to the current UTC date.
schedule: cron format with 6 fields (second, minute, hour, day of month, month, day of week). 0 30 1 * * ? = 01:30 every day.
retention: keep between 7 and 60 snapshots and delete those older than 30 days. At least 7 are always kept, even when expired.
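
How expire_after, min_count, and max_count interact is easier to see in a small simulation. The sketch below is a simplified model of the rules described above, not Elasticsearch's actual retention code; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(taken_at, now, expire_after, min_count, max_count):
    """Return the snapshot timestamps a retention run would delete (simplified model)."""
    keep, delete = [], []
    for ts in sorted(taken_at, reverse=True):  # walk from newest to oldest
        expired = now - ts > expire_after
        over_max = len(keep) >= max_count
        protected = len(keep) < min_count      # min_count wins even over expiry
        if protected or not (expired or over_max):
            keep.append(ts)
        else:
            delete.append(ts)
    return delete


now = datetime(2026, 5, 17, 1, 30)
daily = [now - timedelta(days=i) for i in range(1, 41)]  # 40 daily snapshots
doomed = snapshots_to_delete(daily, now, timedelta(days=30), min_count=7, max_count=60)
print(len(doomed), "snapshots older than 30 days would be removed")
```

With the policy above, the ten snapshots older than 30 days are removed, while a repository holding only a handful of very old snapshots keeps them all because of min_count.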

Trigger it manually to test:

POST _slm/policy/daily-app-logs/_execute

Status:

GET _slm/policy/daily-app-logs
GET _slm/stats

Multi-policy pattern:

hourly-critical: hourly snapshots of customer-orders-*, 48h retention.
daily-all: daily snapshots of every index, 30d retention.
weekly-archive: weekly snapshots with include_global_state: true, 1 year retention.

RPO (Recovery Point Objective) then comes down to the interval of the most frequent snapshot covering the data.

Restore (the variants)

Variant 1: restoring a single index

POST _snapshot/s3-prod/snapshot-2026-05-17/_restore
{
  "indices": "app-logs-2026.04.01",
  "rename_pattern": "app-logs-(.+)",
  "rename_replacement": "restored-app-logs-$1",
  "include_global_state": false
}

Renaming avoids a conflict with the existing index. After verifying, switch the alias:

POST _aliases
{
  "actions": [
    { "remove": { "index": "app-logs-2026.04.01", "alias": "app-logs-readonly" } },
    { "add":    { "index": "restored-app-logs-2026.04.01", "alias": "app-logs-readonly" } }
  ]
}
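
Before running a restore with rename_pattern and rename_replacement, it can help to preview the resulting index names locally. The sketch below approximates the rename with Python's re module; it assumes Java-style "$1" group references in the replacement, and the helper name is hypothetical.

```python
import re


def restored_name(index: str, rename_pattern: str, rename_replacement: str) -> str:
    """Preview the index name a restore rename would produce."""
    # Elasticsearch takes Java-style "$1" references; Python's re module
    # expects "\g<1>", so translate the replacement string first.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return re.sub(rename_pattern, py_replacement, index)


print(restored_name("app-logs-2026.04.01", r"app-logs-(.+)", "restored-app-logs-$1"))
```

Indices that do not match the pattern keep their original name, which is exactly the case where a restore can collide with an existing index.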

Variant 2: full cluster restore

The DR case. A new cluster is bootstrapped from the old snapshot repository:

# 1. Register the repo on the new cluster
PUT _snapshot/s3-prod
{
  "type": "s3",
  "settings": { "bucket": "es-snapshots-prod", "region": "ap-southeast-1", "base_path": "cluster-a/", "readonly": true }
}

# 2. Restore everything
POST _snapshot/s3-prod/snapshot-2026-05-17/_restore
{
  "indices": "*",
  "include_global_state": true,
  "include_aliases": true
}

readonly: true matters when re-registering the repository on a new cluster. It prevents overwriting the old cluster's snapshots.

Variant 3: restoring into a cluster on a different version

ES can restore snapshots of indices created in the previous major version. Restoring a 7.x snapshot into 8.x works. Restoring 6.x into 8.x does not (two majors apart).

Workaround: spin up an intermediate cluster (for example 7.17), restore the 6.x snapshot there, reindex so the indices are created on 7.x, take a new snapshot, and restore that into 8.x. Expensive, but sometimes there is no other way.
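
The compatibility rule and the intermediate-cluster workaround can be written down as a tiny planning helper. This is a sketch of the simplified "same or previous major" rule stated above; it ignores minor-version constraints, and both function names are hypothetical.

```python
def can_restore(snapshot_major: int, target_major: int) -> bool:
    """True if the target is the same major version or exactly one major newer."""
    return 0 <= target_major - snapshot_major <= 1


def restore_hops(snapshot_major: int, target_major: int) -> list[int]:
    """Major versions to pass through, one hop at a time, ending at the target."""
    if target_major < snapshot_major:
        raise ValueError("cannot restore into an older major version")
    return list(range(snapshot_major + 1, target_major + 1))


print(can_restore(7, 8), can_restore(6, 8), restore_hops(6, 8))
```

Each hop in the list stands for one intermediate cluster plus a reindex and a fresh snapshot, which is where the cost of the workaround comes from.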

DR testing (a mandatory exercise)

An untested snapshot is no snapshot. Recommended DR drill cadence: every quarter.

Drill procedure

Spin up a new cluster (3 nodes, same version, separate EBS volumes).
Register the S3 repository with readonly: true.
Restore the previous day's snapshot.
Measure timing:
- T0: restore starts.
- T1: cluster is green.
- T2: searches return correct results.
- T2 minus T0 = actual RTO.
Compare sample data: SHA-256 hashes of 100 random documents on the original and the restored cluster. They must match.
Record the results in the compliance runbook.
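
The timing and hash-comparison steps of the drill can be scripted. The sketch below assumes the sampled documents have already been fetched from both clusters into dicts keyed by document ID; fetching them is out of scope here, and all function names are hypothetical.

```python
import hashlib
import json
from datetime import datetime


def doc_digest(source: dict) -> str:
    """SHA-256 of a document, serialized with sorted keys so field order does not matter."""
    canonical = json.dumps(source, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mismatched_ids(original: dict[str, dict], restored: dict[str, dict]) -> list[str]:
    """IDs of sampled documents that are missing or different on the restored cluster."""
    return sorted(
        doc_id
        for doc_id, source in original.items()
        if doc_id not in restored or doc_digest(source) != doc_digest(restored[doc_id])
    )


def rto_minutes(t0: datetime, t2: datetime) -> float:
    """Actual RTO: from restore start (T0) to searches returning correct results (T2)."""
    return (t2 - t0).total_seconds() / 60


original = {"a": {"msg": "ok", "code": 200}, "b": {"msg": "fail", "code": 500}}
restored = {"a": {"code": 200, "msg": "ok"}, "b": {"msg": "fail", "code": 503}}
print(mismatched_ids(original, restored))
print(rto_minutes(datetime(2026, 5, 18, 9, 0), datetime(2026, 5, 18, 9, 47)))
```

Hashing a canonical serialization rather than the raw response avoids false alarms from key ordering, while still catching any changed value.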

Drills typically reveal:

ILM policies not restored because of include_global_state: false.
Old API keys not working (the snapshot does not include api_key).
Kibana saved objects not restored unless the .kibana_* indices were snapshotted.

Fix these gaps before a real DR event, not in the middle of an emergency.

RPO/RTO by strategy

Strategy	RPO	RTO	Cost	When to use
SLM every 24h	24h	30-120 minutes	Low	Non-critical logs
SLM every 1h	1h	30-120 minutes	Medium	Most production
SLM every 15 minutes plus cross-cluster replication	under 1 minute	under 5 minutes	High	Tier-1 services
Searchable snapshot on S3 (cold)	n/a	minute-level, lazy	Low	Compliance archive
Hot standby cluster (CCR)	under 5s	seconds	Very high	Mission-critical

CCR (Cross-Cluster Replication) is a commercially licensed feature (Platinum tier and above) that replicates in near real time from cluster A to cluster B. The pattern: snapshots are the backup, CCR is the DR site. They do not replace each other; they complement each other.

Common pitfalls

Bucket in a different region from the cluster

The cluster is in ap-southeast-1, but the bucket was created in us-east-1 by mistake. Snapshots still run, but they cost 5-10x more in bandwidth and incur cross-region charges. A team I once worked with only noticed after their first monthly AWS bill jumped 4x. Fix: create the bucket in the same region.

Snapshot includes `.security-7` and the restore clobbers users

The .security-7 index holds users, roles, and role mappings. Restoring into a new cluster with include_global_state: true overwrites the new cluster's security configuration. If the new cluster's admin password differs from the old one, the new admin password suddenly stops working (only the old admin password works).

Fix: before restoring, set the new cluster's password to match the old cluster's, or restore without .security-7:

"indices": "*,-.security-*"

SLM silently stopped

POST _slm/stop was run during a maintenance session and nobody started it again. Two weeks later, there are no snapshots. Fix: alert when the operation_mode reported by GET _slm/status is not "RUNNING". A saved search in Kibana over Stack Monitoring data can help spot it.
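
The alert condition itself is a one-line check once the status response is available. This sketch assumes the body of GET _slm/status has been fetched and parsed into a dict by whatever monitoring job you already run; the function name and message wording are hypothetical.

```python
from typing import Optional


def slm_alert(status: dict) -> Optional[str]:
    """Return an alert message when SLM is not running, otherwise None."""
    mode = status.get("operation_mode", "UNKNOWN")
    if mode == "RUNNING":
        return None
    return f"SLM operation_mode is {mode}; scheduled snapshots are not being taken"


print(slm_alert({"operation_mode": "RUNNING"}))
print(slm_alert({"operation_mode": "STOPPED"}))
```

Treating a missing field as UNKNOWN means a malformed or empty response also raises the alert instead of passing silently.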

Snapshot does not include Kibana saved objects

Kibana saved objects live in the .kibana_* indices. A snapshot of app-logs-* does not include them. After the restore, every dashboard is gone. Fix: snapshot with indices: "app-logs-*,.kibana*", or separately export dashboards as NDJSON into git.

Wrong storage class

S3 Glacier Deep Archive looks attractive for long-term archives (about 1/10 the price), but retrieval from Glacier takes many hours, and the Elasticsearch S3 repository documentation warns against transitioning repository objects to Glacier classes. If an S3 Lifecycle Rule quietly moves snapshots to Glacier after 30 days, restoring a 31-day-old snapshot can take 12 hours. Recommendation: Standard or Intelligent Tiering for snapshots, and Glacier only for a separate archive bucket.

Shared filesystem repository

The fs repository type (shared NFS) is convenient for a lab and dangerous in production. NFS locking, file consistency, and performance are all weak. Move to S3 (or Azure Blob, GCS) as soon as you go to production.

Periodic integrity verification

Every month, run:

POST _snapshot/s3-prod/_verify

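This verifies that every node can access the repository. If verification fails, restores will fail too.

A newer, deeper check (present in ES 8.x): the repository _analyze API, which writes test blobs and reads them back to check that the storage behaves correctly: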

POST _snapshot/s3-prod/_analyze?blob_count=100&max_blob_size=10mb

Run it off-peak; it generates a burst of S3 PUT and GET requests.

Quick reference

Task	How
Register repo	PUT `/_snapshot/<repo>`
Verify repo	POST `/_snapshot/<repo>/_verify`
Manual snapshot	PUT `/_snapshot/<repo>/<name>`
SLM policy	PUT `/_slm/policy/<name>`
Trigger SLM	POST `/_slm/policy/<name>/_execute`
SLM status	GET `/_slm/policy/<name>`
Restore 1 index	POST `/_snapshot/<repo>/<name>/_restore` with `indices` and `rename_*`
Full restore	POST with `indices: "*"` and `include_global_state: true`
Read-only repo	`settings.readonly: true`
Exclude security	`indices: "*,-.security-*"`
Test integrity	POST `/_snapshot/<repo>/_analyze`

Wrapping up

Snapshots are a core asset. SLM automates their creation, strict retention keeps the bucket from bloating, and quarterly drills expose the gaps. Don't wait until you lose a cluster to find out that restore doesn't work. Put a quarterly reminder in the calendar: “ES DR drill”. The first one takes a day; by the third, with practice, it takes 2 hours.

Part 18 moves on to another production failure: Kibana behind a reverse proxy. Nginx or Cloudflare gives you TLS, WAF, and multi-domain support, but it can also easily break the XSRF header and websockets if the configuration is incomplete.