Terraform and Kibana: managing saved objects, rules, and connectors as infrastructure

NDJSON + Git + CI/CD solves versioning for Kibana, but it is still an imperative approach: "import this file into that cluster". Terraform inverts the mindset: you declare the desired state and the provider handles synchronization. A managed object disappeared from the cluster? Run terraform apply and it is recreated. Someone edited it through the UI? terraform plan shows the drift, and the next apply pushes the canonical version back.

This article uses the elastic/elasticstack provider (the official one from Elastic) to manage Kibana resources. Not every resource type is supported yet, but the important ones are mature enough to run in production.

Goals for this article:

Set up the Elastic Stack Terraform provider
Manage data views, roles, spaces, alerting rules, and connectors in HCL
Module pattern and environment promotion
Migrate from the NDJSON workflow to Terraform
Avoid state drift and secret leaks

Part 1: Why Terraform instead of NDJSON

The NDJSON workflow works, but it has limits:

Imperative: you have to know "which state am I in, and what do I do to reach the other one". Lost a dashboard? You have to remember to re-import it.
No built-in drift detection: you have to script it yourself.
Hard to share across teams: anyone who needs "a new namespace with 5 dashboards + 3 alerts" has to copy NDJSON files by hand.
Manual state management: there is no record of how many stale versions are sitting in the cluster.

What Terraform brings:

Declarative: .tf files describe the desired state, and the provider works out the delta.
State file: a single source of truth that records which resources Terraform manages.
Plan before apply: you see the change before it is applied.
Module reuse: a "team-namespace" module can be instantiated many times with different variables.

Comparison table:

Aspect	NDJSON + CI/CD	Terraform
Readable diffs	Fair (JSON can be sorted)	Good (colored plan output)
Rollback	Re-import the old file	`git checkout` the old revision, then `terraform apply`
Drift detection	Script it yourself	Built in via `terraform plan`
Multi-env promotion	Custom scripts + secret swapping	Workspaces or tfvars
Resource coverage	Every saved object	A subset (growing)
Learning curve	Low	Medium
Locking & state collaboration	None	Yes (S3 + DynamoDB)

Terraform does not fully replace the NDJSON workflow in every case. A hybrid pattern works well: Terraform for the critical resources (data views, roles, spaces, alert rules, connectors), NDJSON for complex dashboards.

Part 2: Provider setup

versions.tf:

terraform {
  required_version = ">= 1.6.0"

  required_providers {
    elasticstack = {
      source  = "elastic/elasticstack"
      version = "~> 0.11"
    }
  }

  backend "s3" {
    bucket         = "infra-tfstate-prod"
    key            = "kibana/terraform.tfstate"
    region         = "ap-southeast-1"
    dynamodb_table = "tfstate-lock"
    encrypt        = true
  }
}

provider "elasticstack" {
  elasticsearch {
    endpoints = [var.es_endpoint]
    api_key   = var.es_api_key
  }
  kibana {
    endpoints = [var.kibana_endpoint]
    api_key   = var.kibana_api_key
  }
}

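The elasticsearch and kibana blocks are configured separately inside the same provider. Both authenticate with an API key, and the keys can differ: the Kibana key needs privileges to manage the Kibana objects in question (saved objects, rules, connectors), while the Elasticsearch key needs cluster-level manage privileges.

This setup assumes Elasticsearch and Kibana 8.x or later, where the Kibana APIs the provider relies on are stable.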

variables.tf:

variable "es_endpoint" {
  type = string
}

variable "es_api_key" {
  type      = string
  sensitive = true
}

variable "kibana_endpoint" {
  type = string
}

variable "kibana_api_key" {
  type      = string
  sensitive = true
}

variable "environment" {
  type    = string
  default = "dev"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be dev, staging, or prod."
  }
}
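
The non-secret inputs above can live in a per-environment tfvars file that is committed to Git. The sketch below is only an illustration: the file name matches the dev.tfvars used later in the article, but both hostnames are placeholders, not real endpoints.

```hcl
# dev.tfvars (hypothetical values; replace with your own endpoints)
es_endpoint     = "https://es-dev.example.internal:9200"
kibana_endpoint = "https://kibana-dev.example.internal:5601"
environment     = "dev"

# es_api_key and kibana_api_key are intentionally absent.
# Supply them through TF_VAR_es_api_key / TF_VAR_kibana_api_key
# so they never land in a committed file.
```

Because both key variables are marked sensitive, Terraform redacts them in plan output, but they are still stored in the state file, so the backend itself needs to stay encrypted and access-controlled.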

Part 3: Basic resources

Data view

resource "elasticstack_kibana_data_view" "app_logs" {
  data_view = {
    name            = "app-logs-${var.environment}"
    title           = "app-logs-${var.environment}-*"
    time_field_name = "@timestamp"
  }
}

title is the index pattern. On a freshly created cluster, Terraform creates this data view. If the data view is deleted through the UI, terraform plan reports "1 to add".
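
When several log sources follow the same naming scheme, one resource block with for_each may be enough. This is a minimal sketch: the source names in local.log_sources are made up for illustration, and the data_view attributes are the same three used above.

```hcl
# Hypothetical set of log sources; each one gets its own data view.
locals {
  log_sources = toset(["app", "nginx", "audit"])
}

resource "elasticstack_kibana_data_view" "logs" {
  for_each = local.log_sources

  data_view = {
    name            = "${each.key}-logs-${var.environment}"
    title           = "${each.key}-logs-${var.environment}-*"
    time_field_name = "@timestamp"
  }
}
```

Each instance gets its own state address, for example elasticstack_kibana_data_view.logs["nginx"], so removing one entry from the set destroys only that data view.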

Role

resource "elasticstack_elasticsearch_security_role" "logs_reader" {
  name = "logs_reader_${var.environment}"

  cluster = ["monitor"]

  indices {
    names      = ["app-logs-${var.environment}-*"]
    privileges = ["read", "view_index_metadata"]
  }

  applications {
    application = "kibana-.kibana"
    # Space-scoped base privileges are stored with a "space_" prefix
    privileges  = ["space_read"]
    resources   = ["space:default"]
  }
}

Keep a separate role per environment so that a dev role never has access to prod indices.

Space

resource "elasticstack_kibana_space" "platform_team" {
  space_id    = "platform-team"
  name        = "Platform Team"
  description = "Workspace for platform engineering team"
  initials    = "PT"
  color       = "#1B5E20"

  disabled_features = ["canvas", "ml"]
}

Give each sub-team its own space, and disable the features it does not need.

Alert rule

resource "elasticstack_kibana_alerting_rule" "error_burst" {
  name         = "error-burst-${var.environment}"
  consumer     = "alerts"
  rule_type_id = ".es-query"
  interval     = "1m" # check interval is a top-level string attribute

  # Rule parameters are sent as a JSON string; esQuery is itself a JSON string
  params = jsonencode({
    index               = ["app-logs-${var.environment}-*"]
    timeField           = "@timestamp"
    esQuery             = jsonencode({
      query = { match = { Level = "Error" } }
    })
    size                = 100
    threshold           = [10]
    thresholdComparator = ">"
    timeWindowSize      = 5
    timeWindowUnit      = "m"
  })

  # actions is a nested block, repeated once per action
  actions {
    group  = "query matched"
    id     = elasticstack_kibana_action_connector.slack.connector_id
    params = jsonencode({
      message = "{{context.hits}} errors in last 5 min on ${var.environment}"
    })
  }
}

Connector (Slack)

resource "elasticstack_kibana_action_connector" "slack" {
  name              = "slack-platform-alerts-${var.environment}"
  connector_type_id = ".slack"

  secrets = jsonencode({
    webhookUrl = var.slack_webhook_url
  })

  config = jsonencode({})
}

The alert rule references the connector through its connector_id attribute, so Terraform builds the dependency graph on its own: connector first, alert rule second.
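
The connector reads var.slack_webhook_url, which is not part of the variables.tf shown earlier. A declaration along these lines is assumed; the description text and the https check are illustrative additions, not something the provider requires.

```hcl
# Assumed declaration for the webhook referenced by the Slack connector.
variable "slack_webhook_url" {
  type        = string
  sensitive   = true
  description = "Incoming webhook URL for the Slack alert channel"

  validation {
    # startswith() is available from Terraform 1.3 onward
    condition     = startswith(var.slack_webhook_url, "https://")
    error_message = "The Slack webhook URL must use https."
  }
}
```

As with the API keys, the value would typically arrive through a TF_VAR_slack_webhook_url environment variable in CI rather than a tfvars file.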

Part 4: Module pattern

With 5 teams, each needing a space + 10 roles + 3 alert rules, copy-pasting HCL does not scale. The module pattern:

modules/team-workspace/main.tf:

variable "team_id" {
  type = string
}

variable "team_name" {
  type = string
}

variable "log_indices" {
  type = list(string)
}

variable "slack_webhook" {
  type      = string
  sensitive = true
}

resource "elasticstack_kibana_space" "this" {
  space_id = var.team_id
  name     = var.team_name
}

resource "elasticstack_elasticsearch_security_role" "reader" {
  name    = "${var.team_id}_reader"
  cluster = ["monitor"]
  indices {
    names      = var.log_indices
    privileges = ["read", "view_index_metadata"]
  }
  applications {
    application = "kibana-.kibana"
    # Space-scoped base privileges are stored with a "space_" prefix
    privileges  = ["space_read"]
    resources   = ["space:${var.team_id}"]
  }
}

resource "elasticstack_kibana_action_connector" "slack" {
  name              = "slack-${var.team_id}"
  connector_type_id = ".slack"
  secrets           = jsonencode({ webhookUrl = var.slack_webhook })
  config            = jsonencode({})
}

output "space_id" {
  value = elasticstack_kibana_space.this.space_id
}

output "reader_role" {
  value = elasticstack_elasticsearch_security_role.reader.name
}

environments/prod/main.tf:

module "team_platform" {
  source = "../../modules/team-workspace"

  team_id       = "platform"
  team_name     = "Platform Team"
  log_indices   = ["platform-logs-*"]
  slack_webhook = var.platform_slack_webhook
}

module "team_billing" {
  source = "../../modules/team-workspace"

  team_id       = "billing"
  team_name     = "Billing Team"
  log_indices   = ["billing-logs-*", "billing-audit-*"]
  slack_webhook = var.billing_slack_webhook
}

Adding a new team means adding one five-line module block, which is far less noise than the equivalent flat HCL.
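
Once the list of teams grows, the repeated module blocks can be collapsed into a single call driven by a map. The following is a sketch under assumptions: var.teams and var.team_slack_webhooks are hypothetical variables introduced here, and the module inputs are the four declared in modules/team-workspace above.

```hcl
# Hypothetical inputs: team metadata, plus a separate sensitive map of webhooks
# keyed by the same team IDs.
variable "teams" {
  type = map(object({
    name        = string
    log_indices = list(string)
  }))
}

variable "team_slack_webhooks" {
  type      = map(string)
  sensitive = true
}

module "team" {
  source   = "../../modules/team-workspace"
  for_each = var.teams # keys must not be sensitive, hence the separate webhook map

  team_id       = each.key
  team_name     = each.value.name
  log_indices   = each.value.log_indices
  slack_webhook = var.team_slack_webhooks[each.key]
}
```

Switching existing blocks such as module.team_platform to this form changes their state addresses, so a moved block per team (from module.team_platform to module.team["platform"]) is needed to avoid a destroy-and-recreate.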

Part 5: Multi-environment promotion

The terraform workspace pattern:

terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

terraform workspace select dev
terraform apply -var-file=dev.tfvars

terraform workspace select prod
terraform apply -var-file=prod.tfvars

Each workspace has its own state file. Instead of passing var.environment, the environment name can be derived from terraform.workspace:

locals {
  env = terraform.workspace
}
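
Since the environment name now comes straight from the workspace, planning from the default workspace would quietly produce resources suffixed with "default". One possible guard is a check block (Terraform 1.5+); the block name known_workspace is arbitrary.

```hcl
# Warn when someone plans from an unexpected workspace such as "default".
check "known_workspace" {
  assert {
    condition     = contains(["dev", "staging", "prod"], terraform.workspace)
    error_message = "Workspace must be dev, staging, or prod, got: ${terraform.workspace}."
  }
}
```

A failed check surfaces as a warning and does not block the apply. If a hard stop is preferred, the same condition can go into a lifecycle precondition on one of the resources instead.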

Alternatively, split into one folder per environment (many teams prefer this because a prod plan cannot leak into dev):

infra-kibana-tf/
├── modules/
│   └── team-workspace/
└── environments/
    ├── dev/
    │   ├── main.tf
    │   ├── backend.tf
    │   └── dev.tfvars
    ├── staging/
    │   └── ...
    └── prod/
        └── ...
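
In this layout each environment carries its own backend.tf, so state is isolated by key rather than by workspace. A minimal sketch for dev, reusing the bucket and lock table from versions.tf; the key path is an assumption about how you might lay out the bucket.

```hcl
# environments/dev/backend.tf (sketch; the key layout is an assumption)
terraform {
  backend "s3" {
    bucket         = "infra-tfstate-prod"
    key            = "kibana/dev/terraform.tfstate"
    region         = "ap-southeast-1"
    dynamodb_table = "tfstate-lock"
    encrypt        = true
  }
}
```

With this split, the backend block would be removed from the shared versions.tf, since a configuration can declare only one backend.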

CI pipeline:

deploy-staging:
  needs: validate
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: hashicorp/setup-terraform@v3
    - run: terraform init
      working-directory: environments/staging
    - run: terraform plan -out=tfplan
      working-directory: environments/staging
      env:
        TF_VAR_es_api_key: ${{ secrets.ES_API_KEY_STAGING }}
        TF_VAR_kibana_api_key: ${{ secrets.KIBANA_API_KEY_STAGING }}
    - run: terraform apply tfplan
      working-directory: environments/staging

Production should require manual approval through a GitHub environment.

Part 6: Migrating from NDJSON to Terraform

Resources already deployed through the NDJSON workflow do not need to be rebuilt from scratch. The import pattern:

# Step 1: list the rules that already exist on the cluster
curl -sS -H "Authorization: ApiKey ${KEY}" \
  "${KB_URL}/api/alerting/rules/_find" | jq

# Step 2: write the matching HCL block
cat > alert-error-burst.tf <<'EOF'
resource "elasticstack_kibana_alerting_rule" "error_burst" {
  name         = "error-burst"
  consumer     = "alerts"
  rule_type_id = ".es-query"
  # ... other fields
}
EOF

# Step 3: import into state (the import ID is <space-id>/<rule-id>)
terraform import elasticstack_kibana_alerting_rule.error_burst default/<rule-id>

# Step 4: run terraform plan and check for zero diff
terraform plan

If terraform plan reports a diff, the HCL does not yet match the real resource. Adjust the HCL and re-run terraform plan until it reports no changes; the import itself does not need repeating.
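
On Terraform 1.5 or later, the import step can also be written declaratively so that it goes through the normal plan and review cycle. This is a sketch only: the rule ID is left as a placeholder, and the "<space-id>/<rule-id>" format is an assumption to verify against the provider documentation for the resource you are importing.

```hcl
# Requires Terraform 1.5+. Lives next to the resource block it targets.
import {
  to = elasticstack_kibana_alerting_rule.error_burst
  id = "default/<rule-id>" # placeholder; assumed format is <space-id>/<rule-id>
}
```

The import then shows up in terraform plan alongside any remaining diff, and the block can be deleted once the apply has gone through. For resources that have no HCL yet, terraform plan -generate-config-out=generated.tf can draft a starting configuration from the imported object.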

A pattern I have seen work: migrate one resource group at a time, no big bang. Week 1: data views + roles. Week 2: connectors + alert rules. Week 3: spaces. Dashboards stay on the NDJSON workflow because provider support is not there yet.

Part 7: Real-world pitfalls

Pitfall 1: Secret rotation breaks state

When the Slack webhook is rotated, Terraform sees a diff on secrets (the secret cannot be read back from the cluster). apply can then recreate the connector, which changes its ID and orphans the alert rules that reference it.

Fix: use lifecycle.ignore_changes on the secret field:

resource "elasticstack_kibana_action_connector" "slack" {
  name              = "slack-alerts"
  connector_type_id = ".slack"
  secrets           = jsonencode({ webhookUrl = var.slack_webhook_url })
  config            = jsonencode({})

  lifecycle {
    ignore_changes = [secrets]
  }
}

Rotate the secret manually through the UI or API; after the initial create, Terraform no longer touches it.

Pitfall 2: Undetected state drift

A developer edits an alert rule through the UI for a demo and forgets to revert. The following week terraform apply overwrites the drift with the old version, and the alert stops behaving as intended.

Fix: run terraform plan on a schedule in CI and raise an alert when there is a diff:

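# Trigger (workflow level): on.schedule with cron '0 8 * * 1-5' (08:00 UTC, weekdays)
drift-detection:
  runs-on: ubuntu-latest
  steps:
    - run: |
        set +e
        terraform plan -detailed-exitcode
        code=$?
        # 2 = changes pending (drift); 1 = real error, so fail the job
        if [ "$code" -eq 2 ]; then echo "DRIFT_FOUND=true" >> "$GITHUB_ENV"; fi
        if [ "$code" -eq 1 ]; then exit 1; fi
    - if: env.DRIFT_FOUND == 'true'
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        gh issue create --title "Kibana drift detected" \
          --body "Terraform plan shows drift. Run: terraform apply"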

-detailed-exitcode returns 0 for no changes, 1 for an error, and 2 for pending changes (drift). Catch 2 and alert.

Pitfall 3: The provider does not support a new resource yet

Elastic ships a new alert type (e.g. ML anomaly) and the provider has not caught up, so Terraform cannot create it.

Fix: fall back to null_resource + local-exec with curl:

resource "null_resource" "ml_anomaly_rule" {
  # Re-run the provisioner whenever the rule definition changes
  triggers = {
    config_hash = filesha256("${path.module}/ml-anomaly.json")
  }

  # Note: POST creates a new rule on every run, so remove the old rule first
  provisioner "local-exec" {
    command = <<-EOT
      curl -sS --fail \
        -H "Authorization: ApiKey ${var.kibana_api_key}" \
        -H "kbn-xsrf: true" \
        -H "Content-Type: application/json" \
        -X POST "${var.kibana_endpoint}/api/alerting/rule" \
        -d @"${path.module}/ml-anomaly.json"
    EOT
  }
}

Not pretty, but it unblocks you. Once the provider adds support, refactor to the proper resource.
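
On Terraform 1.4 or later, the built-in terraform_data resource can play the same role without pulling in the null provider. The sketch below is one possible variant, not a recommendation from the provider docs: it also moves the API key into the provisioner's environment so it is not interpolated into the command string. KIBANA_API_KEY and KIBANA_URL are names chosen here for illustration.

```hcl
resource "terraform_data" "ml_anomaly_rule" {
  # Replace the resource, and so re-run the provisioner,
  # whenever the rule definition changes.
  triggers_replace = [
    filesha256("${path.module}/ml-anomaly.json"),
  ]

  provisioner "local-exec" {
    # Hypothetical variable names; the shell expands them, not Terraform.
    environment = {
      KIBANA_API_KEY = var.kibana_api_key
      KIBANA_URL     = var.kibana_endpoint
    }
    command = <<-EOT
      curl -sS --fail \
        -H "Authorization: ApiKey $KIBANA_API_KEY" \
        -H "kbn-xsrf: true" \
        -H "Content-Type: application/json" \
        -X POST "$KIBANA_URL/api/alerting/rule" \
        -d @"${path.module}/ml-anomaly.json"
    EOT
  }
}
```

Either way, Terraform tracks only the hash of the JSON file, not the rule itself, so drift on the rule stays invisible to terraform plan until a native resource replaces this workaround.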

Quick checklist

Task	HCL resource
Data view	`elasticstack_kibana_data_view`
Role	`elasticstack_elasticsearch_security_role`
User	`elasticstack_elasticsearch_security_user`
Space	`elasticstack_kibana_space`
Alert rule	`elasticstack_kibana_alerting_rule`
Connector	`elasticstack_kibana_action_connector`
Cluster setting	`elasticstack_elasticsearch_cluster_settings`
Ingest pipeline	`elasticstack_elasticsearch_ingest_pipeline`
Index template	`elasticstack_elasticsearch_index_template`
ILM policy	`elasticstack_elasticsearch_index_lifecycle`

Command	Description
`terraform init`	Initialize backend + provider
`terraform plan`	Show the delta
`terraform apply`	Apply changes
`terraform import`	Bring an existing resource into state
`terraform state list`	List resources in state
`terraform state rm`	Remove a resource from state (does not delete it)
`terraform workspace`	Per-environment state

Wrapping up

Terraform is not a silver bullet, but it is a strong declarative pattern once the infrastructure gets complex and many people touch it. Start small with data views + roles, then expand gradually. Do not try to migrate everything at once.

The Kibana from A to Z series officially moves into Part 7 with the next article: troubleshooting. Article 25 will be a checklist for debugging a Kibana that will not load, from the browser console down to Elasticsearch cluster state, working through each layer systematically instead of scratching your head and guessing.