SFT: supervised fine-tuning with an instruction dataset

In web dev, there is a clear difference between “a REST API that the client deals with directly” and “a convenient client SDK so that devs don't need to know the HTTP details”. Same backend, two ways to expose it: one for power users, one for everyday devs.

LLMs have a similar split between a base model (pretrained) and an instruction-tuned model (SFT). A base model is the raw model: all it knows is how to “continue the text”. Ask it “What is the capital of Vietnam?” and it may answer “What is the capital of Vietnam? What is the capital of Laos?”, because it is autocompleting a pattern of questions.

An instruction-tuned model is a model fine-tuned on (instruction, response) pairs in a standard format, anywhere from thousands to millions of them. Given the same question, it answers “Hanoi”. The difference is SFT (Supervised Fine-Tuning), the first step in the alignment pipeline.

This post walks through the data format for SFT, chat templates, loss masking, and an implementation using the trl library. By the end, you should be able to build a real SFT pipeline on a dataset of your own.

Mental model: what SFT is

SFT is training with the same cross-entropy objective as pretraining, but on a structured dataset:

Pretraining data:
  "The quick brown fox jumps over the lazy dog. ..."
  (raw text, the model predicts every token)

SFT data:
  Instruction: "Translate to French: Hello, world"
  Response:    "Bonjour, le monde"
  (the loss covers only the Response tokens, not the Instruction tokens)

A few key differences from pretraining:

	Pretraining	SFT
Data	Raw crawled text	Curated (instruction, response) pairs
Volume	Trillions of tokens	1K-1M samples (up to tens of millions of tokens)
Loss	On every token	Only on response tokens (loss masking)
Duration	Months	Hours to days
Learning rate	~3e-4	1e-5 (full fine-tune) to 2e-4 (LoRA), usually lower
Epochs	~1	1-3
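
To make the Loss row of the table concrete, here is a minimal sketch in plain Python. The probabilities are made-up numbers standing in for what a model assigned to each correct token, and `masked_nll` is a hypothetical helper, not a library function. The only difference between the two calls is which positions count toward the average.

```python
import math

def masked_nll(token_probs, loss_mask):
    """Mean negative log-likelihood over the positions where loss_mask is True.

    token_probs[i] is the probability the model gave to the correct token
    at position i (hypothetical values here, not real model output).
    """
    kept = [-math.log(p) for p, keep in zip(token_probs, loss_mask) if keep]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return sum(kept) / len(kept)

# Toy sequence: 3 instruction tokens followed by 2 response tokens.
probs = [0.10, 0.20, 0.05, 0.80, 0.50]

# Pretraining-style: every token contributes to the loss.
pretraining_loss = masked_nll(probs, [True] * 5)
# SFT-style: only the two response tokens contribute.
sft_loss = masked_nll(probs, [False, False, False, True, True])

print(f"pretraining-style loss: {pretraining_loss:.3f}")
print(f"SFT-style loss:         {sft_loss:.3f}")
```

The objective is the same cross-entropy in both cases; SFT just averages it over a subset of positions.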

SFT adds little new knowledge (most of the knowledge is already there from pretraining). What SFT teaches the model is to “format the response the way the user expects”.

Data format

SFT data comes in several formats, from simple to complex:

Format 1: Plain instruction-response (Alpaca style).

{
  "instruction": "Translate to French",
  "input": "Hello, world",
  "output": "Bonjour, le monde"
}

Format 2: ChatML / OpenAI messages.

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to French: Hello, world"},
    {"role": "assistant", "content": "Bonjour, le monde"}
  ]
}

Format 3: Multi-turn conversation.

{
  "messages": [
    {"role": "system", "content": "You are a coding tutor."},
    {"role": "user", "content": "Write a Python function to sum a list"},
    {"role": "assistant", "content": "def sum_list(lst):\n    return sum(lst)"},
    {"role": "user", "content": "Make it work for strings too"},
    {"role": "assistant", "content": "def sum_list(lst):\n    return sum(lst) if all(isinstance(x, (int, float)) for x in lst) else ''.join(map(str, lst))"}
  ]
}

Formats 2 and 3 became the standard over 2024-2026 because they support multi-turn conversations and system prompts. Most SFT pipelines today use this messages format.
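
Since many older datasets are still stored in Format 1, a small converter is often the first step of a pipeline. The sketch below is one possible way to do it; `alpaca_to_messages` and its `system_prompt` parameter are illustrative names, and joining instruction and input with a blank line is an assumption rather than a fixed convention.

```python
def alpaca_to_messages(record, system_prompt=None):
    """Convert an Alpaca-style record into a {"messages": [...]} record."""
    user_content = record["instruction"]
    # "input" is optional in Alpaca-style data and may be empty or missing.
    extra_input = (record.get("input") or "").strip()
    if extra_input:
        user_content = f"{user_content}\n\n{extra_input}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"]})
    return {"messages": messages}

example = {
    "instruction": "Translate to French",
    "input": "Hello, world",
    "output": "Bonjour, le monde",
}
converted = alpaca_to_messages(example, system_prompt="You are a helpful assistant.")
for message in converted["messages"]:
    print(message["role"], "->", repr(message["content"]))
```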

Popular public SFT datasets:

Dataset	Size	Language	Notes
`databricks/databricks-dolly-15k`	15K	English	Open source, diverse tasks
`tatsu-lab/alpaca`	52K	English	Generated with OpenAI text-davinci-003, noisy
`HuggingFaceH4/ultrachat_200k`	200K	English	Multi-turn, ChatGPT-generated
`OpenAssistant/oasst1`	161K	Multilingual	Human-curated (size counts messages)
`vilm/vicuna-format-zalo-ai-2023`	~10K	Vietnamese	Open-source Vietnamese data

Chat template

Each model family has its own chat template: the rule for converting a messages array into a single string before tokenization.

Llama-3 template:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Translate to French: Hello, world<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Bonjour, le monde<|eot_id|>

Mistral template:

<s>[INST] Translate to French: Hello, world [/INST] Bonjour, le monde</s>

Qwen template:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Translate to French: Hello, world<|im_end|>
<|im_start|>assistant
Bonjour, le monde<|im_end|>

Special tokens (<|eot_id|>, [INST], <|im_end|>) tell the model where the boundary between roles is. It is important to use the right template for each model, because a model only recognizes the boundaries it saw during training.
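
To show what a template does mechanically, here is a hand-written renderer for the ChatML-style layout shown above for Qwen. It is a sketch for illustration only: real templates ship with the tokenizer as Jinja templates, and `render_chatml` is a hypothetical function name. The `add_generation_prompt` flag shows the difference between a training string (the assistant turn is included) and an inference prompt (the assistant turn is left open for the model to fill in).

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in the ChatML-style layout: one delimited block per turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn,
        # so the model continues by writing the response.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to French: Hello, world"},
    {"role": "assistant", "content": "Bonjour, le monde"},
]

# Training string: the full conversation, assistant answer included.
training_text = render_chatml(conversation)
# Inference prompt: everything except the answer, plus an open assistant turn.
inference_prompt = render_chatml(conversation[:-1], add_generation_prompt=True)

print(training_text)
print(inference_prompt)
```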

The HuggingFace tokenizer has an apply_chat_template() method that handles this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to French: Hello, world"},
    {"role": "assistant", "content": "Bonjour, le monde"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)

This method looks up the tokenizer's own template (stored in tokenizer_config.json) and formats the messages with it. Devs don't need to remember which special tokens belong to which model.

Loss masking, the key trick of SFT

In pretraining, the model learns to predict every token. In SFT, the model should only learn to predict the response, not the instruction.

Why? Because the user instruction is input, not output. If the loss is computed on instruction tokens, the model also learns to “generate instructions” like a user, which is unnecessary.

How to implement it: mask the loss on instruction tokens and compute the loss only on response tokens.

def prepare_sft_example(messages, tokenizer, max_length=2048):
    # Render the whole conversation with the model's chat template
    full_text = tokenizer.apply_chat_template(messages, tokenize=False)
    # The template already inserts special tokens such as <|begin_of_text|>,
    # so don't let the tokenizer add a second BOS
    full_ids = tokenizer(
        full_text, max_length=max_length, truncation=True, add_special_tokens=False
    )["input_ids"]

    # Every token BEFORE the assistant response -> masked (label = -100)
    # Every token OF the assistant response -> kept (label = token id)
    labels = [-100] * len(full_ids)

    # Assistant header, specific to the template
    # Llama-3: <|start_header_id|>assistant<|end_header_id|>\n\n
    assistant_start_token = tokenizer.encode(
        "<|start_header_id|>assistant<|end_header_id|>", add_special_tokens=False
    )
    n = len(assistant_start_token)

    # Find the first occurrence and unmask the labels from there to the end
    for i in range(len(full_ids) - n + 1):
        if full_ids[i:i + n] == assistant_start_token:
            labels[i + n:] = full_ids[i + n:]
            break

    return {"input_ids": full_ids, "labels": labels}

PyTorch's cross_entropy uses ignore_index=-100 by default, so positions with label = -100 are skipped in the loss computation.

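prepare_sft_example above is simplified: it unmasks everything after the first assistant header. Production code usually handles every assistant turn in a multi-turn conversation, unmasking each assistant turn while keeping the user turns in between masked.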
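
A multi-turn version could look like the sketch below. It works on plain lists of token ids so it runs without a tokenizer; the ids used for the headers and the end-of-turn token are made up for the example, and `mask_all_but_assistant` is a hypothetical helper, not part of any library. It assumes every assistant turn starts with a fixed header sequence and ends with a single end-of-turn token, as in the Llama-3 and Qwen templates shown earlier.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def find_sublist(haystack, needle, start=0):
    """Return the first index >= start where needle occurs in haystack, else -1."""
    for i in range(start, len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def mask_all_but_assistant(input_ids, assistant_header, end_of_turn_id):
    """Build labels that keep only assistant turns (with their end-of-turn token)."""
    labels = [IGNORE_INDEX] * len(input_ids)
    cursor = 0
    while True:
        header_at = find_sublist(input_ids, assistant_header, cursor)
        if header_at == -1:
            break
        turn_start = header_at + len(assistant_header)
        turn_end = turn_start
        # Walk to the end-of-turn token, or to the end of a truncated sequence.
        while turn_end < len(input_ids) and input_ids[turn_end] != end_of_turn_id:
            turn_end += 1
        # Keep the end-of-turn token too, so the model learns when to stop.
        stop = min(turn_end + 1, len(input_ids))
        labels[turn_start:stop] = input_ids[turn_start:stop]
        cursor = stop
    return labels

# Made-up ids: [1, 2] = user header, [1, 3] = assistant header, 9 = end of turn.
ids = [1, 2, 10, 11, 9,  1, 3, 20, 21, 9,  1, 2, 12, 9,  1, 3, 22, 9]
labels = mask_all_but_assistant(ids, assistant_header=[1, 3], end_of_turn_id=9)
print(labels)
```

Only the two assistant turns keep their token ids as labels; both user turns and all headers stay at -100.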

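The trl library (Transformer Reinforcement Learning) provides an SFTTrainer class that can handle this masking for you. Whether response-only loss is on by default depends on the trl version and the dataset format, so check the documentation for the version you install. Using it is recommended over implementing masking by hand.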

Implementation with trl

trl is HuggingFace's official library for SFT/DPO/PPO. It wraps the transformers Trainer with SFT-specific features.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

base_model = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype="bfloat16",
    device_map="auto",
)

dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_dolly(example):
    messages = [
        {"role": "user", "content": f"{example['instruction']}\n\n{example['context']}".strip()},
        {"role": "assistant", "content": example["response"]},
    ]
    return {"messages": messages}

dataset = dataset.map(format_dolly).remove_columns(["instruction", "context", "response", "category"])

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_config = SFTConfig(
    output_dir="./sft-llama3-2-1b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_steps=200,
    max_seq_length=2048,  # newer trl releases name this max_length
    packing=True,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=training_config,
    tokenizer=tokenizer,  # newer trl releases name this processing_class
)

trainer.train()
trainer.save_model("./sft-llama3-2-1b/final")

A few important points:

packing=True: combines several short examples into one long sequence so that batches are filled with real tokens instead of padding. This increases throughput noticeably.

max_seq_length=2048: examples longer than this are truncated. Llama-3.2 supports a 128K context, but SFT on short instruction data doesn't need that length.

learning_rate=2e-4: higher than for a full fine-tune (around 1e-5) because this run uses LoRA. LoRA tolerates a higher learning rate.

num_train_epochs=1: one epoch is enough for SFT in most cases. 2-3 epochs can overfit.
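
The packing point can be illustrated with a toy version of the idea. This is a sketch of the simplest strategy (concatenate, then cut into fixed-size blocks), and trl's actual implementation may differ by version. The token ids, `eos_id=0`, and both function names are assumptions made up for the example.

```python
def pack_examples(tokenized_examples, eos_id, block_size):
    """Concatenate examples (each followed by EOS) and cut into fixed-size blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)  # separator so examples stay distinguishable
    # Drop the incomplete tail so every block has exactly block_size tokens.
    usable = len(stream) - len(stream) % block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

def padding_waste(tokenized_examples, block_size):
    """Fraction of positions that would be padding with one example per block."""
    real = sum(min(len(ids), block_size) for ids in tokenized_examples)
    return 1 - real / (len(tokenized_examples) * block_size)

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13], [14]]

blocks = pack_examples(examples, eos_id=0, block_size=4)
waste = padding_waste(examples, block_size=4)

print("packed blocks:", blocks)
print(f"padding without packing: {waste:.1%}")
```

Without packing, these four examples would occupy four blocks with a large share of padding; packed, they fit into three full blocks. The trade-off is that one block can contain pieces of several examples.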

Evaluating the SFT model

After training, how do you know the model is good? There are three ways:

1. Loss curve. A training loss that decreases smoothly and an eval loss that does not go back up (a sign of overfitting) are good signs. A final loss of around 0.5-1.2 is common.

2. Manual eval. Generate a few answers and read them to check whether the model follows instructions.

from transformers import pipeline

pipe = pipeline("text-generation", model="./sft-llama3-2-1b/final", tokenizer=tokenizer)

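messages = [{"role": "user", "content": "Write a short poem about Hanoi"}]
out = pipe(messages, max_new_tokens=200, do_sample=True, temperature=0.7)
# With chat-style input, generated_text is the full message list; the last entry is the reply
print(out[0]["generated_text"][-1]["content"])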

3. Benchmark. Run public benchmarks such as MT-Bench or AlpacaEval, and compare the scores against the base model and GPT-3.5.
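
For the loss-curve check in item 1, two small helpers can make the numbers easier to read. This is a sketch under assumptions: the eval losses are invented, and `overfit_onset` with its `patience` parameter is a hypothetical heuristic, not a library function. Perplexity is simply exp(loss), so a loss of 0.5-1.2 corresponds to a perplexity of roughly 1.6-3.3.

```python
import math

def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)

def overfit_onset(eval_losses, patience=2):
    """Return the index of the best eval loss if at least `patience`
    later evaluations failed to improve on it, else None."""
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    evaluations_since_best = len(eval_losses) - 1 - best_index
    return best_index if evaluations_since_best >= patience else None

# Hypothetical eval losses, one per evaluation step.
eval_history = [1.42, 1.10, 0.97, 0.95, 0.99, 1.06]
onset = overfit_onset(eval_history)

print(f"perplexity at loss 0.5: {perplexity(0.5):.2f}")
print(f"perplexity at loss 1.2: {perplexity(1.2):.2f}")
print("best checkpoint index:", onset)
```

If `overfit_onset` returns an index, the checkpoint saved at that evaluation is usually a better candidate than the last one.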

One dev once ran SFT on a 1B model, watched the loss drop to 0.4, and concluded “the model is good”. At real inference time, the model produced output that looked like the training templates without actually understanding the instruction: the low loss came from overfitting. The lesson: loss alone is not enough, you have to evaluate real generations.

Pitfall: dataset quality > dataset size

A real-world experiment: SFT of Llama-3.2-1B on two datasets:

Dataset A: 50K samples, mixed quality (some noisy samples, inconsistent formatting)
Dataset B: 5K samples, hand-curated, high quality

Dataset A: training loss 0.6, mediocre results in real-world eval. Dataset B: training loss 0.8, excellent results in real-world eval.

Why? Because SFT mostly teaches format rather than adding knowledge. 5K high-quality examples are enough to teach the format. 50K noisy examples teach a noisy format.

This finding is supported by the paper “LIMA: Less Is More for Alignment” (Meta, 2023): about 1,000 hand-curated examples were enough to teach a pretrained model to behave like a good assistant. A large dataset does not substitute for quality.

Rule of thumb for SFT datasets:

Factor	Importance
Quality (consistent format, no noise)	Critical
Diversity (variety of tasks)	High
Size (number of samples)	Medium
Length per sample	Medium

Curating 1K-10K good examples beats scraping 100K mediocre ones.
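
A first pass at curation can be automated before any manual review. The filter below is a minimal sketch with assumed heuristics (the conversation must end with an assistant turn, the response must be non-empty, and exact duplicates after whitespace and case normalization are dropped); `filter_sft_records` and its thresholds are illustrative, not a standard.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_sft_records(records, min_response_chars=1):
    """Drop malformed, empty-response, and duplicate conversations."""
    seen = set()
    kept = []
    for record in records:
        messages = record.get("messages", [])
        # A conversation must end with an assistant turn to be trainable.
        if not messages or messages[-1].get("role") != "assistant":
            continue
        response = messages[-1].get("content", "").strip()
        if len(response) < min_response_chars:
            continue
        key = tuple((m["role"], normalize(m["content"])) for m in messages)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept

def chat(user, assistant):
    return {"messages": [{"role": "user", "content": user},
                         {"role": "assistant", "content": assistant}]}

raw = [
    chat("Translate to French: Hello", "Bonjour"),
    chat("translate to french:   hello", "bonjour"),   # duplicate after normalization
    chat("Summarize this text", "   "),                # empty response
    {"messages": [{"role": "user", "content": "Hi"}]},  # no assistant turn
    chat("What is 2 + 2?", "4"),
]
clean = filter_sft_records(raw)
print(f"kept {len(clean)} of {len(raw)} records")
```

Filters like this only catch mechanical problems; judging whether a response is actually good still needs human or model-based review.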

Quick notes

SFT pipeline components
Dataset (instruction-response pairs)
Chat template (model-specific)
Tokenizer with apply_chat_template()
Loss masking (compute loss only on the response)
SFTTrainer (trl library)
LoRA / QLoRA (memory efficient)
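
Why LoRA is memory efficient comes down to simple arithmetic: instead of updating a full weight matrix, it trains two thin matrices of rank r next to the frozen weight. The sketch below uses r=16 and lora_alpha=32 from the trl example; the 2048x2048 layer size is an assumption chosen for illustration, and `lora_extra_params` is a hypothetical helper.

```python
def lora_extra_params(in_features, out_features, rank):
    """Trainable parameters LoRA adds to one linear layer:
    A has shape (rank, in_features), B has shape (out_features, rank)."""
    return rank * (in_features + out_features)

# Assumed layer size for illustration: a square 2048 x 2048 projection.
in_features = out_features = 2048
rank, lora_alpha = 16, 32

full_params = in_features * out_features
extra_params = lora_extra_params(in_features, out_features, rank)
trainable_fraction = extra_params / full_params
# The low-rank update is scaled by lora_alpha / rank before being added.
scaling = lora_alpha / rank

print(f"frozen weights:    {full_params:,}")
print(f"LoRA weights:      {extra_params:,}")
print(f"trainable share:   {trainable_fraction:.2%}")
print(f"update scaling:    {scaling}")
```

Gradients and optimizer state are only needed for the small trainable share, which is where most of the memory saving comes from.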

Common SFT hyperparameters	Value
Learning rate	1e-5 (full FT) to 2e-4 (LoRA)
Effective batch size	32-128 samples
Epochs	1-3
Warmup ratio	3-5%
LR scheduler	cosine
Max seq length	1024-4096
Weight decay	0 or 0.01
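
Several rows of this table interact: effective batch size determines the number of optimizer steps, and the warmup ratio and cosine scheduler determine the learning rate at each of those steps. The sketch below works this through for the settings of the trl example, assuming a single GPU, a round 15,000 training sequences, and no packing. `lr_at_step` is a hand-written approximation of a warmup-plus-cosine schedule, and its rounding may differ slightly from the library's.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Assumed setup: values from the trl example, one GPU, 15,000 sequences.
num_sequences = 15_000
per_device_batch, grad_accum, num_gpus = 4, 4, 1
peak_lr, warmup_ratio = 2e-4, 0.03

effective_batch = per_device_batch * grad_accum * num_gpus
total_steps = math.ceil(num_sequences / effective_batch)

print("effective batch size:", effective_batch)
print("optimizer steps per epoch:", total_steps)
for step in (0, 14, 28, total_steps // 2, total_steps):
    lr = lr_at_step(step, total_steps, peak_lr, warmup_ratio)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With these settings the effective batch size is 16, which is below the 32-128 range in the table; raising gradient_accumulation_steps is the usual way to increase it without using more memory.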

trl SFTConfig key options
`packing=True`
`max_seq_length=2048`
`dataset_text_field="text"` (if the dataset has a text column)
`formatting_func` (custom format function)
`dataset_kwargs={"add_special_tokens": False}`

Eval methods
Loss curve check
Manual generation test
MT-Bench, AlpacaEval (LLM judge)
Task-specific benchmark (translation BLEU, summarization ROUGE)

Wrapping up

SFT is the “feel-good” step of the alignment pipeline: clean data, clear code, falling loss, and visibly better model behavior. It is usually the starting point for any dev who wants to fine-tune their own model. A solid grasp of SFT is also the foundation for DPO and RLHF (post 20), which are more complex techniques.

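Hands-on exercises to go along with this post:

On a free Colab T4 (16GB), copy the code from the trl section and run it with Llama-3.2-1B. The T4 has no native bf16 support, so the bf16 settings in that code may need to be changed there. After about 30 minutes you will have an SFT adapter. Save it and load it back to test.
Create your own dataset: 50 instruction-response examples written by hand (about 10 minutes) or generated with the gpt-3.5-turbo API (which costs a little). Fine-tune on those 50 examples and verify that the model picks up the pattern.
Compare the base model and the SFT model on the same 5 questions. The base model's output usually “continues the text”, while the SFT model's output usually “answers the question”. The difference is very clear.
Read the LIMA paper (Zhou et al., 2023, “LIMA: Less Is More for Alignment”). It is a short paper, about 9 pages, a 30-minute read. Its argument is that most of the knowledge is already in place after pretraining and SFT mainly teaches format.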

Post 20 covers DPO and RLHF: how to train a model on preference data (response A is better than response B). These are steps 2 and 3 of the alignment pipeline used for ChatGPT/Claude. They are more complex than SFT but can give considerably higher quality.