Mixture of Experts (MoE): Mixtral, DeepSeek architecture

Mixtral-8x7B, released in late 2023, puzzled much of the ML community. The name says “8x7B”, which suggests 56B, yet the actual parameter count is about 47B. And at inference time the model only “runs” about 13B params per token. A 47B model whose per-token compute is that of a 13B model.

A year later, DeepSeek-V3 pushed further: 671B total params, 37B active. It was one of the largest open-weight models released at the time, and its inference cost is not outrageous.

There is no magic here. It's Mixture of Experts (MoE), an architecture that decouples “model size” (knowledge) from “compute per token” (cost). This post goes from the core idea to the details of Mixtral and DeepSeek.

For: devs who already understand the transformer block (post 12) and want to understand the most modern direction in LLM scaling.

Mental model: sparse activation

Traditional (“dense”) transformer: on every forward pass, every neuron and every weight takes part in the computation for every token. Compute is proportional to the number of params.

MoE: only a small fraction of the weights is activated for each token. Total params can be very large while compute per token stays small.

Analogy: a library with 1 million books. Dense LLM = every question reads all 1 million books. MoE = for every question a router (the librarian) fetches just the 5 relevant books and reads those. The result is comparable because most questions don't need the whole library.

Dense model 47B:
  each token -> activates 47B params -> ~94 GFLOPs (about 2 FLOPs per param)

MoE 47B (Mixtral):
  each token -> router picks 2/8 experts -> activates 13B params -> ~26 GFLOPs

Same knowledge capacity, about 3.6x less compute per token.
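The comparison above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes the common approximation of 2 FLOPs (one multiply, one add) per active weight per token in the forward pass; the function name is made up for illustration, and attention-over-context cost is ignored.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: one multiply and one add per active weight."""
    return 2.0 * active_params


dense_flops = forward_flops_per_token(47e9)  # dense: every weight is active
moe_flops = forward_flops_per_token(13e9)    # Mixtral: ~13B active per token

print(f"dense 47B:            {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"MoE 47B (13B active): {moe_flops / 1e9:.0f} GFLOPs per token")
print(f"ratio:                {dense_flops / moe_flops:.1f}x")
```

The ratio depends only on total vs active params, which is why the same estimate works for any MoE.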

Part 1: Basic architecture

In a dense Transformer, each block has two main parts: attention and MLP (feed-forward).

MoE replaces the MLP in the blocks with an MoE layer. Attention stays dense.

Dense block:
  Input -> [Attention] -> [MLP] -> Output

MoE block:
  Input -> [Attention] -> [Router + N experts] -> Output

The router is a small network (usually a single linear layer) that takes the input vector and decides which experts this token should go through (it picks the top-K out of N experts).

The experts are N independent MLPs, each with the same structure as an ordinary dense MLP.

Token vector
    |
    v
[Router] -> "send this token to expert 3 and expert 7"
    |
    v
Expert 3: MLP_3(token) -> out_3
Expert 7: MLP_7(token) -> out_7
    |
    v
Output = w_3 * out_3 + w_7 * out_7   (weighted sum)

w_3 and w_7 are the weights produced by the router (a softmax over the top-K scores).
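As a small worked example of the routing step in the diagram, the sketch below picks the top-2 experts for one token and turns their logits into mixing weights. The logits are invented for illustration, and `route_top_k` is a hypothetical helper, not a function from any library.

```python
import math


def route_top_k(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs; weights are a softmax over the top-k logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)                 # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router logits for one token over 8 experts (indices 0..7).
logits = [0.1, -1.2, 0.3, 2.0, -0.5, 0.0, 0.4, 1.0]
choices = route_top_k(logits, k=2)
for expert, weight in choices:
    print(f"expert {expert}: weight {weight:.3f}")
```

With these logits the token goes to experts 3 and 7, and the two weights sum to 1, so the layer output is a convex combination of the two expert outputs.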

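Part 2: Why MoE scales well

Empirical observations from the community:

1. Cost per token does not grow with total params. Mixtral 47B inference compute is close to that of a 13B dense model (because only 13B is active). You pay for the experts in memory, not in compute.

2. Quality still scales with total params. Mistral AI reports Mixtral 47B matching or beating dense Llama 2 70B on the benchmarks it evaluated, despite only 13B active. A nice trade-off.

3. Implicit specialization. After training, experts tend to specialize without being told to, and the router learns how to dispatch. Don't expect clean topics like “expert 1 does code, expert 3 does math” though: the routing analysis in the Mixtral paper found no obvious topic-level pattern and saw routing follow syntax more than domain.

4. Memory bottleneck instead of compute bottleneck. Decoding speed is usually limited by memory bandwidth (reading weights from VRAM), not by FLOPs. An MoE has far more params than a dense model with the same active size, and all of them must sit in VRAM. Once a batch contains several tokens, those tokens hit different experts, so most expert weights get read anyway. That is why MoE often runs slower than a dense model with the same active params on consumer hardware.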
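To see why point 4 bites as soon as you batch requests, here is a back-of-the-envelope estimate of how many distinct experts a batch touches in one MoE layer. It assumes routing is uniform and independent across tokens, which real routers only approximate; the function name is made up for this sketch.

```python
def expected_experts_touched(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch, assuming uniform, independent routing."""
    # A given expert is skipped by one token with probability 1 - K/N.
    miss_all = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - miss_all)


for batch in (1, 4, 16):
    hit = expected_experts_touched(num_experts=8, top_k=2, batch_tokens=batch)
    print(f"batch={batch:>2}: ~{hit:.1f} of 8 experts read per MoE layer")
```

Under these assumptions a single token reads 2 experts, but a batch of 16 tokens already reads almost all 8, so the bandwidth advantage of sparsity shrinks quickly with batch size.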

Part 3: Mixtral, the 8x7B recipe

Mixtral-8x7B (Mistral AI, 2023) was the first open-weight MoE to make a big splash. Structure:

32 transformer blocks
Each block: dense attention + MoE layer
MoE layer: 8 experts, about 176M params per expert per layer (~5.6B per expert summed over all 32 layers)
Router picks the top-2 experts for each token
Total: ~47B params (not 56B, because only the MLP is replicated 8 times; attention, embeddings, and norms exist once and are shared)
Active: ~13B params per token
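The 47B and 13B figures can be re-derived from the layer dimensions. The sketch below uses the dimensions from Mixtral's published model config (hidden size 4096, expert MLP size 14336, 8 KV heads of 128 dims, vocabulary 32000); those numbers are not stated in this article, and the tiny router and norm weights are ignored, so treat the result as an approximation.

```python
HIDDEN, FFN, LAYERS = 4096, 14336, 32
EXPERTS, TOP_K = 8, 2
KV_DIM, VOCAB = 8 * 128, 32000  # grouped-query attention: 8 KV heads x 128 dims

expert_per_layer = 3 * HIDDEN * FFN                          # gate, up, and down projections
attn_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q and o, plus k and v
embeddings = 2 * VOCAB * HIDDEN                              # input embedding + output head

non_expert = LAYERS * attn_per_layer + embeddings            # exists once, used by every token
total = non_expert + LAYERS * EXPERTS * expert_per_layer
active = non_expert + LAYERS * TOP_K * expert_per_layer

print(f"per expert per layer: {expert_per_layer / 1e6:.0f}M")
print(f"total params:  {total / 1e9:.1f}B")
print(f"active params: {active / 1e9:.1f}B")
```

Only the expert term is multiplied by 8, which is why the total lands well below 8 x 7B.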

Sketch in PyTorch (FeedForward is a simplified stand-in for the expert MLP):

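import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Stand-in expert MLP; the 4x hidden size is an assumption for this sketch."""

    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.up = nn.Linear(dim, hidden_mult * dim)
        self.down = nn.Linear(hidden_mult * dim, dim)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class MoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            FeedForward(dim) for _ in range(num_experts)
        ])

    def forward(self, x):
        # x shape: [batch, seq_len, dim]
        # Router scores, one logit per expert
        scores = self.router(x)  # [batch, seq_len, num_experts]
        top_k_scores, top_k_idx = scores.topk(self.top_k, dim=-1)
        # Softmax over the K selected logits only, so the weights sum to 1
        top_k_weights = F.softmax(top_k_scores, dim=-1)

        # Each token goes through its top_k selected experts
        output = torch.zeros_like(x)
        for i in range(self.top_k):
            expert_idx = top_k_idx[..., i]        # [batch, seq_len]
            weight = top_k_weights[..., i:i+1]    # [batch, seq_len, 1]
            for e in range(self.num_experts):
                mask = (expert_idx == e)          # tokens whose i-th choice is expert e
                if mask.any():
                    expert_out = self.experts[e](x[mask])  # [n_tokens, dim]
                    output[mask] += weight[mask] * expert_out
        return output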

Real implementations use techniques such as scatter_add and Triton kernels to run fast, but the logic is the same as above.
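One idea those faster implementations share is to group tokens by expert first, so each expert runs once on all of its tokens instead of being probed in a nested loop. The following is a minimal pure-Python sketch of that grouping step with made-up routing decisions; `group_by_expert` is a hypothetical helper, not an API of any engine.

```python
from collections import defaultdict


def group_by_expert(top_k_idx, top_k_weights):
    """Map expert id -> list of (token position, routing weight)."""
    groups = defaultdict(list)
    for token, (experts, weights) in enumerate(zip(top_k_idx, top_k_weights)):
        for expert, weight in zip(experts, weights):
            groups[expert].append((token, weight))
    return dict(groups)


# Hypothetical routing decisions for 4 tokens, top-2 of 8 experts.
top_k_idx = [[3, 7], [3, 1], [7, 3], [0, 1]]
top_k_weights = [[0.7, 0.3], [0.6, 0.4], [0.8, 0.2], [0.5, 0.5]]

groups = group_by_expert(top_k_idx, top_k_weights)
for expert in sorted(groups):
    print(f"expert {expert}: {groups[expert]}")
```

Each expert then processes its token list as one batch, and the weighted results are scattered back to the original token positions. Experts that received no tokens never appear in the mapping and cost nothing.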

Part 4: DeepSeek, pushing MoE further

DeepSeek-V2 (2024) and V3 (late 2024) push MoE to a much larger scale. DeepSeek-V3 specs:

671B total params
37B active per token
256 routed experts per MoE layer (vs 8 in Mixtral)
Top-8 routing (vs top-2)
Shared expert (1 expert that is always active for every token)

DeepSeek's main innovations:

1. Fine-grained experts. Many small experts instead of a few big ones: 256 experts of roughly 44M params each per layer vs 8 experts of roughly 176M each per layer in Mixtral (figures derived from the published model configs). This allows finer specialization and far more possible expert combinations per token.

2. Shared experts. 1-2 experts are always active for every token and learn common knowledge. The other experts learn specialized knowledge. This reduces redundancy among the routed experts.

3. Auxiliary-loss-free load balancing. A chronic MoE problem: the router tends to “fall in love” with a few experts (they get trained more, get better, and get used even more) and neglects the rest. The traditional auxiliary loss adds a penalty for imbalance. DeepSeek proposes a mechanism that does not need an auxiliary loss as its main tool, which is simpler.

4. Multi-token prediction. Train the model to predict the next 2 tokens at once, not just 1. This adds training signal without adding inference cost (the second predictor is dropped at inference).
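To put a number on the fine-grained experts point above, one can count how many distinct expert sets a single token can be routed to in one layer. This is plain combinatorics with the expert counts quoted in this post; it says nothing about how many of those combinations training actually uses.

```python
import math

mixtral_combos = math.comb(8, 2)      # choose 2 of 8 experts
deepseek_combos = math.comb(256, 8)   # choose 8 of 256 routed experts

print(f"Mixtral top-2 of 8:       {mixtral_combos} combinations")
print(f"DeepSeek-V3 top-8 of 256: {deepseek_combos:.2e} combinations")
```

The gap is many orders of magnitude (28 versus roughly 4 x 10^14), which is the intuition behind splitting capacity into many small experts.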

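Result: DeepSeek-V3 is comparable to GPT-4o on many benchmarks, with a reported training cost of about $5.5M for the final training run (GPU time only, excluding research and ablation runs), versus widely cited estimates of over $100M for GPT-4.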

Part 5: Load balancing, the most practical problem

An empirical observation from MoE training: the router plays favorites. A typical failure pattern: after a few thousand steps, about 80% of tokens flow through 20% of the experts, while the remaining 80% of experts see only a few hundred tokens. The result is dead experts: experts that receive almost no gradient and learn nothing.

Two fixes:

Auxiliary loss (Switch Transformer, 2021): add a term to the loss that penalizes imbalance.

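import torch
import torch.nn.functional as F


def aux_loss(router_logits, top_k_idx, num_experts):
    # router_logits: [batch, seq_len, num_experts]; top_k_idx: [batch, seq_len, top_k]
    # How often each expert was chosen (a count, so no gradient flows through it)
    expert_counts = torch.bincount(top_k_idx.flatten(), minlength=num_experts)
    fraction_per_expert = expert_counts.float() / expert_counts.sum()

    # How much router probability each expert receives on average (differentiable)
    router_probs = F.softmax(router_logits, dim=-1)
    mean_prob = router_probs.mean(dim=(0, 1))

    return num_experts * (fraction_per_expert * mean_prob).sum()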

This loss is large when routing is imbalanced and drops to 1 when routing is perfectly uniform. It is trained together with the main loss, scaled by a small coefficient.
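A quick numeric check makes the behavior of this loss concrete. The sketch below applies the same formula to two hand-written routing distributions over 4 experts; the numbers are invented, and the pure-Python function only mirrors the final line of the loss, not the full tensor code.

```python
def switch_aux_loss(fraction_per_expert, mean_prob):
    """num_experts * sum_i(fraction_i * mean_prob_i), as in the final line of aux_loss."""
    n = len(fraction_per_expert)
    return n * sum(f * p for f, p in zip(fraction_per_expert, mean_prob))


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]   # one expert takes 70% of tokens and probability

balanced_loss = switch_aux_loss(uniform, uniform)
skewed_loss = switch_aux_loss(skewed, skewed)

print(f"balanced routing: {balanced_loss:.2f}")
print(f"skewed routing:   {skewed_loss:.2f}")
```

The skewed case scores about twice the balanced case, so gradient descent on this term pushes router probability away from the overloaded expert.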

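Auxiliary-loss-free (DeepSeek-V3): keep one bias per expert that is used only when picking the top-K, not when computing the output weights. The bias is adjusted by a simple rule after each step rather than learned by gradient descent: when an expert is picked too often, lower its bias so it is less likely to be chosen on the next step; when it is underused, raise it. Routing balances itself without a separate loss term.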
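The bias mechanism can be illustrated with a toy simulation. This is only a sketch under strong assumptions: 4 experts, top-1 routing, a fixed made-up preference per expert plus Gaussian noise standing in for real router scores, and a constant step size `gamma`. It is not DeepSeek's implementation.

```python
import random


def max_load_fraction(steps, gamma, use_bias, seed=0):
    """Run a toy router and return the busiest expert's share of tokens on the last step."""
    rng = random.Random(seed)
    affinity = [2.0, 1.0, 0.5, 0.0]   # hypothetical built-in preference: expert 0 is the favorite
    bias = [0.0] * len(affinity)
    fractions = []
    for _ in range(steps):
        counts = [0] * len(affinity)
        for _ in range(256):          # tokens in one batch
            scores = [a + rng.gauss(0.0, 1.0) + b for a, b in zip(affinity, bias)]
            counts[scores.index(max(scores))] += 1   # top-1 selection on biased scores
        mean_load = sum(counts) / len(counts)
        if use_bias:
            # Rule-based update, no gradient: overloaded experts go down, the rest go up.
            bias = [b - gamma if c > mean_load else b + gamma for b, c in zip(bias, counts)]
        fractions = [c / 256 for c in counts]
    return max(fractions)


before = max_load_fraction(steps=1, gamma=0.05, use_bias=False)
after = max_load_fraction(steps=300, gamma=0.05, use_bias=True)
print(f"busiest expert share without bias: {before:.2f}")
print(f"busiest expert share with bias:    {after:.2f}")
```

In this toy setup the favorite expert starts with well over half of the tokens, and the bias updates pull every expert toward an equal share of about a quarter each.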

Both methods add overhead, but they are necessary. No load balancing -> MoE training breaks.

Part 6: MoE inference, is it harder than dense?

Yes. Two practical problems:

1. Memory bandwidth. Each token needs the weights of its K chosen experts. Different tokens = different experts = a chaotic memory access pattern. The GPU cache is hard to exploit.

2. Communication in distributed serving. When an MoE is served across multiple GPUs (expert parallelism), each token must be sent to the GPU that holds its experts. Latency rises because of the all-to-all communication.
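The communication problem in point 2 is easy to picture by counting where tokens have to travel. The sketch below assumes a simple contiguous placement (2 experts per GPU, 8 experts on 4 GPUs) and made-up routing decisions; real serving engines choose placement and scheduling differently.

```python
def tokens_per_gpu(top_k_idx, experts_per_gpu):
    """Count how many (token, expert) pairs land on each GPU under expert parallelism."""
    load = {}
    for experts in top_k_idx:
        for expert in experts:
            gpu = expert // experts_per_gpu   # contiguous placement: experts 0-1 on GPU 0, 2-3 on GPU 1, ...
            load[gpu] = load.get(gpu, 0) + 1
    return load


# Hypothetical routing decisions for 4 tokens, top-2 of 8 experts.
top_k_idx = [[3, 7], [3, 1], [7, 3], [0, 1]]
load = tokens_per_gpu(top_k_idx, experts_per_gpu=2)
for gpu in range(4):
    print(f"GPU {gpu}: {load.get(gpu, 0)} token-expert pairs")
```

Here GPU 2 sits idle while GPUs 0 and 1 each handle three pairs, and every pair whose token lives on a different GPU costs a transfer in each direction. That is the all-to-all traffic, and it is also why load balancing matters at inference time too.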

Engines optimized for MoE:

vLLM has supported MoE since 2024, with optimized kernels for Mixtral and DeepSeek
SGLang is well structured for MoE
TGI (HuggingFace) supports MoE

Pitfall: running MoE on an old engine -> speed worse than even a dense model of the same size. Update your engine.

Part 7: When MoE beats dense

Rules:

MoE is better when:

You have plenty of VRAM (all params must be loaded even though few are active)
You have a large training cluster (training FLOPs per token track active params, but memory and communication scale with total params)
You need a model that “knows a lot” (lots of knowledge, varied tasks)
Inference throughput matters (fewer active params -> less compute)

Dense is better when:

VRAM is limited (consumer GPU)
Single-request latency matters
The task is narrow and a small model is enough
The training budget is limited (MoE requires careful load-balancing tuning)

Practical rule: when deploying to consumer hardware (laptop, single GPU), a Q4 dense model is usually a better choice than an FP16 MoE. When deploying in a data center, MoE wins.

Part 8: Practical pitfalls

Pitfall 1: Comparing MoE with dense by total params.

Mixtral 47B is not equivalent to a 47B dense Llama. For compute and speed, Mixtral's 13B active params put it next to a 13B dense model. For quality, it lands somewhere between its active and total size. Compare compute by active params, not total params.

Heuristic: MoE quality ~ a dense model with params = sqrt(total × active). Mixtral 47B (13B active): quality ~ dense 25B. This number is a rough rule of thumb from benchmarks, not a law, and it can be conservative.
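The heuristic is a geometric mean, so it is easy to apply to any model. A minimal sketch, with sizes given in billions of params and a function name chosen for illustration:

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Geometric mean of total and active params (in billions); a rule of thumb, not a law."""
    return math.sqrt(total_b * active_b)


print(f"Mixtral (47B total, 13B active):      ~{dense_equivalent_b(47, 13):.0f}B dense")
print(f"DeepSeek-V3 (671B total, 37B active): ~{dense_equivalent_b(671, 37):.0f}B dense")
```

For Mixtral this gives about 25B, matching the figure above; for DeepSeek-V3 it gives about 158B. Both numbers inherit all the caveats of the heuristic.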

Pitfall 2: Running MoE on an engine that doesn't support it.

An old engine loads the MoE like a dense model and is painfully slow. Plain Hugging Face Transformers does support MoE, but its kernels are not optimized. On A100-class hardware, expect something on the order of 15 tok/s with the naive path vs 100+ tok/s with vLLM for Mixtral; exact numbers depend on batch size, quantization, and GPU count.

Check before you deploy: does the engine have MoE-specific kernels?

Pitfall 3: Fine-tuning MoE like dense.

Fine-tuning MoE is more complex than fine-tuning dense. The router learns too, and aggressive fine-tuning can wreck load balancing. LoRA on MoE: you can apply it only to the experts (not the router) or only to the router (not the experts), depending on the goal.

When unsure: start with a low LoRA rank (8-16) and monitor router behavior.

Pitfall 4: Blindly trusting “active params” for memory.

Mixtral has 13B active, but it needs ~94GB of VRAM in FP16 (all 47B weights are loaded, at 2 bytes each). You can't run Mixtral in FP16 on a 24GB RTX 4090. You have to quantize, and even a Q4 build (~26GB) needs a smaller quant or partial CPU offload to fit.

Active params = compute per token. Total params = memory footprint. Two different numbers, don't mix them up.
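The memory side of this rule is one multiplication. The sketch below estimates weight memory only (no KV cache, activations, or runtime overhead) and assumes roughly 4.5 bits per weight for a Q4_K_M-style quantization, which is an approximation rather than an exact format spec.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


MIXTRAL_TOTAL = 47e9  # total params, not active params

fp16_gb = weight_memory_gb(MIXTRAL_TOTAL, 16)
q4_gb = weight_memory_gb(MIXTRAL_TOTAL, 4.5)   # assumed average bits per weight for Q4_K_M

print(f"FP16: ~{fp16_gb:.0f} GB")
print(f"Q4:   ~{q4_gb:.0f} GB")
```

Plugging in the 13B active figure instead would suggest about 26 GB for FP16, which is the mistake this pitfall warns about.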

Quick checklist

Concept	Essence
MoE	Sparse model, each token activates only K/N experts
Router	Small network that picks the experts for each token
Top-K routing	Each token passes through K experts (K=2 in Mixtral, K=8 in DeepSeek-V3)
Total params	Memory footprint
Active params	Compute per token
Load balancing	Keeps the router from playing favorites
Auxiliary loss	Penalty for imbalance
Shared experts	Always-active experts that learn common knowledge
Expert parallelism	Spreading experts across multiple GPUs

Heuristics:

Mixtral 47B ~ dense 25B quality, ~ dense 13B compute
DeepSeek-V3 671B ~ GPT-4o-class quality on many benchmarks, ~ dense 37B compute
Total / active ratio = “knowledge density” (about 3.6 for Mixtral, about 18 for DeepSeek-V3)
VRAM needed = total params × precision_bytes (not active)

Wrapping up

MoE is one of the most important scaling tricks of the past 5 years. Understanding MoE helps you:

Read modern papers (Mixtral, DeepSeek, GPT-4 which is rumored to be MoE)
Pick the right model to deploy for your hardware
Understand why a “large” model can still be fast at inference
Not be thrown by naming conventions like “8x7B” or “671B”

Hands-on for you:

Download Mixtral-8x7B-Instruct GGUF Q4_K_M (~26GB) or DeepSeek-V2-Lite (~15GB). Load it with llama.cpp. Measure inference speed.
Compare with a dense model of the same “active size” (Llama-13B Q4 for Mixtral). Quality vs speed.
Read the Mixtral implementation in HuggingFace Transformers (file modeling_mixtral.py) and find the class MixtralSparseMoeBlock. Read the routing part closely.
Read the DeepSeek-V3 technical report, especially the sections on fine-grained experts and auxiliary-loss-free balancing.

Next up is Long context: RoPE scaling, YaRN, ALiBi extrapolation. Llama-3 was trained with an 8k context, yet its long-context version handles 128k. The key lies in position encoding; stuffing in more tokens does not by itself make the model understand them.