LLM from Scratch — nanoGPT & Pretraining

00 Why From Scratch? The Anatomy of GPT-2

Building an LLM from scratch is one of the most direct ways to understand why each architectural choice works. GPT-2 is a good starting point for that journey.

GPT-2 (Radford et al., 2019) is a Transformer trained as a causal language model, that is, for next-token prediction. Larger models such as GPT-4 are generally understood to rest on the same basic principles; the differences lie mainly in scale and in the fine-tuning stages. Implementing the model from scratch makes the answers to these questions concrete: Why does attention work? Why does LayerNorm matter? What does weight tying mean? What happens without residual connections?

GPT-2's components: (1) A token embedding table. (2) A learned positional embedding table. (3) N Transformer blocks (attention + FFN + LayerNorm + residual). (4) A final LayerNorm. (5) An LM head (a linear projection from the embedding dimension to the vocabulary). GPT-2 small: 12 layers, 12 heads, hidden size 768, 124M parameters.

Model	Layers	Heads	d_model	Parameters
GPT-2 small	12	12	768	124M
GPT-2 medium	24	16	1024	355M
GPT-2 large	36	20	1280	774M
GPT-2 XL	48	25	1600	1.5B
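
The parameter counts in the table can be reproduced from the layer shapes alone. The sketch below is a rough accounting, assuming the GPT-2 vocabulary of 50257 tokens, a context length of 1024, a bias on every linear layer, and an LM head tied to the token embedding. The function name `gpt2_param_count` is illustrative and does not come from any library.

```python
def gpt2_param_count(n_layer, d_model, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2 style model with a tied LM head."""
    tok_emb = vocab_size * d_model
    pos_emb = block_size * d_model
    layer_norm = 2 * d_model                       # scale + shift
    attn = d_model * 3 * d_model + 3 * d_model     # fused Q, K, V projection
    attn += d_model * d_model + d_model            # attention output projection
    ffn = d_model * 4 * d_model + 4 * d_model      # expand to 4 * d_model
    ffn += 4 * d_model * d_model + d_model         # project back to d_model
    block = 2 * layer_norm + attn + ffn
    final_ln = 2 * d_model
    return tok_emb + pos_emb + n_layer * block + final_ln

for name, n_layer, d_model in [("small", 12, 768), ("medium", 24, 1024),
                               ("large", 36, 1280), ("XL", 48, 1600)]:
    print(f"GPT-2 {name}: {gpt2_param_count(n_layer, d_model) / 1e6:.1f}M")
```

Each block contributes 12 × d_model² + 13 × d_model parameters, so depth and width dominate the total once the model grows past the small configuration.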

01 Bigram Language Model: Character-Level Prediction

The bigram, the simplest language model, assumes that each character depends only on the previous one, and it builds the intuition needed to understand attention.

A bigram model is a character-level frequency statistic: P(c_t | c_{t-1}). In the deep learning version this table is learned through an embedding lookup. The training metric is the cross-entropy loss: with a well-scaled random initialization it starts near ln(vocab_size), which is about 4.17 for a vocabulary of 65 characters such as TinyShakespeare's. The value falls during training; the lower it gets, the better the model.
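
Before learning the table with gradient descent, it can help to see the same statistic computed by counting. The following is a minimal sketch on a toy string; the helper names `bigram_probs` and `uniform_loss` are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def bigram_probs(text):
    """Estimate P(next_char | prev_char) from raw counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {ch: n / sum(ctr.values()) for ch, n in ctr.items()}
        for prev, ctr in counts.items()
    }

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a model that rates every token equally likely."""
    return math.log(vocab_size)

probs = bigram_probs("abac")
print(probs["a"])        # 'a' is followed once by 'b' and once by 'c'
print(uniform_loss(65))  # baseline for a 65-character vocabulary
```

A trained neural bigram model should approach the loss of this count-based table, and both should sit well below the uniform baseline.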

bigram.py

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Data preparation ─────────────────────────────────────────
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# ── Batch generator ──────────────────────────────────────────
block_size = 8
batch_size = 32

def get_batch(split):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x, y

# ── Bigram model ─────────────────────────────────────────────
class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding(idx)  # (B, T, C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C),
                                 targets.view(B*T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            logits = logits[:, -1, :]  # last time step only: (B, C)
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx

model = BigramLM(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(3000):
    xb, yb = get_batch("train")
    _, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        print(f"step {step}: loss={loss.item():.4f}")

02 Self-Attention From Scratch: Q, K, V, Scaled Dot Product

Self-attention lets every token relate to all earlier tokens, which removes the bigram's one-token context limit.

The attention mechanism: Attention(Q,K,V) = softmax(QK^T / √d_k) × V. Query (Q): "what am I looking for?", Key (K): "what do I offer?", Value (V): "what do I contain?". The scaling factor √d_k keeps the dot products from growing with the dimension; without it the softmax saturates and its gradients become vanishingly small. Causal masking (a lower-triangular mask) prevents a position from seeing future tokens.
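
The formula can be traced by hand on a tiny sequence. The sketch below is a dependency-free illustration for a single head and a single sequence, using plain Python lists instead of tensors. The function name `causal_attention` and the toy inputs are assumptions made for this example.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention for one sequence.

    q, k: lists of T vectors of length d_k; v: list of T value vectors.
    Returns (outputs, weights); weights[t][s] is how much position t
    attends to position s.
    """
    d_k = len(k[0])
    outputs, weights = [], []
    for t in range(len(q)):
        # Scaled scores against positions 0..t only (the causal mask).
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        w = [e / z for e in exps] + [0.0] * (len(q) - t - 1)
        weights.append(w)
        outputs.append([sum(w[s] * v[s][j] for s in range(len(v)))
                        for j in range(len(v[0]))])
    return outputs, weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = causal_attention(q, k, v)
print(out[0])  # position 0 can only see itself, so this equals v[0]
```

Each row of `w` sums to 1 and is zero to the right of the diagonal, which is what the triangular mask followed by the softmax produces in the tensor version.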

self_attention.py

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Mathematical intuition: weighted averaging ───────────────
# Naive approach: each token takes the mean of all tokens up to and including itself
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Cumulative mean (the baseline before attention)
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t+1].mean(dim=0)

# Vectorized: with a lower-triangular mask
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)
xbow_fast = wei @ x   # (T,T) @ (B,T,C) → (B,T,C) via broadcasting
print(torch.allclose(xbow, xbow_fast, atol=1e-6))  # expected True, up to float32 rounding

# ── Actual self-attention ─────────────────────────────────────
class SelfAttentionHead(nn.Module):
    def __init__(self, d_model, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key   = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        # Causal mask: not a trainable parameter, so store it with register_buffer.
        # block_size is passed in rather than read from a global.
        self.register_buffer("tril",
            torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, hs)
        q = self.query(x)  # (B, T, hs)
        # Scaled dot product
        scale = k.shape[-1] ** -0.5
        wei = q @ k.transpose(-2, -1) * scale  # (B, T, T)
        # Causal masking: set future positions to -inf
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)   # (B, T, hs)
        out = wei @ v         # (B, T, hs)
        return out

03 Multi-head Attention & Causal Mask

Multi-head attention builds a richer representation in which different heads can capture different kinds of relationships.

Multi-head attention runs H independent attention heads in parallel and concatenates their outputs: MHA(Q,K,V) = Concat(head_1,...,head_H) W^O. Each head has size d_head = d_model / H, so the total compute stays about the same as a single head of width d_model. Flash Attention (Dao et al. 2022) computes attention without materializing the full T×T attention matrix in memory, which yields large savings in both memory and speed.
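
The claim that splitting into heads does not change the cost can be checked with simple arithmetic. This is a rough sketch that counts only the two large matrix products (QK^T and weights × V) and ignores the softmax and the linear projections. The function name `attention_matmul_flops` is hypothetical.

```python
def attention_matmul_flops(seq_len, d_model, n_heads):
    """Approximate FLOPs for QK^T and (weights @ V) in one attention layer."""
    assert d_model % n_heads == 0
    d_head = d_model // n_heads
    scores = 2 * seq_len * seq_len * d_head   # QK^T for one head
    mix = 2 * seq_len * seq_len * d_head      # weights @ V for one head
    return n_heads * (scores + mix)

for h in (1, 4, 12):
    print(h, attention_matmul_flops(seq_len=1024, d_model=768, n_heads=h))
```

The totals match for every head count because n_heads × d_head is always d_model; only the number of separate T×T score matrices grows with H.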

multihead_attention.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, block_size, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head  = d_model // n_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj  = nn.Linear(d_model, d_model, bias=False)
        self.dropout   = nn.Dropout(dropout)

        self.register_buffer(
            "causal_mask",
            torch.tril(torch.ones(block_size, block_size))
                  .view(1, 1, block_size, block_size)
        )

    def forward(self, x):
        B, T, C = x.shape
        H, D = self.n_heads, self.d_head

        qkv = self.qkv_proj(x)                       # (B, T, 3C)
        q, k, v = qkv.split(C, dim=-1)              # each is (B, T, C)
        # Split into heads: (B, H, T, D)
        q = q.view(B, T, H, D).transpose(1, 2)
        k = k.view(B, T, H, D).transpose(1, 2)
        v = v.view(B, T, H, D).transpose(1, 2)

        # Scaled dot product + causal mask
        scale = D ** -0.5
        att = (q @ k.transpose(-2, -1)) * scale   # (B,H,T,T)
        att = att.masked_fill(self.causal_mask[:, :, :T, :T] == 0,
                               float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        # Merge heads: (B,H,T,D) → (B,T,C)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)

# ── Flash Attention via fused SDPA (PyTorch 2.0+) ────────────
# Computes the same result and can dispatch to a memory-efficient
# FlashAttention kernel when the hardware and dtypes allow it:
# out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

04 Transformer Block — Attention + FFN + LayerNorm + Residual

The Transformer block is the basic building unit: it combines attention and a feed-forward network with residual connections and pre-normalization.

The GPT block uses the Pre-LN design (unlike the original Transformer, which is Post-LN): normalization is applied before each sublayer, which gives more stable gradient flow. Residual (skip) connections give gradients a direct path to the early layers; without them, deep networks are very hard to train. The FFN typically uses a 4× expansion factor, d_model → 4×d_model → d_model, with a GELU activation. LayerNorm normalizes each token's feature vector independently of the rest of the batch (unlike BatchNorm).
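
The two ideas in this paragraph, per-vector normalization and the Pre-LN residual update, fit in a few lines of plain Python. This is a simplified sketch: it leaves out LayerNorm's learned scale and shift, and the names `layer_norm` and `pre_ln_residual` are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.

    The learned scale and shift of a real LayerNorm are omitted here.
    """
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def pre_ln_residual(x, sublayer):
    """Pre-LN update: x + sublayer(LN(x))."""
    y = sublayer(layer_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
# A sublayer that outputs zeros leaves x untouched: the identity path.
same = pre_ln_residual(x, lambda h: [0.0] * len(h))
print(normed, same)
```

Because the input x is added back unchanged, a block whose sublayers start near zero behaves like the identity, which is one common explanation for why deep Pre-LN stacks are easier to optimize.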

transformer_block.py

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x): return self.net(x)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, block_size, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads, block_size, dropout)
        self.ffn  = FeedForward(d_model, dropout)

    def forward(self, x):
        # Pre-LN residual: x + sublayer(LN(x))
        x = x + self.attn(self.ln1(x))  # Attention + residual
        x = x + self.ffn(self.ln2(x))   # FFN + residual
        return x

# ── GPT model: all the blocks together ───────────────────────
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers,
                 block_size, dropout=0.1):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)
        self.drop    = nn.Dropout(dropout)
        self.blocks  = nn.Sequential(*[
            TransformerBlock(d_model, n_heads, block_size, dropout)
            for _ in range(n_layers)
        ])
        self.ln_f   = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: explained in section 08
        self.tok_emb.weight = self.lm_head.weight

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.drop(tok + pos)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            # Flatten batch and time so cross_entropy sees (N, C) and (N,)
            loss = nn.functional.cross_entropy(
                logits.view(B*T, C), targets.view(B*T))
        return logits, loss

05 Tokenization: The BPE Algorithm & tiktoken

BPE sits between character-level and word-level tokenization: it keeps sub-word information while still respecting word boundaries. tiktoken is the BPE tokenizer library used with OpenAI's GPT models.

BPE (Byte Pair Encoding) was proposed in 1994 as a data compression algorithm and was later adopted in NLP for tokenization. Steps: (1) Start with every character as its own token. (2) Merge the most frequent adjacent pair into a new token. (3) Repeat until the desired vocab_size is reached. GPT-4 models use the cl100k_base tokenizer, with roughly 100K tokens. OpenAI's tiktoken library implements the core in Rust and is reported to be 3-6x faster than comparable Python tokenizers.

bpe.py

# ── Minimal BPE implementation ───────────────────────────────
import re
from collections import Counter
from typing import List, Dict, Tuple

def get_pair_counts(vocab: Dict[str, int]) -> Counter:
    # Count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_pair(vocab: Dict, pair: Tuple) -> Dict:
    # Match the pair only at whole-symbol boundaries; a plain str.replace
    # could also merge the tail of one symbol with the next symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    replacement = "".join(pair)
    new_vocab = {}
    for word, freq in vocab.items():
        new_word = pattern.sub(lambda m: replacement, word)
        new_vocab[new_word] = freq
    return new_vocab

def train_bpe(corpus: str, num_merges: int = 100):
    # Split each word into characters and append an end-of-word marker
    words = corpus.split()
    vocab = Counter()
    for word in words:
        vocab[" ".join(list(word)) + " </w>"] += 1

    merges = []
    for i in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs: break
        best = pairs.most_common(1)[0][0]
        vocab = merge_pair(vocab, best)
        merges.append(best)
        if i % 10 == 0:
            print(f"Merge {i}: {best[0]!r} + {best[1]!r}")

    return vocab, merges

# ── Using tiktoken ───────────────────────────────────────────
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4
text = "Büyük dil modelleri her yerde karşımıza çıkıyor."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
print(f"Decoded: {enc.decode(tokens)!r}")

# Inspect the individual tokens
for t in tokens:
    print(f"  {t:6d}: {enc.decode([t])!r}")
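
Training produces a list of merges, but tokenizing new text also requires replaying them. The sketch below shows one simple way to do that for a single word, assuming an end-of-word marker `</w>` as in common BPE write-ups. The function name `apply_merges` and the example merge list are hypothetical and not part of tiktoken.

```python
def apply_merges(word, merges, end_marker="</w>"):
    """Tokenize one word by replaying learned merges in training order."""
    symbols = list(word) + [end_marker]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge only when both neighbours match the learned pair exactly.
            if (i + 1 < len(symbols)
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in the order a trainer might have learned it.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_merges("lower", merges))
```

Production tokenizers do the same job with priority lookups rather than a full pass per merge, but the result follows the same rule: earlier merges win.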

06 Positional Encoding: Learned vs RoPE

The Transformer's attention mechanism is insensitive to token order on its own; positional encoding supplies that information to the model. The two options covered here are learned embeddings and RoPE (Rotary Position Embedding), the common modern choice.

Learned PE is the approach GPT-2 uses: a separate embedding table is learned, indexed by position. Advantage: simple. Disadvantage: it cannot represent positions beyond the block_size seen in training. Sinusoidal PE (the original Transformer) is formula-based and can in principle extend to longer sequences; the original paper reported nearly identical quality to learned embeddings. RoPE (Su et al. 2021) rotates the query and key vectors by an angle proportional to their position, so the attention score depends only on the relative offset between them; it tends to extend to long contexts better and is used by Llama and many other recent LLMs.
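
The relative-position property of RoPE can be checked on a single 2-D feature pair. The sketch below treats the pair as a complex number and uses one fixed rotation frequency `theta`, which is an assumption for illustration; a real implementation uses a different frequency for each pair of dimensions.

```python
import cmath

def rotate(pair, position, theta=1.0):
    """Rotate one 2-D feature pair, stored as a complex number, by position * theta."""
    return pair * cmath.exp(1j * position * theta)

def score(q, k):
    """Dot product of two 2-D vectors stored as complex numbers."""
    return (q * k.conjugate()).real

q, k = complex(0.3, -1.2), complex(0.8, 0.5)
near = score(rotate(q, 5), rotate(k, 2))       # positions 5 and 2
far = score(rotate(q, 105), rotate(k, 102))    # same offset, shifted by 100
print(near, far)  # equal up to floating-point error
```

Both scores depend only on the offset of 3 between the query and key positions, which is why the same rotation scheme can be applied at any absolute position.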

rope.py

import torch
import torch.nn as nn

# ── RoPE implementation ───────────────────────────────────────
def precompute_freqs_cis(dim: int, max_seq: int = 2048,
                          theta: float = 10000.0) -> torch.Tensor:
    """
    Precompute the complex rotation factors for every position.
    dim: head_dim (must be even)
    """
    freqs = 1.0 / (theta ** (
        torch.arange(0, dim, 2).float() / dim
    ))
    t = torch.arange(max_seq, dtype=torch.float32)
    freqs = torch.outer(t, freqs)   # (max_seq, dim/2)
    return torch.polar(torch.ones_like(freqs), freqs)  # complex64

def apply_rotary_emb(x: torch.Tensor,
                      freqs_cis: torch.Tensor) -> torch.Tensor:
    """
    x: (B, T, n_heads, head_dim)
    """
    xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs_cis[:x.shape[1]].unsqueeze(0).unsqueeze(2)
    xr = torch.view_as_real(xc * freqs).flatten(3)
    return xr.type_as(x)

# ── Usage ────────────────────────────────────────────────────
head_dim = 64
freqs_cis = precompute_freqs_cis(head_dim, max_seq=2048)

# Apply RoPE to the query and key
B, T, H, D = 2, 128, 8, head_dim
q = torch.randn(B, T, H, D)
k = torch.randn(B, T, H, D)
q_rot = apply_rotary_emb(q, freqs_cis)
k_rot = apply_rotary_emb(k, freqs_cis)

# ── Learned PE (GPT-2 style, for reference) ──────────────────
class LearnedPE(nn.Module):
    def __init__(self, block_size, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(block_size, d_model)

    def forward(self, seq_len):
        # Build the indices on the same device as the embedding table
        positions = torch.arange(seq_len, device=self.pos_emb.weight.device)
        return self.pos_emb(positions)  # (T, d_model)

07 Training Loop — Cross-Entropy Loss, AdamW, Grad Clipping

The three critical ingredients of LLM training: the cross-entropy language-modeling loss, the AdamW optimizer, and gradient clipping for stable training.

Cross-entropy language-modeling loss: the negative log-probability the model assigns to the correct next token, averaged over positions. Perplexity = e^loss; lower is better. AdamW applies weight decay directly to the parameters instead of adding it to the gradient, so the decay is decoupled from Adam's adaptive scaling; this differs from plain L2 regularization under Adam. Gradient clipping (max_norm=1.0) limits the damage from gradient spikes, which matters especially in LLM training. Learning rate schedule: linear warmup followed by cosine decay is what nanoGPT uses.
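
Two of these quantities are easy to compute by hand. The sketch below converts a loss to perplexity and rescales a flat list of gradients by their global norm, following the same idea as norm-based clipping. The function names are illustrative, and the small 1e-6 in the denominator is an assumption added to avoid division by zero.

```python
import math

def perplexity(loss):
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads]

print(perplexity(math.log(65)))          # a uniform guess over 65 tokens
print(clip_by_global_norm([3.0, 4.0]))   # norm 5 is scaled down to about 1
print(clip_by_global_norm([0.3, 0.4]))   # norm 0.5 is left unchanged
```

Clipping preserves the direction of the update and only shortens it, which is why it can be applied on every step without distorting well-behaved gradients.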

training_loop.py

import math
import torch

# ── Model configuration ──────────────────────────────────────
config = {
    "vocab_size":  vocab_size,
    "d_model":     384,
    "n_heads":     6,
    "n_layers":    6,
    "block_size":  256,
    "dropout":     0.2,
}
model = GPT(**config)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
print(f"Parametre: {sum(p.numel() for p in model.parameters())/1e6:.2f}M")

# ── AdamW: weight decay only on tensors with 2+ dimensions ───
decay_params    = [p for n, p in model.named_parameters()
                    if p.requires_grad and p.dim() >= 2]
no_decay_params = [p for n, p in model.named_parameters()
                    if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW([
    {"params": decay_params,    "weight_decay": 0.1},
    {"params": no_decay_params, "weight_decay": 0.0},
], lr=3e-4, betas=(0.9, 0.95))

# ── Cosine LR schedule ────────────────────────────────────────
max_iters  = 5000
warmup_its = 100
min_lr     = 3e-5
max_lr     = 3e-4

def get_lr(it: int) -> float:
    # Linear warmup; the +1 avoids a learning rate of exactly 0 at step 0
    if it < warmup_its:
        return max_lr * (it + 1) / warmup_its
    # Cosine decay from max_lr down to min_lr
    t = (it - warmup_its) / (max_iters - warmup_its)
    coeff = 0.5 * (1.0 + math.cos(math.pi * t))
    return min_lr + coeff * (max_lr - min_lr)

# ── Training loop ────────────────────────────────────────────
for it in range(max_iters):
    lr = get_lr(it)
    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

    model.train()
    xb, yb = get_batch("train")
    xb, yb = xb.to(device), yb.to(device)

    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()

    # Gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

    if it % 200 == 0:
        model.eval()
        with torch.no_grad():
            xv, yv = get_batch("val")
            _, val_loss = model(xv.to(device), yv.to(device))
        print(f"iter {it:5d} | lr={lr:.2e} | train={loss.item():.4f} | val={val_loss.item():.4f}")

08 Weight Tying: Sharing the Embedding and LM Head

Sharing the weights of the token embedding table and the language-model head reduces the parameter count and also has a theoretical motivation.

Weight tying was proposed by Press & Wolf (2017): the input embedding and the output projection (LM head) share the same weight matrix. The rationale is that both matrices should represent tokens in a similar vector space. GPT-2 and many later models tie these weights, although not all do; the original LLaMA releases, for example, keep the two matrices separate. Effect: about 39M parameters saved for GPT-2 small (vocab_size × d_model = 50257 × 768).
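
The saving quoted above is a single multiplication. A quick check, using only the GPT-2 small sizes from the text:

```python
vocab_size, d_model = 50257, 768

shared = vocab_size * d_model      # one matrix used for both embedding and LM head
untied = 2 * shared                # what two separate matrices would cost
print(f"saved by tying: {(untied - shared) / 1e6:.1f}M parameters")
```

For GPT-2 small that matrix is close to a third of the 124M total, so tying matters most for small models with large vocabularies.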

weight_tying.py

import torch
import torch.nn as nn

class GPTWithTying(nn.Module):
    def __init__(self, vocab_size, d_model, **kwargs):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # ... diğer katmanlar ...
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying: share the same tensor
        self.tok_emb.weight = self.lm_head.weight
        # Gradients from both uses now accumulate in this single tensor

    def count_params(self):
        # parameters() already skips duplicate tensors, so count per module
        # to see what the total would be if the weights were not shared.
        total = sum(p.numel() for m in self.modules()
                    for p in m.parameters(recurse=False))
        unique = sum(p.numel() for p in self.parameters())
        return total, unique

# ── Verification ─────────────────────────────────────────────
m = GPTWithTying(50257, 768)
total, unique = m.count_params()
tied_savings = total - unique
print(f"Counted per module: {total/1e6:.1f}M")
print(f"Actual unique: {unique/1e6:.1f}M")
print(f"Saved by tying: {tied_savings/1e6:.1f}M")

# Confirm that both point to the same object
print(m.tok_emb.weight is m.lm_head.weight)  # True

# ── Weight initialization ────────────────────────────────────
def init_weights(module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

m.apply(init_weights)

09 nanoGPT Line by Line: Karpathy's Implementation

Andrej Karpathy's nanoGPT provides a complete GPT implementation in roughly 300 lines of model code; nearly every line makes a concept concrete.

nanoGPT (github.com/karpathy/nanoGPT) implements GPT-2 training with minimal code. The CausalSelfAttention class in model.py is written for readability. train.py supports gradient accumulation, mixed precision (bfloat16), DDP (DistributedDataParallel), and model checkpointing. It is also possible to load the GPT-2 weights and fine-tune them.

nanogpt_key_parts.py

# Key lines from nanoGPT: an annotated summary
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304  # GPT-2 has 50257; padded up to a multiple of 64
    n_layer:    int = 12
    n_head:     int = 12
    n_embd:     int = 768
    dropout:    float = 0.0
    bias:       bool = True

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # A single matrix for Q, K, V: more efficient
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
        # Use the fused attention kernel when available (PyTorch 2.0+)
        self.flash = hasattr(torch.nn.functional, "scaled_dot_product_attention")
        if not self.flash:
            self.register_buffer("bias",
                torch.tril(torch.ones(config.block_size, config.block_size))
                      .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        if self.flash:
            # Fused kernel: saves memory and time, and may use FlashAttention
            y = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True
            )
        else:
            # Manual fallback: scaled dot product, causal mask, softmax, dropout
            att = (q @ k.transpose(-2, -1)) * (k.size(-1) ** -0.5)
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float("-inf"))
            att = F.softmax(att, dim=-1)
            att = F.dropout(att, p=self.dropout, training=self.training)
            y = att @ v

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

10 Scaling Experiments: Hyperparameter Sweep & TinyShakespeare

Observe scaling laws in practice by running a systematic hyperparameter search on TinyShakespeare.

Chinchilla scaling laws (Hoffmann et al. 2022): for a fixed compute budget, the optimal parameter count N and training token count D grow together, roughly as D ≈ 20 × N. In practice: about 20B training tokens is optimal for a 1B-parameter model. TinyShakespeare (~1M tokens) lets you simulate this kind of experiment at small scale. Sweep parameters: d_model, n_layers, n_heads, and learning rate, with block_size held fixed at 256.
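
The rule of thumb is simple enough to turn into two helper functions. This is a back-of-the-envelope sketch: the factor of 20 is the approximation quoted above, not an exact constant, and both function names are made up for this example.

```python
def chinchilla_tokens(n_params, tokens_per_param=20.0):
    """Rule-of-thumb compute-optimal token count: D ≈ 20 × N."""
    return tokens_per_param * n_params

def chinchilla_params(n_tokens, tokens_per_param=20.0):
    """Inverse view: the model size that a fixed dataset would suit."""
    return n_tokens / tokens_per_param

print(f"{chinchilla_tokens(1e9):.0e} tokens for a 1B-parameter model")
print(f"{chinchilla_params(1e6):.0f} parameters for a 1M-token dataset")
```

By this estimate a 1M-token dataset suits a model of only about 50K parameters, so the multi-million-parameter models in a TinyShakespeare sweep are data-limited and their validation loss is worth watching for overfitting.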

scaling_sweep.py

import itertools
import json
from pathlib import Path
from typing import Optional

import torch

# ── Sweep configurations ─────────────────────────────────────
SWEEP = {
    "d_model":    [128, 256, 384],
    "n_layers":   [4, 6, 8],
    "n_heads":    [4, 8],     # d_model must be divisible by n_heads
    "max_lr":     [1e-3, 3e-3],
}

def run_experiment(config: dict, max_iters: int = 2000) -> Optional[dict]:
    d_model  = config["d_model"]
    n_heads  = config["n_heads"]
    if d_model % n_heads != 0:
        return None  # invalid combination

    model = GPT(
        vocab_size=vocab_size,
        d_model=d_model,
        n_heads=n_heads,
        n_layers=config["n_layers"],
        block_size=256,
    ).to(device)

    n_params = sum(p.numel() for p in model.parameters()) / 1e6

    optimizer = torch.optim.AdamW(
        model.parameters(), lr=config["max_lr"],
        weight_decay=0.1, betas=(0.9, 0.95)
    )

    best_val = float("inf")
    for it in range(max_iters):
        xb, yb = get_batch("train")
        _, loss = model(xb.to(device), yb.to(device))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        if it == max_iters - 1:
            model.eval()
            with torch.no_grad():
                xv, yv = get_batch("val")
                _, val_loss = model(xv.to(device), yv.to(device))
                best_val = float(val_loss)

    return {**config, "n_params_M": n_params, "val_loss": best_val}

# ── Run all combinations ─────────────────────────────────────
results = []
keys = list(SWEEP.keys())
for combo in itertools.product(*SWEEP.values()):
    cfg = dict(zip(keys, combo))
    result = run_experiment(cfg)
    if result:
        results.append(result)
        print(result)

results.sort(key=lambda x: x["val_loss"])
print("\n=== Top 3 configurations ===")
for r in results[:3]:
    print(f"val_loss={r['val_loss']:.4f} | params={r['n_params_M']:.1f}M | {r}")

with open("sweep_results.json", "w") as f:
    json.dump(results, f, indent=2)

OBSERVATIONS

Typical observations from the TinyShakespeare sweep: the combination d_model=256, n_layers=6, n_heads=8, lr=1e-3 usually lands on or near the Pareto front. Deeper tends to beat wider (adding layers usually helps more than increasing d_model). A very high learning rate (3e-3) is unstable for small models. nanoGPT-small: ~10M parameters, about 1.5 validation loss after 2000 iterations.