Model Serving Infrastructure

01 LLM Serving

vLLM & TensorRT-LLM

Near-zero KV cache fragmentation with PagedAttention. Continuous batching, tensor parallelism, speculative decoding, and NVIDIA-optimized inference with TensorRT-LLM. Throughput benchmarks on Llama-3.

10 sections · Python · CUDA · vLLM · TRT-LLM

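The fragmentation point made for PagedAttention can be illustrated with simple arithmetic. The sketch below is not vLLM code; it only compares two allocation policies under assumed numbers. The function names, the sequence lengths, the 2048-token reservation, and the block size of 16 are all hypothetical choices for illustration.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Unused KV slots when every request reserves max_len contiguous slots."""
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size=16):
    """Unused KV slots when every request holds only whole fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size - n for n in seq_lens)

# Hypothetical token counts for four in-flight requests.
seq_lens = [37, 512, 130, 9]
max_len = 2048  # assumed per-request reservation for the contiguous policy

print(contiguous_waste(seq_lens, max_len))  # 7504 slots reserved but unused
print(paged_waste(seq_lens))                # 32 slots unused
```

With paging, the waste per request is bounded by one partially filled block (at most `block_size - 1` slots), regardless of how long the request could have grown.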

02 LLM Serving

Triton Inference Server

Production-grade model serving with NVIDIA Triton. config.pbtxt configuration, dynamic batching, ensemble pipelines, custom logic in the Python backend, and throughput optimization with perf_analyzer.

10 sections · Python · ONNX · Triton · gRPC


03 Framework

BentoML

Framework-agnostic model packaging and serving. Service API, adaptive batching, async runner dispatch, Bento builds with bentofile.yaml, Docker containerization, and a Kubernetes deployment pipeline.

10 sections · Python · Docker · BentoML · K8s


04 Framework

GPU Cluster Management

Multi-GPU LLM deployment with Ray Serve. Pipeline and tensor parallelism, model sharding strategies, autoscaling policies, spot instance management, and the NCCL communication backend.

10 sections · Python · CUDA · Ray · Megatron

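As a rough illustration of the tensor parallelism mentioned for this guide, the sketch below splits a linear layer's weight matrix column-wise across simulated devices and concatenates the partial outputs. It uses plain Python lists rather than Ray, Megatron, or NCCL; the helper names are hypothetical, and the final concatenation stands in for what an all-gather would do on real GPUs.

```python
def matmul(x, w):
    """Multiply x (m x k) by w (k x n), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split w column-wise into `parts` equal shards, one per simulated device."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across devices"
    step = n // parts
    return [[row[i:i + step] for row in w] for i in range(0, n, step)]

def column_parallel_linear(x, w, parts):
    """Compute x @ w by giving each simulated device one column shard of w."""
    shards = split_columns(w, parts)
    # Each partial product would run on its own GPU in a real deployment.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenate per-device outputs along the column axis (the all-gather step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_linear(x, w, parts=2) == matmul(x, w))  # True
```

Each device stores only `1 / parts` of the weights, which is the reason this layout helps when a model does not fit on one GPU.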

05 Operations

A/B Test & Canary Deploy

Safe model replacement in production. Shadow mode, statistical significance, canary rollout (5% → 100%), automatic rollback, and dynamic model routing with a multi-armed bandit.

10 sections · Python · FastAPI · Prometheus · Rollback

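One way to picture the canary rollout and automatic rollback listed above is the sketch below. It is a minimal illustration, not the guide's implementation: the intermediate stages (25% and 50%), the 1% error tolerance, and all function names are assumptions; only the 5% start and 100% end come from the description.

```python
import hashlib

# Assumed rollout schedule; the source only states 5% -> 100%.
STAGES = [5, 25, 50, 100]

def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route(request_id: str, canary_percent: int) -> str:
    """Send a fixed share of traffic to the canary, consistently per request ID."""
    return "canary" if bucket(request_id) < canary_percent else "stable"

def next_stage(current: int, canary_error_rate: float,
               stable_error_rate: float, tolerance: float = 0.01) -> int:
    """Advance the rollout, or return 0 (full rollback) if the canary is worse."""
    if canary_error_rate > stable_error_rate + tolerance:
        return 0
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]

print(route("user-42", 5))
print(next_stage(5, canary_error_rate=0.02, stable_error_rate=0.02))   # 25
print(next_stage(25, canary_error_rate=0.10, stable_error_rate=0.02))  # 0
```

Hashing the request ID keeps a given caller on the same model as the percentage grows, so a request routed to the canary at 5% stays on the canary at every later stage.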