The Holy Grail
130 chapters and 5 appendices covering ML foundations, training, inference internals, production serving, CUDA programming, model optimization, retrieval, agents, distributed systems, observability, build/deploy, and ML system design interviews. The ML track ramps from Beginner → Expert → PhD → Production; every chapter ends with curated references and practice problems.
- Chapters: 130
- Parts: 10
- Appendices: 5
- Questions: 530+
The interview-prep path
56 chapters. The focused track for ML systems interviews. Covers the concepts interviewers actually test, skipping material that's important for production but rarely asked about. Start here if you have 2–4 weeks. Read front to back if you have more time.
Also essential: Appendix E — 530+ interview questions organized by topic, difficulty, and role with "what they want to hear" rubrics.
Contents
Ten parts plus five appendices. Click any chapter to start reading.
ML Foundations
10 chapters · Chapters 1–10
- 1 The mathematical objects: tensors, shapes, broadcasting
- 2 The forward pass: a neural network as a pure function
- 3 The backward pass: autograd, gradients, the chain rule made mechanical
- 4 Loss functions and optimization, in just enough depth
- 5 Tokens, vocabularies, and the tokenizer is the bug
- 6 Attention from first principles
- 7 The transformer end to end
- 8 The decoding loop: autoregressive generation, sampling, controllability
- 9 Embeddings and rerankers: what they are, why they're separate, why they're cheap
- 10 The model card: reading model lineage adversarially
Training, Fine-Tuning, Alignment
10 chapters · Chapters 11–20
- 11 Pretraining at scale: data, compute, curriculum
- 12 Distributed training: DDP, FSDP, ZeRO, tensor/pipeline/sequence parallel
- 13 Mixed precision training: fp16, bf16, fp8, loss scaling
- 14 Tokenizer training: BPE and SentencePiece from scratch
- 15 Fine-tuning: full, LoRA, QLoRA, adapters, the PEFT family
- 16 SFT and instruction tuning: turning a base model into an assistant
- 17 RLHF, DPO, KTO, Constitutional AI
- 18 Distillation, pruning, and training-time compression
- 19 Synthetic data and self-improvement loops
- 20 Evaluation: the hardest unsolved problem in ML
Inference Internals & Production Serving
36 chapters · Chapters 21–56
- 21 Prefill vs decode: the two-phase nature of LLM inference
- 22 The KV cache: the single most important optimization in LLM inference
- 23 Batching: static, dynamic, continuous (Orca)
- 24 PagedAttention and vLLM as a virtual-memory system for KV cache
- 25 FlashAttention and the GPU memory hierarchy
- 26 Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant
- 27 Speculative decoding: Medusa, EAGLE, MTP
- 28 Tensor, pipeline, expert, and sequence parallelism for inference
- 29 Prefix caching, prompt caching, radix attention
- 30 Cost modeling for inference: tokens, GPUs, dollars
- 31 Latency budgets, tail latency, and the p99 problem
- 32 Multimodal: vision-language, audio, the tokenizer trick
- 33 The attention compression family: MHA, MQA, GQA, MLA
- 34 Mixture of Experts: routing, balancing, and the inference cost story
- 35 Long context: RoPE, YaRN, position interpolation, ring attention, sparse attention
- 36 Disaggregated prefill/decode: production reality with workload-dependent payoff
- 37 KV cache compression and offload: LMCache, RDMA, NVMe tiering
- 38 Hardware-aware kernel design: CUDA, CUTLASS, Triton, TVM
- 39 CUDA programming for ML engineers
- 40 Model optimization and compilation
- 41 State-space models: Mamba, Mamba-2, hybrid architectures
- 42 Test-time compute and reasoning models: o1, R1, MCTS-decoding
- 43 Structured generation: guided decoding, JSON mode, regex constraints, FSM masking
- 44 The serving framework landscape: vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLC, Triton
- 45 Inference servers and orchestration: KServe, BentoML, Seldon, Ray Serve, Triton Inference Server
- 46 Composing an inference platform: the umbrella pattern
- 47 KServe InferenceService anatomy: runtime, predictor, transformer, autoscaling
- 48 vLLM in production: every flag that matters
- 49 TEI for embeddings and rerankers in production
- 50 AI gateways: Envoy AI Gateway and the OpenAI-compatible front door
- 51 Autoscaling GPU inference with KEDA
- 52 The model cold-start problem and pre-cached weights
- 53 KV cache sharing across replicas: LMCache and friends
- 54 Warmup, readiness probes, and the model-isn't-ready-yet problem
- 55 Benchmarking inference: methodology, tools, gotchas
- 56 Content safety as inference: guardrails architecture
Information Retrieval & RAG
9 chapters · Chapters 57–65
- 57 Information retrieval primer: TF-IDF, BM25, why they still matter
- 58 Dense retrieval: embeddings, contrastive learning, MTEB
- 59 Vector index internals: HNSW, IVF, ScaNN, FAISS
- 60 Hybrid search and fusion
- 61 Chunking strategies
- 62 Reranking with cross-encoders
- 63 Query rewriting, HyDE, multi-query, query decomposition
- 64 RAG evaluation: Ragas, LLM-as-judge, golden sets
- 65 Designing a RAG system end to end
Agents, Tool Use, Workflow Orchestration
7 chapters · Chapters 66–72
- 66 Tool calling and function calling: the wire protocols
- 67 The agent loop: ReAct, plan-and-execute, reflection
- 68 Multi-agent patterns and when they're worth it
- 69 MCP: the Model Context Protocol in depth
- 70 Workflow vs agent: when to use durable execution vs an LLM loop
- 71 Production agent failure modes
- 72 Designing an agent orchestration layer
Distributed Systems & Request Lifecycle
12 chapters · Chapters 73–84
- 73 The unified gateway pattern: one API in front of many backends
- 74 AuthN vs AuthZ: the distinction interviewers test on
- 75 Identity propagation across service boundaries
- 76 Rate limiting algorithms: token bucket, leaky bucket, sliding window, GCRA
- 77 Backpressure, flow control, and queue theory
- 78 Idempotency keys and exactly-once semantics
- 79 Sync, async, SSE, WebSocket, batch: five execution modes
- 80 Workflow orchestration: Temporal vs Airflow vs Step Functions vs Cadence
- 81 Inter-service trust patterns
- 82 The operations service pattern: why job lifecycle is its own SoR
- 83 Metering and billing pipelines
- 84 Telemetry firehoses: Kafka as the backbone
The Data Plane
7 chapters · Chapters 85–91
- 85 Object storage primer: the S3 model
- 86 Document stores: MongoDB and DynamoDB compared
- 87 Time-series databases: TimescaleDB, Prometheus TSDB, InfluxDB
- 88 Streaming data: Kafka, Strimzi, the broader log model
- 89 Caching patterns: Redis, cache-aside, write-through, write-behind
- 90 The lakehouse story: Parquet, Iceberg, Delta, Hudi
- 91 Feature stores: Feast, Tecton, the offline/online split
Observability, Reliability, Incidents
9 chapters · Chapters 92–100
- 92 The four golden signals, RED, USE, LETS
- 93 Metrics: Prometheus internals
- 94 Logs: Loki, structured logging, sampling, retention economics
- 95 Distributed tracing: OpenTelemetry, Jaeger, Tempo
- 96 Continuous profiling: Pyroscope, Parca
- 97 SLI, SLO, SLA, error budgets
- 98 Canary patterns: traffic-shifted, statistical, ML model canaries
- 99 Incident management: postmortems, blameless culture
- 100 ORR (Operational Readiness Review): what production-readiness actually means
Build, Deploy, Operate
13 chapters · Chapters 101–113
- 101 Build systems for monorepos: Bazel, Pants, Buck, Nx
- 102 Container fundamentals: namespaces, cgroups, OCI spec
- 103 Dependency injection patterns: Wire, manual, runtime DI
- 104 API contract design: OpenAPI vs gRPC vs GraphQL vs Connect
- 105 Python tooling: uv, ruff, black
- 106 The OCI image lifecycle: registries, digest pinning, the digest-update pattern
- 107 GitOps philosophy: ArgoCD, Flux, the App-of-Apps pattern
- 108 Helm vs Kustomize vs CDK8s
- 109 Multi-cluster, multi-region, multi-cell architecture
- 110 IaC: Terraform, Pulumi, CDK
- 111 Secrets management: 1Password, Vault, External Secrets Operator
- 112 Edge ingress: Cloudflare Tunnels, Ingress controllers, service meshes
- 113 CI as a system: path filters, per-service builds, coverage gates
ML System Design Interview Playbook
17 chapters · Chapters 114–130
- 114 The interview framework: clarify, estimate, design, drill, ops
- 115 Capacity planning math: the back-of-envelope kit
- 116 Design a chatbot for 1M users
- 117 Design a RAG system over 10TB of documents
- 118 Design a content moderation pipeline
- 119 Design a real-time recommendation system
- 120 Design a multi-tenant model serving platform
- 121 Design a multi-tenant fine-tuning service
- 122 The vocabulary that interviewers respect
- 123 Common mistakes and how to recover mid-interview
- 124 The coding interview: twenty ML systems algorithms
- 125 Design: the upstream platforms
- 126 Design: the production infrastructure
- 127 Design: the frontier scenarios
- 128 Behavioral interviews and the levels ladder
- 129 The day of the interview
- 130 Company-specific prep and mock interview transcripts