The Holy Grail
130 chapters and 5 appendices covering ML foundations, training, inference internals, production serving, CUDA programming, model optimization, retrieval, agents, distributed systems, observability, build/deploy, and ML system design interviews. The ML track ramps from Beginner → Expert → PhD → Production; every chapter ends with curated references and practice problems.
- Chapters: 130
- Parts: 10
- Appendices: 5
- Questions: 530+
The interview-prep path
56 chapters. The focused track for ML systems interviews. Covers the concepts interviewers actually test, skipping material that's important for production but rarely asked about. Start here if you have 2–4 weeks. Read front to back if you have more time.
Also essential: Appendix E — 530+ interview questions organized by topic, difficulty, and role with "what they want to hear" rubrics.
Contents
Ten parts plus five appendices. Click any chapter to start reading.
ML Foundations
10 chapters · Chapters 1–10
- 1 The mathematical objects: tensors, shapes, broadcasting
- 2 The forward pass: a neural network as a pure function
- 3 The backward pass: autograd, gradients, the chain rule made mechanical
- 4 Loss functions and optimization, in just enough depth
- 5 Tokens, vocabularies, and the tokenizer is the bug
- 6 Attention from first principles
- 7 The transformer end to end
- 8 The decoding loop: autoregressive generation, sampling, controllability
- 9 Embeddings and rerankers: what they are, why they're separate, why they're cheap
- 10 The model card: reading model lineage adversarially
Training, Fine-Tuning, Alignment
10 chapters · Chapters 11–20
- 11 Pretraining at scale: data, compute, curriculum
- 12 Distributed training: DDP, FSDP, ZeRO, tensor/pipeline/sequence parallel
- 13 Mixed precision training: fp16, bf16, fp8, loss scaling
- 14 Tokenizer training: BPE and SentencePiece from scratch
- 15 Fine-tuning: full, LoRA, QLoRA, adapters, the PEFT family
- 16 SFT and instruction tuning: turning a base model into an assistant
- 17 RLHF, DPO, KTO, Constitutional AI
- 18 Distillation, pruning, and training-time compression
- 19 Synthetic data and self-improvement loops
- 20 Evaluation: the hardest unsolved problem in ML
Inference Internals & Production Serving
36 chapters · Chapters 21–56
- 21 Prefill vs decode: the two-phase nature of LLM inference
- 22 The KV cache: the single most important optimization in LLM inference
- 23 Batching: static, dynamic, continuous (Orca)
- 24 PagedAttention and vLLM as a virtual-memory system for KV cache
- 25 FlashAttention and the GPU memory hierarchy
- 26 Quantization: INT8, INT4, FP8, AWQ, GPTQ, SmoothQuant
- 27 Speculative decoding: Medusa, EAGLE, MTP
- 28 Tensor, pipeline, expert, and sequence parallelism for inference
- 29 Prefix caching, prompt caching, radix attention
- 30 Cost modeling for inference: tokens, GPUs, dollars
- 31 Latency budgets, tail latency, and the p99 problem
- 32 Multimodal: vision-language, audio, the tokenizer trick
- 33 The attention compression family: MHA, MQA, GQA, MLA
- 34 Mixture of Experts: routing, balancing, and the inference cost story
- 35 Long context: RoPE, YaRN, position interpolation, ring attention, sparse attention
- 36 Disaggregated prefill/decode: production reality with workload-dependent payoff
- 37 KV cache compression and offload: LMCache, RDMA, NVMe tiering
- 38 Hardware-aware kernel design: CUDA, CUTLASS, Triton, TVM
- 39 CUDA programming for ML engineers
- 40 Model optimization and compilation
- 41 State-space models: Mamba, Mamba-2, hybrid architectures
- 42 Test-time compute and reasoning models: o1, R1, MCTS-decoding
- 43 Structured generation: guided decoding, JSON mode, regex constraints, FSM masking
- 44 The serving framework landscape: vLLM, SGLang, TensorRT-LLM, TGI, llama.cpp, MLC, Triton
- 45 Inference servers and orchestration: KServe, BentoML, Seldon, Ray Serve, Triton Inference Server
- 46 Composing an inference platform: the umbrella pattern
- 47 KServe InferenceService anatomy: runtime, predictor, transformer, autoscaling
- 48 vLLM in production: every flag that matters
- 49 TEI for embeddings and rerankers in production
- 50 AI gateways: Envoy AI Gateway and the OpenAI-compatible front door
- 51 Autoscaling GPU inference with KEDA
- 52 The model cold-start problem and pre-cached weights
- 53 KV cache sharing across replicas: LMCache and friends
- 54 Warmup, readiness probes, and the model-isn't-ready-yet problem
- 55 Benchmarking inference: methodology, tools, gotchas
- 56 Content safety as inference: guardrails architecture
Information Retrieval & RAG
9 chapters · Chapters 57–65
- 57 Information retrieval primer: TF-IDF, BM25, why they still matter
- 58 Dense retrieval: embeddings, contrastive learning, MTEB
- 59 Vector index internals: HNSW, IVF, ScaNN, FAISS
- 60 Hybrid search and fusion
- 61 Chunking strategies
- 62 Reranking with cross-encoders
- 63 Query rewriting, HyDE, multi-query, query decomposition
- 64 RAG evaluation: Ragas, LLM-as-judge, golden sets
- 65 Designing a RAG system end to end
Agents, Tool Use, Workflow Orchestration
7 chapters · Chapters 66–72
- 66 Tool calling and function calling: the wire protocols
- 67 The agent loop: ReAct, plan-and-execute, reflection
- 68 Multi-agent patterns and when they're worth it
- 69 MCP: the Model Context Protocol in depth
- 70 Workflow vs agent: when to use durable execution vs an LLM loop
- 71 Production agent failure modes
- 72 Designing an agent orchestration layer
Distributed Systems & Request Lifecycle
12 chapters · Chapters 73–84
- 73 The unified gateway pattern: one API in front of many backends
- 74 AuthN vs AuthZ: the distinction interviewers test on
- 75 Identity propagation across service boundaries
- 76 Rate limiting algorithms: token bucket, leaky bucket, sliding window, GCRA
- 77 Backpressure, flow control, and queue theory
- 78 Idempotency keys and exactly-once semantics
- 79 Sync, async, SSE, WebSocket, batch: five execution modes
- 80 Workflow orchestration: Temporal vs Airflow vs Step Functions vs Cadence
- 81 Inter-service trust patterns
- 82 The operations service pattern: why job lifecycle is its own SoR
- 83 Metering and billing pipelines
- 84 Telemetry firehoses: Kafka as the backbone
The Data Plane
7 chapters · Chapters 85–91
- 85 Object storage primer: the S3 model
- 86 Document stores: MongoDB and DynamoDB compared
- 87 Time-series databases: TimescaleDB, Prometheus TSDB, InfluxDB
- 88 Streaming data: Kafka, Strimzi, the broader log model
- 89 Caching patterns: Redis, cache-aside, write-through, write-behind
- 90 The lakehouse story: Parquet, Iceberg, Delta, Hudi
- 91 Feature stores: Feast, Tecton, the offline/online split
Observability, Reliability, Incidents
9 chapters · Chapters 92–100
- 92 The four golden signals, RED, USE, LETS
- 93 Metrics: Prometheus internals
- 94 Logs: Loki, structured logging, sampling, retention economics
- 95 Distributed tracing: OpenTelemetry, Jaeger, Tempo
- 96 Continuous profiling: Pyroscope, Parca
- 97 SLI, SLO, SLA, error budgets
- 98 Canary patterns: traffic-shifted, statistical, ML model canaries
- 99 Incident management: postmortems, blameless culture
- 100 ORR (Operational Readiness Review): what production-readiness actually means
Build, Deploy, Operate
13 chapters · Chapters 101–113
- 101 Build systems for monorepos: Bazel, Pants, Buck, Nx
- 102 Container fundamentals: namespaces, cgroups, OCI spec
- 103 Dependency injection patterns: Wire, manual, runtime DI
- 104 API contract design: OpenAPI vs gRPC vs GraphQL vs Connect
- 105 Python tooling: uv, ruff, black
- 106 The OCI image lifecycle: registries, digest pinning, the digest-update pattern
- 107 GitOps philosophy: ArgoCD, Flux, the App-of-Apps pattern
- 108 Helm vs Kustomize vs CDK8s
- 109 Multi-cluster, multi-region, multi-cell architecture
- 110 IaC: Terraform, Pulumi, CDK
- 111 Secrets management: 1Password, Vault, External Secrets Operator
- 112 Edge ingress: Cloudflare Tunnels, Ingress controllers, service meshes
- 113 CI as a system: path filters, per-service builds, coverage gates
ML System Design Interview Playbook
17 chapters · Chapters 114–130
- 114 The interview framework: clarify, estimate, design, drill, ops
- 115 Capacity planning math: the back-of-envelope kit
- 116 Design a chatbot for 1M users
- 117 Design a RAG system over 10TB of documents
- 118 Design a content moderation pipeline
- 119 Design a real-time recommendation system
- 120 Design a multi-tenant model serving platform
- 121 Design a multi-tenant fine-tuning service
- 122 The vocabulary that interviewers respect
- 123 Common mistakes and how to recover mid-interview
- 124 The coding interview: twenty ML systems algorithms
- 125 Design: the upstream platforms
- 126 Design: the production infrastructure
- 127 Design: the frontier scenarios
- 128 Behavioral interviews and the levels ladder
- 129 The day of the interview
- 130 Company-specific prep and mock interview transcripts