Introduction
As LLM-powered systems transition from research to production, efficient scaling becomes critical. This week explores the mathematical foundations of scaling laws, optimization techniques for inference (batching, caching, quantization), and cost-performance trade-offs that enable real-world deployment. Understanding these principles is essential for building systems that are both powerful and economically viable.
Goals for the Week
- Understand scaling laws that govern LLM performance, compute, and data requirements
- Master inference optimization techniques: batching, KV caching, and quantization
- Apply model compression strategies including distillation and parameter-efficient methods
- Evaluate cost-performance trade-offs across model selection and deployment strategies
- Build production-ready systems that balance quality, latency, and cost
Learning Guide
Videos
Short Courses:
- Efficiently Serving LLMs — DeepLearning.AI course by Travis Addair on production serving strategies
- Quantization Fundamentals — DeepLearning.AI short course with Hugging Face on model quantization techniques
Articles & Reading
Scaling Laws & Foundations:
- Scaling Laws for Neural Language Models (OpenAI) [1] — Foundational work on compute-performance relationships
- LLM Scaling Laws Explained — Cameron Wolfe’s comprehensive overview
- Chinchilla Scaling Laws [2] — Optimal compute-data trade-offs
- Understanding LLM Scalability Challenges — Practical deployment considerations
Inference Optimization:
- KV Caching Explained — HuggingFace guide to key-value caching
- Coding the KV Cache from Scratch — Sebastian Raschka’s implementation tutorial
- Continuous Batching for LLM Serving — Dynamic batching strategies
- PagedAttention: Efficient Memory Management [3] — vLLM’s memory optimization technique
Quantization & Compression:
- Quantization Concepts Guide — HuggingFace comprehensive overview
- GPTQ: Post-Training Quantization [4] — Efficient 4-bit quantization method
- AWQ: Activation-aware Weight Quantization [5] — Improved quantization with minimal accuracy loss
- QLoRA: Efficient Fine-tuning [6] — 4-bit quantized low-rank adaptation
Cost Optimization:
- FrugalGPT: Cost Reduction Strategies [7] — LLM cascades and model selection
- FrugalGPT Implementation Guide — Practical application overview
- Benchmarking LLM Inference Costs — NVIDIA’s deployment analysis
- Batch Prompting for Efficient Inference [8] — Reducing API costs through batching
Tools & Frameworks
Inference Engines:
- vLLM — High-throughput serving with PagedAttention and continuous batching
- Text Generation Inference (TGI) — HuggingFace production serving
- TensorRT-LLM — NVIDIA’s optimized inference library
- llama.cpp — Efficient CPU/GPU inference for Llama models
Quantization Tools:
- bitsandbytes — 4-bit and 8-bit quantization integration
- GPTQ — Post-training quantization implementation
- AutoGPTQ — Easy-to-use GPTQ wrapper
- AutoAWQ — Activation-aware quantization toolkit
Multi-LoRA Serving:
- LoRAX — Dynamic LoRA adapter serving
- vLLM LoRA Support — Multi-adapter inference
- TGI Multi-LoRA — HuggingFace adapter serving
- Triton LoRA Backend — Enterprise LoRA deployment
Commercial Platforms:
- OpenAI Batch API — Asynchronous batch processing with 50% cost reduction
- Anthropic Claude — Prompt caching and batch processing
- Google Gemini — Batch inference with Gemini
Programming Practice
Batch Processing & Optimization:
- Implement OpenAI Batch API workflows (see the sketch after this list):
  - Send multiple prompts asynchronously
  - Measure cost savings (50% reduction) vs. the real-time API
  - Compare throughput and latency trade-offs
- Batch Processing Example — OpenAI cookbook implementation
- UPenn Batch API Tutorial — Academic workflow guide
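Below is a minimal sketch of a Batch API workflow using the `openai` Python SDK (v1.x): requests are written to a JSONL file, uploaded, and submitted as an asynchronous job. The model name, file path, and prompts are placeholders; adapt them to your account and quota.

```python
# Minimal sketch of an OpenAI Batch API workflow (openai Python SDK v1.x).
# Model name and file path are placeholders; adapt to your account/quota.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Write one chat-completion request per JSONL line.
prompts = [
    "Summarize the Chinchilla scaling result in one sentence.",
    "Explain KV caching in one sentence.",
]
with open("batch_input.jsonl", "w") as f:
    for i, prompt in enumerate(prompts):
        f.write(json.dumps({
            "custom_id": f"req-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user", "content": prompt}]},
        }) + "\n")

# 2. Upload the file and create the asynchronous batch job (24h window).
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
                                 purpose="batch")
job = client.batches.create(input_file_id=batch_file.id,
                            endpoint="/v1/chat/completions",
                            completion_window="24h")

# 3. Poll later; once status == "completed", download the output file.
job = client.batches.retrieve(job.id)
if job.status == "completed":
    for line in client.files.content(job.output_file_id).text.splitlines():
        result = json.loads(line)
        answer = result["response"]["body"]["choices"][0]["message"]["content"]
        print(result["custom_id"], answer[:60])
```

For the cost comparison, run the same prompts through the synchronous API and compare total tokens billed at batch vs. real-time pricing, plus wall-clock completion time.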
Quantization Experiments:
- Quantize models using bitsandbytes (a sketch follows this list):
  - Compare 4-bit, 8-bit, and FP16 performance
  - Benchmark memory usage, latency, and accuracy
  - Implement QLoRA for efficient fine-tuning
- Quantization with Transformers — HuggingFace Accelerate guide
- 4-bit Transformers Tutorial — Detailed implementation walkthrough
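As a starting point, the sketch below (assuming the `transformers`, `accelerate`, and `bitsandbytes` packages and a CUDA GPU) loads the same checkpoint in FP16, 8-bit, and 4-bit NF4 and compares memory footprints; the model ID is just a small, ungated example.

```python
# Sketch: load the same checkpoint in FP16, 8-bit, and 4-bit NF4 with
# transformers + bitsandbytes and compare GPU memory footprints.
# "facebook/opt-1.3b" is just a small, ungated example model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"

configs = {
    "fp16": dict(torch_dtype=torch.float16),
    "int8": dict(quantization_config=BitsAndBytesConfig(load_in_8bit=True)),
    "nf4": dict(quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
        bnb_4bit_use_double_quant=True,          # quantize the scales as well
    )),
}

for name, kwargs in configs.items():
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", **kwargs)
    print(f"{name}: {model.get_memory_footprint() / 1e9:.2f} GB")
    # Latency and accuracy benchmarks (timed generate() calls, perplexity on a
    # held-out set) would slot into this same loop.
    del model
    torch.cuda.empty_cache()
```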
KV Cache Implementation:
- Build KV caching from scratch (see the minimal attention sketch after this list):
  - Implement attention with cached key-value pairs
  - Measure throughput improvements on repeated prompts
  - Experiment with cache eviction strategies
- Deploy with Redis or in-memory caching for production workflows
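The following PyTorch sketch shows the core idea: at each decoding step only the newest token's query, key, and value are computed, while keys and values from earlier steps are reused from the cache. It is a single-head toy, not the paged or multi-head layout used by production engines.

```python
# Toy single-head attention with a KV cache (PyTorch): at each decoding step
# only the newest token's Q/K/V are computed, and past keys/values are reused.
import math
import torch

class CachedSelfAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.d = d_model
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)

    def forward(self, x_new, cache=None):
        # x_new: (batch, 1, d_model) -- only the most recent token.
        q = self.q_proj(x_new)
        k = self.k_proj(x_new)
        v = self.v_proj(x_new)
        if cache is not None:
            k = torch.cat([cache["k"], k], dim=1)  # reuse all previous keys
            v = torch.cat([cache["v"], v], dim=1)  # reuse all previous values
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return attn @ v, {"k": k, "v": v}          # output plus updated cache

# Decoding loop: each step feeds one new token embedding and the growing cache.
layer, cache = CachedSelfAttention(64), None
for _ in range(5):
    out, cache = layer(torch.randn(1, 1, 64), cache)
print(cache["k"].shape)  # torch.Size([1, 5, 64])
```

To see the payoff, time generation with and without the cache: the uncached version recomputes keys and values for the entire prefix at every step, so per-token cost grows with sequence length.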
Inference Optimization:
- Deploy vLLM or TGI for production serving (a hedged vLLM example follows this list):
  - Configure continuous batching parameters
  - Benchmark requests/second and latency
- Implement multi-LoRA serving for task-specific adapters
- Compare greedy decoding, beam search, and sampling strategies
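A hedged example of offline batched generation with vLLM is shown below; continuous batching and PagedAttention are handled internally, and `gpu_memory_utilization` / `max_num_seqs` are the main knobs to experiment with. The checkpoint is a small placeholder; any vLLM-supported model works the same way.

```python
# Hedged sketch: offline batched generation with vLLM. Continuous batching and
# PagedAttention are handled by the engine; the constructor arguments below
# are the main tuning knobs. The checkpoint is a small placeholder model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-1.3b",        # any vLLM-supported causal LM
    gpu_memory_utilization=0.90,      # VRAM fraction for weights + KV cache
    max_num_seqs=256,                 # cap on sequences batched concurrently
)

prompts = [f"Question {i}: explain PagedAttention briefly." for i in range(32)]
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(prompts, params)  # requests are scheduled and batched continuously
for out in outputs:
    print(out.outputs[0].text[:80])
```

Recent vLLM releases also ship an OpenAI-compatible HTTP server (`vllm serve <model>`), which exposes the same engine for online benchmarking of requests/second and latency under concurrent load.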
Cost-Quality Analysis:
- Implement FrugalGPT-style cascading (see the sketch after this list):
  - Route queries to cheaper models first
  - Escalate to larger models only when needed
  - Measure cost reduction vs. accuracy trade-off
- Analyze token usage optimization through prompt compression
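A simple cascade can be sketched as below: query the cheapest model first and escalate only when an acceptance check fails. The `confident` scorer here is a hypothetical placeholder (FrugalGPT itself trains a small scoring model for this decision), and the model names are illustrative.

```python
# Sketch of a FrugalGPT-style cascade: query the cheapest model first and
# escalate only when an acceptance check fails. The confident() scorer is a
# hypothetical placeholder -- FrugalGPT trains a small scoring model instead.
from openai import OpenAI

client = OpenAI()
CASCADE = ["gpt-4o-mini", "gpt-4o"]  # cheapest first; model names are illustrative

def confident(answer: str) -> bool:
    """Hypothetical acceptance check; replace with a trained quality scorer."""
    return bool(answer.strip()) and "i'm not sure" not in answer.lower()

def cascade_answer(question: str) -> tuple[str, str]:
    for model in CASCADE:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = resp.choices[0].message.content
        # Accept the cheap answer if it passes; the last model always answers.
        if confident(answer) or model == CASCADE[-1]:
            return model, answer
    raise RuntimeError("unreachable")

model_used, answer = cascade_answer("When is 4-bit quantization a bad idea?")
print(model_used, answer[:100])
```

To measure the trade-off, log which tier answered each query, total token costs per tier, and accuracy on a labeled evaluation set, then compare against sending everything to the largest model.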
References
1. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
2. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … & Sifre, L. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. arXiv:2203.15556.
3. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., … & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. arXiv:2309.06180.
4. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323.
5. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
6. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314.
7. Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176.
8. Cheng, D., Huang, S., Bi, W., Zhuang, Y., Jiang, S., Jiao, Y., … & Wei, F. (2023). Batch Prompting: Efficient Inference with Large Language Model APIs. arXiv:2310.03094.