Week 13: Scaling & Cost-Efficiency

Introduction

As LLM-powered systems move from research into production, efficient scaling becomes critical. This week covers the scaling laws that relate LLM performance to compute and data, inference optimization techniques (batching, KV caching, quantization), and the cost-performance trade-offs of real-world deployment. Understanding these principles is essential for building systems that are both powerful and economically viable.
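
As a concrete anchor for those mathematical foundations, the compute-optimal ("Chinchilla") analysis of Hoffmann et al. (2022; see References) models held-out loss with a parametric form, where N is parameter count, D is training tokens, and E, A, B, α, β are constants fit to data, not universal values:

```latex
% Chinchilla-style parametric loss (Hoffmann et al., 2022):
%   N = model parameters, D = training tokens,
%   E = irreducible loss, A, B, \alpha, \beta = fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing L under a fixed training-compute budget C ≈ 6ND pushes N and D to grow in roughly equal proportion (about 20 training tokens per parameter in the paper's fits), which is why the largest model is rarely the compute-optimal one.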

Goals for the Week

  • Understand scaling laws that govern LLM performance, compute, and data requirements
  • Master inference optimization techniques: batching, KV caching, and quantization
  • Apply model compression strategies including distillation and parameter-efficient methods
  • Evaluate cost-performance trade-offs across model selection and deployment strategies
  • Build production-ready systems that balance quality, latency, and cost

Learning Guide

Videos

Short Courses:

Articles & Reading

Scaling Laws & Foundations:

Inference Optimization:

Quantization & Compression:

Cost Optimization:

Tools & Frameworks

Inference Engines:

Quantization Tools:

  • bitsandbytes — 8-bit and 4-bit quantization, integrated with Hugging Face Transformers
  • GPTQ — Post-training quantization implementation
  • AutoGPTQ — Easy-to-use GPTQ wrapper
  • AutoAWQ — Activation-aware quantization toolkit

Multi-LoRA Serving:

Commercial Platforms:

Programming Practice

Batch Processing & Optimization:

  • Implement OpenAI Batch API workflows (see the sketch after this list):
    • Send multiple prompts asynchronously
    • Measure cost savings (50% reduction) vs real-time API
    • Compare throughput and latency trade-offs
  • Batch Processing Example — OpenAI cookbook implementation
  • UPenn Batch API Tutorial — Academic workflow guide
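
A minimal sketch of that workflow, assuming the official `openai` Python client; the model name and prompts are illustrative. Batch jobs accept a completion window of up to 24 hours in exchange for the 50% discount:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One JSON object per line; custom_id lets you match outputs back to inputs.
requests = [
    {
        "custom_id": f"prompt-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",  # illustrative choice
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize scaling laws.", "Define KV caching."])
]
with open("batch_input.jsonl", "w") as f:
    f.writelines(json.dumps(r) + "\n" for r in requests)

# Upload the file, then create the batch job against it.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until completed
```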

Quantization Experiments:

  • Quantize models using bitsandbytes (see the sketch after this list):
    • Compare 4-bit, 8-bit, and FP16 performance
    • Benchmark memory usage, latency, and accuracy
    • Implement QLoRA for efficient fine-tuning
  • Quantization with Transformers — HuggingFace Accelerate guide
  • 4-bit Transformers Tutorial — Detailed implementation walkthrough
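
A minimal 4-bit (NF4) loading sketch using bitsandbytes through Hugging Face Transformers; the model ID is an illustrative placeholder. Swapping `load_in_4bit` for `load_in_8bit=True` gives the 8-bit comparison point:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # matmuls run in bf16
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Memory benchmark: an 8B model drops from ~16 GB in FP16 to ~5 GB in 4-bit.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```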

KV Cache Implementation:

  • Build KV caching from scratch (see the sketch after this list):
    • Implement attention with cached key-value pairs
    • Measure throughput improvements on repeated prompts
    • Experiment with cache eviction strategies
  • For production workflows, add a Redis or in-memory cache layer for repeated prompts and responses
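
A from-scratch sketch of single-head attention with a KV cache: each decode step computes keys/values only for the new token and appends them, so per-step work over the prefix is one matmul rather than a full recomputation. Batch size 1 and no masking, since the query is always the latest token:

```python
import torch
import torch.nn.functional as F

class CachedAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5
        self.k_cache = None  # (1, tokens_so_far, d_model) once populated
        self.v_cache = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, 1, d_model), a single new token during decoding
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:  # append instead of recomputing the whole prefix
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        attn = F.softmax(q @ self.k_cache.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v_cache

attn = CachedAttention(d_model=64)
for _ in range(5):                      # five decode steps
    out = attn(torch.randn(1, 1, 64))
print(attn.k_cache.shape)               # torch.Size([1, 5, 64])
```

A simple eviction experiment: cap the cache at a window W and drop the oldest entries (`self.k_cache[:, -W:]`), trading context for bounded memory.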

Inference Optimization:

  • Deploy vLLM or TGI for production serving (vLLM sketch after this list):
    • Configure continuous batching parameters
    • Benchmark requests/second and latency
    • Implement multi-LoRA serving for task-specific adapters
  • Compare greedy decoding, beam search, and sampling strategies
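
A minimal offline-serving sketch with vLLM (TGI exposes similar functionality over HTTP); the model name is illustrative, and multi-LoRA serving is omitted for brevity. vLLM batches the 32 prompts via continuous batching and PagedAttention under the hood:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative placeholder

# Two of the decoding strategies to compare; beam search is a third option.
greedy = SamplingParams(temperature=0.0, max_tokens=128)
sampled = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["Explain KV caching in one paragraph."] * 32  # served as one batch
for params in (greedy, sampled):
    outputs = llm.generate(prompts, params)
    print(outputs[0].outputs[0].text[:80])
```

Timing `llm.generate` across varying batch sizes yields the requests/second and latency numbers the exercise asks for.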

Cost-Quality Analysis:

  • Implement FrugalGPT-style cascading (see the sketch after this list):
    • Route queries to cheaper models first
    • Escalate to larger models only when needed
    • Measure cost reduction vs accuracy trade-off
  • Analyze token usage optimization through prompt compression
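
A FrugalGPT-style cascade sketch. The confidence scorer below is a hypothetical stand-in (the model grades its own answer); the paper instead trains a small dedicated scorer on (query, answer) pairs. Model names are illustrative, ordered cheapest first:

```python
from openai import OpenAI

client = OpenAI()
CASCADE = ["gpt-4o-mini", "gpt-4o"]  # illustrative: cheapest model first

def reliable(question: str, answer: str, threshold: float = 0.8) -> bool:
    # Hypothetical stand-in scorer: ask the cheap model for a 0-1 confidence.
    # Replace with a trained verifier for a faithful FrugalGPT reproduction.
    resp = client.chat.completions.create(
        model=CASCADE[0],
        messages=[{"role": "user", "content":
                   f"Q: {question}\nA: {answer}\n"
                   "Reply with only a confidence score between 0 and 1."}],
    )
    try:
        return float(resp.choices[0].message.content.strip()) >= threshold
    except ValueError:
        return False

def cascade_answer(question: str) -> str:
    for model in CASCADE:
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
        # Accept a cheap answer that passes the check; always accept the last model.
        if model == CASCADE[-1] or reliable(question, answer):
            return answer
```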

References


  1. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361

  2. Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … & Sifre, L. (2022). Training Compute-Optimal Large Language Models. NeurIPS 2022. arXiv:2203.15556

  3. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., … & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. arXiv:2309.06180

  4. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv:2210.17323

  5. Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978

  6. Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. arXiv:2305.14314

  7. Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176

  8. Cheng, Z., Kasai, J., & Yu, T. (2023). Batch Prompting: Efficient Inference with Large Language Model APIs. arXiv:2301.08721
