Week 10: Evaluation of LLMs

Introduction

As LLMs become central to critical applications, evaluating their behavior goes beyond accuracy—it includes trustworthiness, relevance, coherence, fairness, and safety. This week focuses on systematically measuring how LLMs perform across different axes. You’ll explore evaluation for classification, generation, reasoning, retrieval, and chat-based interactions. We’ll cover standard NLP metrics like BLEU, ROUGE, and F1, as well as newer methods like model-based evaluation, human preference ranking, and red teaming. You’ll also learn to use tools like HELM, TruLens, and OpenAI’s Evals framework to design custom evaluations for your own applications.

Goals for the Week

  • Understand the dimensions of LLM evaluation: correctness, coherence, faithfulness, toxicity, bias, etc.
  • Apply task-specific metrics (e.g., BLEU, ROUGE, METEOR for summarization; EM/F1 for QA); see the short EM/F1 sketch after this list.
  • Explore open-source tools and frameworks for evaluating LLM outputs.
  • Learn how to conduct human-in-the-loop evaluation using rubrics and A/B testing.
  • Perform red teaming to stress-test model safety and robustness.
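
To make the EM/F1 goal concrete, here is a minimal sketch using the squad metric from Hugging Face's evaluate package; the question ID and answer strings are invented for illustration.

```python
# Minimal EM/F1 sketch using Hugging Face's `evaluate` package (squad metric).
# The ID and answer strings below are illustrative only.
import evaluate

squad = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "in 2017"}]
references = [{"id": "q1", "answers": {"text": ["2017"], "answer_start": [0]}}]

# EM is 0 here because the normalized strings differ,
# while F1 still credits the token overlap (about 67).
print(squad.compute(predictions=predictions, references=references))
```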

Learning Guide

Videos

A must-watch talk by Josh Tobin with practical insights on evaluating LLMs.

Optional Short Course:

Articles & Reading

Comprehensive Guides:

Best Practices:

Metrics Deep Dive:

Evaluation Tools & Libraries

Open Source Frameworks:

Production & Monitoring:

Benchmarking Frameworks & Leaderboards

Programming Practice

  • Use DeepEval to create custom evaluations for summarization or QA tasks (a DeepEval sketch follows this list):
    • Implement G-Eval or answer relevancy metrics
    • Create custom evaluation test cases
    • Integrate with pytest for automated testing
  • Evaluate generation quality using automatic metrics [3, 4] (a metrics sketch follows this list):
    • Use evaluate.load("rouge"), evaluate.load("bleu"), and evaluate.load("bertscore") from Hugging Face’s evaluate package
    • Compare prompt variations and decoding strategies on factual accuracy
    • Measure semantic similarity between generated and reference answers
  • Use TruLens, LangSmith, or Arize to log and monitor LLM output quality (a monitoring sketch follows this list):
    • Track metrics like relevance, groundedness, and sentiment
    • Set up evaluation pipelines for continuous monitoring
    • Visualize evaluation results and trends
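
For the DeepEval exercise above, here is a minimal sketch assuming DeepEval's GEval and AnswerRelevancyMetric interfaces and its pytest-compatible assert_test helper. The question, answer, and thresholds are placeholders, the API changes between releases, and LLM-judged metrics need a judge model configured (an OpenAI API key by default).

```python
# Sketch of a DeepEval test file (run with `pytest` or `deepeval test run`).
# Assumes DeepEval's GEval / AnswerRelevancyMetric APIs; check the docs for
# your installed version. LLM-judged metrics need a judge model configured.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# G-Eval: an LLM judge scores the output against free-form criteria.
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output a factually correct and complete answer to the input?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)
relevancy = AnswerRelevancyMetric(threshold=0.7)

def test_qa_answer():
    test_case = LLMTestCase(
        input="When was the transformer architecture introduced?",
        # In practice this comes from your model; hard-coded for the sketch.
        actual_output="The transformer architecture was introduced in 2017.",
    )
    # Fails the test if any metric scores below its threshold.
    assert_test(test_case, [correctness, relevancy])
```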
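
For the metrics exercise, the sketch below loads ROUGE, BLEU, and BERTScore through Hugging Face's evaluate package. The example texts are invented, and the rouge_score and bert_score backends must be installed alongside evaluate.

```python
# ROUGE / BLEU / BERTScore with Hugging Face's `evaluate` package.
# Requires the `rouge_score` and `bert_score` backends; texts are illustrative.
import evaluate

predictions = ["The model summarizes the report in two short sentences."]
references = ["The report is summarized by the model in two sentences."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

# ROUGE: n-gram and longest-common-subsequence overlap (rouge1/rouge2/rougeL).
print(rouge.compute(predictions=predictions, references=references))
# BLEU: modified n-gram precision; it accepts several references per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
# BERTScore: semantic similarity from contextual embeddings (per-example P/R/F1).
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```

Running the same snippet over outputs produced with different prompts or decoding settings gives a quick side-by-side comparison, and BERTScore is the closest of the three to the semantic-similarity measurement mentioned above.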
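
TruLens, LangSmith, and Arize each have their own APIs, and they change between versions, so rather than pin one down the sketch below shows the underlying logging pattern in a library-agnostic way: score each interaction for relevance and groundedness with embedding similarity (sentence-transformers is an assumption here, as are the model name and log path) and append the record to a JSONL file that a notebook or dashboard can visualize. The hosted tools wrap the same pattern with built-in feedback functions, tracing, and dashboards.

```python
# Library-agnostic monitoring sketch (TruLens / LangSmith / Arize provide this
# pattern out of the box). Model name and log path are illustrative choices.
import json
import time
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def log_interaction(question: str, answer: str, context: str,
                    path: str = "llm_monitor.jsonl") -> dict:
    """Score one LLM response and append the record to a JSONL log."""
    q_emb, a_emb, c_emb = embedder.encode([question, answer, context],
                                          convert_to_tensor=True)
    record = {
        "timestamp": time.time(),
        "question": question,
        "answer": answer,
        # Relevance proxy: is the answer about the question?
        "answer_relevance": float(util.cos_sim(q_emb, a_emb)),
        # Groundedness proxy: does the answer stay close to the retrieved context?
        # (Sentiment would need a separate classifier and is omitted here.)
        "groundedness": float(util.cos_sim(a_emb, c_emb)),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_interaction(
    question="What does ROUGE measure?",
    answer="ROUGE measures n-gram overlap between a summary and a reference summary.",
    context="ROUGE compares n-gram overlap between generated and reference summaries.",
)
```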

References


  1. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., … & Koreeda, Y. (2022). Holistic Evaluation of Language Models. arXiv preprint. arXiv:2211.09110

  2. Chiang, W. L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., … & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint. arXiv:2403.04132

  3. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). ACL Anthology

  4. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR). arXiv:1904.09675