Week 10: Evaluation of LLMs

Introduction

As LLMs become central to critical applications, evaluating their behavior goes beyond accuracy—it includes trustworthiness, relevance, coherence, fairness, and safety. This week focuses on systematically measuring how LLMs perform across different axes. You’ll explore evaluation for classification, generation, reasoning, retrieval, and chat-based interactions. We’ll cover standard NLP metrics like BLEU, ROUGE, and F1, as well as newer methods like model-based evaluation, human preference ranking, and red teaming. You’ll also learn to use tools like HELM, TruLens, and OpenAI’s Evals framework to design custom evaluations for your own applications.

Goals for the Week

  • Understand the dimensions of LLM evaluation: correctness, coherence, faithfulness, toxicity, bias, etc.
  • Apply task-specific metrics (e.g., BLEU, ROUGE, METEOR for summarization; EM/F1 for QA); see the short EM/F1 sketch after this list.
  • Explore open-source tools and frameworks for evaluating LLM outputs.
  • Learn how to conduct human-in-the-loop evaluation using rubrics and A/B testing.
  • Perform red teaming to stress-test model safety and robustness.
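
To make the EM/F1 goal concrete, here is a minimal sketch using the squad metric from Hugging Face's evaluate package; the question ID and answer strings are invented for illustration.

```python
# Minimal EM/F1 sketch using Hugging Face's `evaluate` package (squad metric).
# The ID and answer strings below are illustrative only.
import evaluate

squad = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "in 2017"}]
references = [{"id": "q1", "answers": {"text": ["2017"], "answer_start": [0]}}]

# EM is 0 here because the normalized strings differ,
# while F1 still credits the token overlap (about 67).
print(squad.compute(predictions=predictions, references=references))
```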

Learning Guide

Videos

A must-watch talk by Josh Tobin with practical insights on evaluating LLMs.

Optional Short Course:

Articles & Reading

Comprehensive Guides:

Best Practices:

Metrics Deep Dive:

Evaluation Tools & Libraries

Open Source Frameworks:

Production & Monitoring:

Benchmarking Frameworks & Leaderboards

Programming Practice

  • Use DeepEval to create custom evaluations for summarization or QA tasks (a DeepEval sketch follows this list):
    • Implement G-Eval or answer relevancy metrics
    • Create custom evaluation test cases
    • Integrate with pytest for automated testing
  • Evaluate generation quality using automatic metrics [3, 4] (a metrics sketch follows this list):
    • Use evaluate.load("rouge"), evaluate.load("bleu"), and evaluate.load("bertscore") from Hugging Face’s evaluate package
    • Compare prompt variations and decoding strategies on factual accuracy
    • Measure semantic similarity between generated and reference answers
  • Use TruLens, LangSmith, or Arize to log and monitor LLM output quality (a monitoring sketch follows this list):
    • Track metrics like relevance, groundedness, and sentiment
    • Set up evaluation pipelines for continuous monitoring
    • Visualize evaluation results and trends
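
For the DeepEval exercise above, here is a minimal sketch assuming DeepEval's GEval and AnswerRelevancyMetric interfaces and its pytest-compatible assert_test helper. The question, answer, and thresholds are placeholders, the API changes between releases, and LLM-judged metrics need a judge model configured (an OpenAI API key by default).

```python
# Sketch of a DeepEval test file (run with `pytest` or `deepeval test run`).
# Assumes DeepEval's GEval / AnswerRelevancyMetric APIs; check the docs for
# your installed version. LLM-judged metrics need a judge model configured.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# G-Eval: an LLM judge scores the output against free-form criteria.
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output a factually correct and complete answer to the input?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)
relevancy = AnswerRelevancyMetric(threshold=0.7)

def test_qa_answer():
    test_case = LLMTestCase(
        input="When was the transformer architecture introduced?",
        # In practice this comes from your model; hard-coded for the sketch.
        actual_output="The transformer architecture was introduced in 2017.",
    )
    # Fails the test if any metric scores below its threshold.
    assert_test(test_case, [correctness, relevancy])
```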
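
For the metrics exercise, the sketch below loads ROUGE, BLEU, and BERTScore through Hugging Face's evaluate package. The example texts are invented, and the rouge_score and bert_score backends must be installed alongside evaluate.

```python
# ROUGE / BLEU / BERTScore with Hugging Face's `evaluate` package.
# Requires the `rouge_score` and `bert_score` backends; texts are illustrative.
import evaluate

predictions = ["The model summarizes the report in two short sentences."]
references = ["The report is summarized by the model in two sentences."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

# ROUGE: n-gram and longest-common-subsequence overlap (rouge1/rouge2/rougeL).
print(rouge.compute(predictions=predictions, references=references))
# BLEU: modified n-gram precision; it accepts several references per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
# BERTScore: semantic similarity from contextual embeddings (per-example P/R/F1).
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```

Running the same snippet over outputs produced with different prompts or decoding settings gives a quick side-by-side comparison, and BERTScore is the closest of the three to the semantic-similarity measurement mentioned above.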
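
TruLens, LangSmith, and Arize each have their own APIs, and they change between versions, so rather than pin one down the sketch below shows the underlying logging pattern in a library-agnostic way: score each interaction for relevance and groundedness with embedding similarity (sentence-transformers is an assumption here, as are the model name and log path) and append the record to a JSONL file that a notebook or dashboard can visualize. The hosted tools wrap the same pattern with built-in feedback functions, tracing, and dashboards.

```python
# Library-agnostic monitoring sketch (TruLens / LangSmith / Arize provide this
# pattern out of the box). Model name and log path are illustrative choices.
import json
import time
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def log_interaction(question: str, answer: str, context: str,
                    path: str = "llm_monitor.jsonl") -> dict:
    """Score one LLM response and append the record to a JSONL log."""
    q_emb, a_emb, c_emb = embedder.encode([question, answer, context],
                                          convert_to_tensor=True)
    record = {
        "timestamp": time.time(),
        "question": question,
        "answer": answer,
        # Relevance proxy: is the answer about the question?
        "answer_relevance": float(util.cos_sim(q_emb, a_emb)),
        # Groundedness proxy: does the answer stay close to the retrieved context?
        # (Sentiment would need a separate classifier and is omitted here.)
        "groundedness": float(util.cos_sim(a_emb, c_emb)),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_interaction(
    question="What does ROUGE measure?",
    answer="ROUGE measures n-gram overlap between a summary and a reference summary.",
    context="ROUGE compares n-gram overlap between generated and reference summaries.",
)
```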

References


  1. Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., … & Koreeda, Y. (2022). Holistic Evaluation of Language Models. arXiv preprint. arXiv:2211.09110

  2. Chiang, W. L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., … & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint. arXiv:2403.04132

  3. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311-318). ACL Anthology

  4. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR). arXiv:1904.09675