Week 9: Safety, Bias and Ethics in LLMs

Introduction

As LLMs are deployed in critical systems—from healthcare to hiring—concerns around fairness, bias, safety, and ethics become paramount. This week explores how LLMs can generate harmful content, reinforce stereotypes, or propagate misinformation, and what methods exist to identify, mitigate, and manage these risks. You’ll learn about alignment techniques (RLHF, Constitutional AI), adversarial testing (red teaming, jailbreaks), and ethical deployment considerations.

Goals for the Week

  • Understand different types of bias in LLMs: gender, racial, cultural, linguistic, and geographic.
  • Learn about safety risks like prompt injection, jailbreaks, hallucination, and toxic generation.
  • Explore alignment techniques including RLHF, Constitutional AI, and red teaming.
  • Analyze ethical questions around model deployment, attribution, labor, and dual-use.
  • Implement safety measures and conduct structured adversarial testing.

Learning Guide

Videos

A comprehensive overview of AI safety challenges and mitigation strategies.

Optional Short Courses:

Articles & Reading

Comprehensive Guides:

Foundational Papers:

Safety & Alignment:

Security & Practical Guides:

Safety Tools & Libraries

Toxicity & Moderation:

Bias & Fairness:

Red Teaming & Security:

  • PyRIT — Microsoft’s Python Risk Identification Toolkit for generative AI
  • Garak — LLM vulnerability scanner
  • PromptInject — Prompt injection detection framework

Benchmarks & Datasets:

Alignment & Safety Research

Key Organizations:

Frameworks & Documentation:

Programming Practice

  • Toxicity Detection & Mitigation:

    • Evaluate outputs using Perspective API or Detoxify with configurable thresholds
    • Compare multiple classifiers and build a monitoring dashboard
  • Red Teaming & Adversarial Testing:

    • Test your system for prompt injections, jailbreaks, and instruction-following failures
    • Use Garak or PyRIT for automated testing; log and classify failures by severity
    • Create domain-specific adversarial prompts and implement defenses
  • Bias Detection & Fairness:

    • Build a dashboard comparing outputs across demographic variations (e.g., “She is a doctor” vs “He is a doctor”)
    • Benchmark on BBQ or BOLD datasets; implement counterfactual augmentation
  • Safety Guardrails:

    • Use Guardrails AI for toxicity, PII, and factuality validation
    • Implement moderation layers with rejection sampling and fallback responses
  • Alignment Evaluation:

    • Test on TruthfulQA and RealToxicityPrompts
    • Create a multi-dimensional safety scorecard (toxicity, bias, truthfulness, robustness)

References


  1. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT 2021. DOI:10.1145/3442188.3445922↩︎

  2. Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. FAT 2018. PMLR↩︎

  3. Borkan, D., Dixon, L., Sorensen, J., Thain, N., & Vasserman, L. (2019). Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. WWW 2019. arXiv:1903.04561↩︎

  4. Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. ACL 2020. arXiv:2005.14050↩︎

  5. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073↩︎

  6. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. arXiv:2203.02155↩︎

  7. Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., … & Clark, J. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858↩︎

  8. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043↩︎