Week 11: Safety, Bias and Ethics in LLMs

Introduction

As LLMs are deployed in critical systems—from healthcare to hiring—concerns around fairness, bias, safety, and ethics become paramount. This week explores how LLMs can generate harmful content, reinforce stereotypes, or propagate misinformation, and what methods exist to identify, mitigate, and manage these risks. You’ll learn about alignment techniques (RLHF, Constitutional AI), adversarial testing (red teaming, jailbreaks), and ethical deployment considerations.

Goals for the Week

Understand different types of bias in LLMs: gender, racial, cultural, linguistic, and geographic.
Learn about safety risks like prompt injection, jailbreaks, hallucination, and toxic generation.
Explore alignment techniques including RLHF, Constitutional AI, and red teaming.
Analyze ethical questions around model deployment, attribution, labor, and dual-use.
Implement safety measures and conduct structured adversarial testing.

Learning Guide

Videos

A comprehensive overview of AI safety challenges and mitigation strategies.

Optional Short Courses:

Red Teaming LLM Applications — DeepLearning.AI course on adversarial testing
Quality and Safety for LLM Applications — Practical safety techniques for production systems

Articles & Reading

Comprehensive Guides:

OpenAI: Safety Best Practices — Production safety guidelines
Anthropic: Core Views on AI Safety — Research-driven safety perspective
Microsoft: Responsible AI — Enterprise framework for responsible deployment

Foundational Papers:

On the Dangers of Stochastic Parrots¹ — Seminal work on LLM risks and harms
Gender Shades² — Intersectional bias in commercial AI systems
Measuring and Mitigating Unintended Bias³ — Bias detection and mitigation techniques
Language (Technology) is Power: Survey of Bias in NLP⁴ — Comprehensive bias survey

Safety & Alignment:

Constitutional AI⁵ — Training harmless assistants via AI feedback
RLHF: Training from Human Preferences⁶ — Foundation of alignment methods
Red Teaming Language Models⁷ — Methods for discovering harms
Jailbreaking Aligned Models⁸ — Adversarial attacks on safety guardrails

Security & Practical Guides:

Prompt Hacking Guide — Comprehensive adversarial techniques
Adversarial Prompting — Attack vectors and defenses
AI Ethics Resources — Fast.ai’s ethics curriculum
OWASP Top 10 for LLM Applications — Common vulnerability taxonomy and mitigations

Safety Tools & Libraries

Toxicity & Moderation:

Perspective API — Google’s toxicity detection | Docs
Detoxify — Open-source toxicity classifiers
OpenAI Moderation API — Content policy compliance
Guardrails AI — Validate and correct LLM outputs
Microsoft Presidio — PII detection and redaction for text and structured data

Bias & Fairness:

AI Fairness 360 — IBM’s comprehensive toolkit
Fairlearn — Fairness toolkit | GitHub
HolisticBias — Meta’s bias measurement dataset

Red Teaming & Security:

PyRIT — Microsoft’s Python Risk Identification Toolkit for generative AI
Garak — LLM vulnerability scanner
PromptInject — Prompt injection detection framework

Benchmarks & Datasets:

TruthfulQA — Measuring truthfulness in LMs
RealToxicityPrompts — Toxic generation measurement
BBQ — Bias Benchmark for QA
BOLD — Bias in Open-Ended Generation

Alignment & Safety Research

Key Organizations:

Frameworks & Documentation:

NIST AI Risk Management Framework — U.S. government AI risk framework
EU AI Act — European risk-based regulatory framework
Model Cards & System Cards — Transparency documentation standards

Programming Practice

Toxicity Detection & Mitigation:
- Evaluate outputs using Perspective API or Detoxify with configurable thresholds
- Compare multiple classifiers and build a monitoring dashboard
Red Teaming & Adversarial Testing:
- Test your system for prompt injections, jailbreaks, and instruction-following failures
- Use Garak or PyRIT for automated testing; log and classify failures by severity
- Create domain-specific adversarial prompts and implement defenses
Bias Detection & Fairness:
- Build a dashboard comparing outputs across demographic variations (e.g., “She is a doctor” vs “He is a doctor”)
- Benchmark on BBQ or BOLD datasets; implement counterfactual augmentation
Safety Guardrails:
- Use Guardrails AI for toxicity, PII, and factuality validation
- Implement moderation layers with rejection sampling and fallback responses
Alignment Evaluation:
- Test on TruthfulQA and RealToxicityPrompts
- Create a multi-dimensional safety scorecard (toxicity, bias, truthfulness, robustness)

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT 2021. DOI:10.1145/3442188.3445922. ↩︎
Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. FAT 2018. PMLR. ↩︎
Borkan, D., Dixon, L., Sorensen, J., Thain, N., & Vasserman, L. (2019). Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. WWW 2019. arXiv:1903.04561. ↩︎
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. ACL 2020. arXiv:2005.14050. ↩︎
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. ↩︎
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. arXiv:2203.02155. ↩︎
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., … & Clark, J. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858. ↩︎
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. ↩︎