Introduction
As LLMs are deployed in critical systems—from healthcare to hiring—concerns around fairness, bias, safety, and ethics become paramount. This week explores how LLMs can generate harmful content, reinforce stereotypes, or propagate misinformation, and what methods exist to identify, mitigate, and manage these risks. You’ll learn about alignment techniques (RLHF, Constitutional AI), adversarial testing (red teaming, jailbreaks), and ethical deployment considerations.
Goals for the Week
- Understand different types of bias in LLMs: gender, racial, cultural, linguistic, and geographic.
- Learn about safety risks like prompt injection, jailbreaks, hallucination, and toxic generation.
- Explore alignment techniques including RLHF, Constitutional AI, and red teaming.
- Analyze ethical questions around model deployment, attribution, labor, and dual-use.
- Implement safety measures and conduct structured adversarial testing.
Learning Guide
Videos
A comprehensive overview of AI safety challenges and mitigation strategies.
Optional Short Courses:
- Red Teaming LLM Applications — DeepLearning.AI course on adversarial testing
- Quality and Safety for LLM Applications — Practical safety techniques for production systems
Articles & Reading
Comprehensive Guides:
- OpenAI: Safety Best Practices — Production safety guidelines
- Anthropic: Core Views on AI Safety — Research-driven safety perspective
- Microsoft: Responsible AI — Enterprise framework for responsible deployment
Foundational Papers:
- On the Dangers of Stochastic Parrots1 — Seminal work on LLM risks and harms
- Gender Shades2 — Intersectional bias in commercial AI systems
- Measuring and Mitigating Unintended Bias3 — Bias detection and mitigation techniques
- Language (Technology) is Power: Survey of Bias in NLP4 — Comprehensive bias survey
Safety & Alignment:
- Constitutional AI5 — Training harmless assistants via AI feedback
- RLHF: Training from Human Preferences6 — Foundation of alignment methods
- Red Teaming Language Models7 — Methods for discovering harms
- Jailbreaking Aligned Models8 — Adversarial attacks on safety guardrails
Security & Practical Guides:
- Prompt Hacking Guide — Comprehensive adversarial techniques
- Adversarial Prompting — Attack vectors and defenses
- AI Ethics Resources — Fast.ai’s ethics curriculum
- OWASP Top 10 for LLM Applications — Common vulnerability taxonomy and mitigations
Safety Tools & Libraries
Toxicity & Moderation:
- Perspective API — Google’s toxicity detection | Docs
- Detoxify — Open-source toxicity classifiers
- OpenAI Moderation API — Content policy compliance
- Guardrails AI — Validate and correct LLM outputs
- Microsoft Presidio — PII detection and redaction for text and structured data
Bias & Fairness:
- AI Fairness 360 — IBM’s comprehensive toolkit
- Fairlearn — Fairness toolkit | GitHub
- HolisticBias — Meta’s bias measurement dataset
Red Teaming & Security:
- PyRIT — Microsoft’s Python Risk Identification Toolkit for generative AI
- Garak — LLM vulnerability scanner
- PromptInject — Prompt injection detection framework
Benchmarks & Datasets:
- TruthfulQA — Measuring truthfulness in LMs
- RealToxicityPrompts — Toxic generation measurement
- BBQ — Bias Benchmark for QA
- BOLD — Bias in Open-Ended Generation
Alignment & Safety Research
Key Organizations:
Frameworks & Documentation:
- NIST AI Risk Management Framework — U.S. government AI risk framework
- EU AI Act — European risk-based regulatory framework
- Model Cards & System Cards — Transparency documentation standards
Programming Practice
-
Toxicity Detection & Mitigation:
- Evaluate outputs using Perspective API or Detoxify with configurable thresholds
- Compare multiple classifiers and build a monitoring dashboard
-
Red Teaming & Adversarial Testing:
- Test your system for prompt injections, jailbreaks, and instruction-following failures
- Use Garak or PyRIT for automated testing; log and classify failures by severity
- Create domain-specific adversarial prompts and implement defenses
-
Bias Detection & Fairness:
- Build a dashboard comparing outputs across demographic variations (e.g., “She is a doctor” vs “He is a doctor”)
- Benchmark on BBQ or BOLD datasets; implement counterfactual augmentation
-
Safety Guardrails:
- Use Guardrails AI for toxicity, PII, and factuality validation
- Implement moderation layers with rejection sampling and fallback responses
-
Alignment Evaluation:
- Test on TruthfulQA and RealToxicityPrompts
- Create a multi-dimensional safety scorecard (toxicity, bias, truthfulness, robustness)
References
-
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. FAccT 2021. DOI:10.1145/3442188.3445922. ↩︎
-
Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. FAT 2018. PMLR. ↩︎
-
Borkan, D., Dixon, L., Sorensen, J., Thain, N., & Vasserman, L. (2019). Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification. WWW 2019. arXiv:1903.04561. ↩︎
-
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. (2020). Language (Technology) is Power: A Critical Survey of “Bias” in NLP. ACL 2020. arXiv:2005.14050. ↩︎
-
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. ↩︎
-
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. NeurIPS 2022. arXiv:2203.02155. ↩︎
-
Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., … & Clark, J. (2022). Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. arXiv:2209.07858. ↩︎
-
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. ↩︎