Week 2: Attention & Transformers

[Figure: Transformer architecture]

Introduction

Transformers are the foundational architecture behind most modern large language models, including BERT, GPT, and T5. Introduced in the groundbreaking paper “Attention Is All You Need” [1], transformers use self-attention to model dependencies across a sequence, which makes training both parallelizable and scalable. This week we will explore how transformers process input tokens, how the attention mechanism works internally, and how the architecture evolves when scaled to billions of parameters. This foundation is critical for understanding how LLMs attend to context, reason, and generate text.
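
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name and the toy tensor shapes are illustrative choices for this guide, not code from the paper.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) tensors of queries, keys, and values
    d_k = q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k**0.5
    # Normalize the scores into attention weights over the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted average of the value vectors
    return weights @ v

# Toy example: batch of 1, sequence of 4 tokens, 8-dimensional vectors
q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])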

Goals for the Week:

  • Understand the core components of the Transformer architecture: embeddings, multi-head self-attention, positional encoding, feed-forward layers, and layer normalization.
  • Trace the flow of information through a Transformer block; a minimal block sketch follows this list.
  • Analyze architectural variants such as BERT [2] and GPT [3] to see how they adapt the core design.
  • Get familiar with the Hugging Face Transformers library.
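
To connect these pieces, the sketch below wires multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization into one simplified encoder block in PyTorch. The class name and default dimensions are illustrative, not a reference implementation, and the input is assumed to have already been embedded and positionally encoded.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified post-norm encoder block: attention -> add & norm -> FFN -> add & norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with a residual connection, then layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network with a residual connection, then layer norm
        x = self.norm2(x + self.ffn(x))
        return x

# Toy input: batch of 2 sequences, 10 tokens each, already embedded to d_model=512
x = torch.randn(2, 10, 512)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 512])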

Learning Guide

Programming Practice

  • Use a Hugging Face Transformers pipeline to demonstrate and solve at least two of the following tasks. Use a dataset of your choice and submit your work as a notebook. Include evaluation metrics to measure the performance of each task.
    • Image Classification
    • Speech Recognition
    • Text Classification
    • Text Generation
    • Translation
    • Any other use-case of your choice.

Examples

from transformers import pipeline
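# Build an automatic speech recognition pipeline backed by OpenAI's Whisper base (English) model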
transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-base.en"
)
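# Transcribe a sample audio clip hosted on the Hugging Face Hub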
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
# Output: {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
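
To illustrate the evaluation-metric requirement, the sketch below scores a sentiment-analysis pipeline against a handful of hand-labelled sentences. The sentences and labels here are made up for illustration; in your submission, use a real dataset and a metric appropriate to your task.

from transformers import pipeline

# Sentiment-analysis pipeline; with no model argument, a default English model is downloaded
classifier = pipeline(task="sentiment-analysis")

# Tiny hand-labelled sample standing in for a real dataset
# (label strings such as "POSITIVE"/"NEGATIVE" depend on the model used)
texts = ["I loved this movie.", "The service was terrible.", "What a fantastic lecture!"]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

# Run the pipeline on all texts and compare predicted labels against the references
predictions = [result["label"] for result in classifier(texts)]
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.2f}")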

References


  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).

  3. Radford, A., & Narasimhan, K. (2018). Improving language understanding by generative pre-training.