Week 2: Attention & Transformers

[Figure: Transformer architecture]

Introduction

Transformers are the foundational architecture behind most modern large language models, including BERT, GPT, and T5. Introduced in the groundbreaking paper “Attention Is All You Need” [1], transformers use self-attention to model dependencies across a sequence, which makes training both parallelizable and scalable. This week we will explore how transformers process input tokens, how the attention mechanism works internally, and how the architecture evolves when scaled to billions of parameters. This foundation is critical for understanding how LLMs attend to context, reason, and generate text.
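
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in PyTorch. The function name and the toy tensor shapes are illustrative choices for this guide, not code from the paper.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) tensors of queries, keys, and values
    d_k = q.size(-1)
    # Similarity between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / d_k**0.5
    # Normalize the scores into attention weights over the sequence
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted average of the value vectors
    return weights @ v

# Toy example: batch of 1, sequence of 4 tokens, 8-dimensional vectors
q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])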

Goals for the Week:

  • Understand the core components of the Transformer architecture: embeddings, multi-head self-attention, positional encoding, feed-forward layers, and layer normalization.
  • Trace the flow of information through a Transformer block; a minimal block sketch follows this list.
  • Analyze architectural variants such as BERT [2] and GPT [3] to see how they adapt the core design.
  • Get familiar with the Hugging Face Transformers library.
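
To connect these pieces, the sketch below wires multi-head self-attention, a position-wise feed-forward network, residual connections, and layer normalization into one simplified encoder block in PyTorch. The class name and default dimensions are illustrative, not a reference implementation, and the input is assumed to have already been embedded and positionally encoded.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified post-norm encoder block: attention -> add & norm -> FFN -> add & norm."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with a residual connection, then layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network with a residual connection, then layer norm
        x = self.norm2(x + self.ffn(x))
        return x

# Toy input: batch of 2 sequences, 10 tokens each, already embedded to d_model=512
x = torch.randn(2, 10, 512)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 512])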

Learning Guide

Programming Practice

  • Use a Hugging Face Transformers pipeline to demonstrate and solve at least two of the following tasks. Use a dataset of your choice and submit your work as a notebook. Include evaluation metrics to measure the performance of each task.
    • Image Classification
    • Speech Recognition
    • Text Classification
    • Text Generation
    • Translation
    • Any other use-case of your choice.

Examples

from transformers import pipeline
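# Build an automatic speech recognition pipeline backed by OpenAI's Whisper base (English) model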
transcriber = pipeline(
    task="automatic-speech-recognition", model="openai/whisper-base.en"
)
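# Transcribe a sample audio clip hosted on the Hugging Face Hub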
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
# Output: {'text': ' I have a dream that one day this nation will rise up and live out the true meaning of its creed.'}
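
To illustrate the evaluation-metric requirement, the sketch below scores a sentiment-analysis pipeline against a handful of hand-labelled sentences. The sentences and labels here are made up for illustration; in your submission, use a real dataset and a metric appropriate to your task.

from transformers import pipeline

# Sentiment-analysis pipeline; with no model argument, a default English model is downloaded
classifier = pipeline(task="sentiment-analysis")

# Tiny hand-labelled sample standing in for a real dataset
# (label strings such as "POSITIVE"/"NEGATIVE" depend on the model used)
texts = ["I loved this movie.", "The service was terrible.", "What a fantastic lecture!"]
labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

# Run the pipeline on all texts and compare predicted labels against the references
predictions = [result["label"] for result in classifier(texts)]
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(f"Accuracy: {accuracy:.2f}")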

References


  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

  2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).

  3. Radford, A., & Narasimhan, K. (2018). Improving language understanding by generative pre-training.