Week 3: Introduction to Large Language Models (LLMs)

Introduction

This week is dedicated to developing a high-level understanding of how large language models like GPT generate coherent and context-aware text. Drawing inspiration from Andrej Karpathy’s lecture and Hugging Face’s LLM course, we will explore how transformers are extended into autoregressive models capable of next-token prediction. We’ll examine the full generation pipeline — from tokenization and model inference to decoding techniques like greedy search, top-k, nucleus (top-p) sampling, and temperature scaling [1][2]. By the end of the week, you’ll understand not just the architecture, but also how LLMs produce meaningful outputs through probabilistic sampling and why this makes them powerful tools for tasks like summarization, translation, and dialogue.

Goals for the Week:

  • Grasp the high-level intuition behind how LLMs convert input prompts into generated outputs.
  • Understand the role of tokenization, model forward pass, and decoding strategies in text generation.
  • Explore sampling techniques: greedy decoding, temperature scaling, top-k sampling, and nucleus (top-p) sampling.
  • Gain hands-on experience with LLMs using the Hugging Face transformers library [3][4][5][6].

The references for this week point to foundational and practical resources on transformers, tokenization, and decoding strategies (top-k, nucleus/top-p, temperature) to ground the material.

Learning Guide

Examples

Tokenizing Inputs

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Convert raw text into padded, truncated token-id tensors the model can consume
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Forward pass: returns one set of logits over the sentiment classes per sequence
output = model(**tokens)
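
A quick follow-up sketch (reusing the tokenizer, model, tokens, and output objects defined above) to inspect what the tokenizer actually produced and to turn the raw logits into class probabilities with a softmax:

import torch.nn.functional as F

# The tokenizer returns a batch of token ids plus an attention mask
print(tokens["input_ids"].shape)   # (batch_size, sequence_length)
print(tokens["attention_mask"])    # 1 = real token, 0 = padding

# Map the ids of the first sequence back to the subword strings the model sees
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"][0].tolist()))

# Raw logits -> probabilities over the sentiment classes
probs = F.softmax(output.logits, dim=-1)
for seq, p in zip(sequences, probs):
    label_id = int(p.argmax())
    print(f"{seq!r} -> {model.config.id2label[label_id]} ({p[label_id].item():.3f})")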

Text Generation Pipeline

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # pin the small GPT-2 checkpoint used throughout this week

prompt = "Once upon a time, in a land far, far away,"

# Generate multiple sequences with sampling for more creative output
results = generator(
    prompt,
    max_new_tokens=50,
    num_return_sequences=3,  # Get 3 different possible continuations
    temperature=0.7,         # Controls randomness (higher = more creative)
    do_sample=True           # Enable sampling (as opposed to greedy decoding)
)

for i, result in enumerate(results):
    print(f"Result {i+1}: {result['generated_text']}\n")

Programming Practice

  • Use the Hugging Face pipeline("text-generation") with GPT-2 and explore outputs using the strategies below (a starting-point sketch follows this list):
    • Greedy decoding
    • Top-k sampling
    • Top-p (nucleus) sampling
    • Temperature scaling
  • Tokenize a prompt using AutoTokenizer and trace how tokens are fed into AutoModelForCausalLM.
  • Visualize token-level probabilities and logit scores to understand how LLMs rank and sample next tokens.
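
As a starting point for the practice items above, here is a sketch that runs the same prompt through GPT-2's generate() method once per decoding strategy; the hyperparameter values (temperature=0.7, top_k=50, top_p=0.9, 40 new tokens) are illustrative choices, not recommended settings:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time, in a land far, far away,", return_tensors="pt")

# One generate() call per decoding strategy; only the keyword arguments differ
strategies = {
    "greedy": dict(do_sample=False),
    "temperature 0.7": dict(do_sample=True, temperature=0.7),
    "top-k 50": dict(do_sample=True, top_k=50),
    "top-p 0.9": dict(do_sample=True, top_p=0.9),
}

for name, kwargs in strategies.items():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
        **kwargs,
    )
    print(f"--- {name} ---")
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True), "\n")

Comparing the outputs side by side makes the trade-off visible: greedy decoding is deterministic but often repetitive, while the sampling-based strategies trade some predictability for diversity.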

References


  1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. Proceedings of ICLR 2020. arXiv:1904.09751 ↩︎

  2. OpenAI – Text Generation and Decoding Guide (temperature, top‑p): https://platform.openai.com/docs/guides/text ↩︎

  3. Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. arXiv:1910.03771 ↩︎

  4. Hugging Face – Transformers Pipeline Tutorial: https://huggingface.co/docs/transformers/pipeline_tutorial ↩︎

  5. Hugging Face – LLM Text Generation Tutorial: https://huggingface.co/docs/transformers/llm_tutorial ↩︎

  6. Hugging Face – Tokenizers (concepts and APIs): https://huggingface.co/docs/tokenizers/index ↩︎