Week 10: Large Multimodal Models (LMMs)

Introduction

While traditional LLMs operate purely on text, the next frontier in AI is multimodal modeling: systems that understand and generate across multiple data types such as text, images, video, and audio. Models like GPT-4V, Gemini, and LLaVA combine visual perception with language understanding, enabling capabilities such as image captioning, document understanding, visual question answering (VQA), and even code generation from screenshots, while contrastive encoders like CLIP align images and text in a shared embedding space for retrieval and zero-shot classification.

Goals for the Week

  • Understand the architectures behind vision-language and multimodal LLMs
  • Learn about cross-modal embedding alignment and fusion techniques (CLIP, ALIGN, contrastive learning); see the loss sketch after this list
  • Build systems that generate text from images and query visual content using natural language
  • Explore real-world applications: VQA, captioning, OCR, visual grounding, and multimodal RAG
  • Evaluate multimodal model outputs for relevance, consistency, and hallucination
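
To make the contrastive-alignment idea concrete, here is a minimal sketch of the CLIP-style symmetric InfoNCE loss over a batch of paired image and text embeddings; the batch size, embedding dimension, and temperature are illustrative, and the random tensors stand in for real encoder outputs.

```python
# Sketch of the CLIP-style symmetric contrastive (InfoNCE) objective
# used to align image and text embeddings; shapes and temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)  # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)                # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```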

Learning Guide

Videos

Optional Short Courses:

Articles & Reading

Comprehensive Guides:

Additional Reading

Foundational Papers:

Evaluation & Benchmarks:

Models & Tools

Commercial APIs:

Open-Source Models:

  • LLaVA — Visual instruction tuning
  • BLIP-2 — Efficient vision-language pre-training
  • Qwen-VL — Multilingual multimodal understanding
  • CogVLM — Open visual expert language model
  • Fuyu-8B — Simplified multimodal architecture

Frameworks & Libraries:

Evaluation Tools:

Applications & Use Cases

Document Understanding:

Visual Reasoning:

Video Understanding:

Programming Practice

Vision-Language Inference:

  • Use OpenAI GPT-4V or Gemini to (a minimal API sketch follows this list):
    • Generate descriptions of diagrams, charts, and UI mockups
    • Answer questions about images (VQA)
    • Extract structured data from documents (receipts, forms)
    • Solve visual puzzles and reasoning tasks
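
As a starting point, here is a minimal VQA sketch using the OpenAI Python SDK; the model name (gpt-4o), image URL, and prompt are placeholders to adapt, and Gemini can be used analogously through Google's SDK.

```python
# Minimal VQA sketch with the OpenAI Python SDK (v1.x).
# Assumes OPENAI_API_KEY is set; the model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model available to your account
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show? Answer in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=200,
)

print(response.choices[0].message.content)
```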

Open-Source Model Deployment:

  • Run inference with LLaVA or BLIP-2:
    • Caption images from diverse domains (see the BLIP-2 sketch after this list)
    • Rank image-text relevance pairs
    • Build a zero-shot image classifier using CLIP
    • Implement visual search over an image gallery
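
For the captioning task, a minimal local-inference sketch with a BLIP-2 checkpoint from Hugging Face is shown below; the Salesforce/blip2-opt-2.7b model choice, the image path, and the fp16 cast are assumptions to adjust to your hardware.

```python
# Image captioning sketch with BLIP-2 via Hugging Face transformers.
# Assumes the Salesforce/blip2-opt-2.7b checkpoint; swap in a smaller model if memory is tight.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

image = Image.open("example.jpg").convert("RGB")  # placeholder path

# Unconditional captioning: the model generates a description of the image.
inputs = processor(images=image, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(generated_ids[0], skip_special_tokens=True).strip()
print(caption)
```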

Multimodal RAG System:

  • Build a document Q&A system (a retrieval sketch follows this list):
    • Embed images and text using CLIP or multimodal embeddings
    • Store in vector database (Weaviate, Qdrant, Chroma)
    • Retrieve relevant images based on text queries
    • Generate answers combining visual and textual context
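
The retrieval half of such a system can be sketched as follows: embed images with CLIP, index them in a local Chroma collection, and query by text in the same embedding space. The folder path, collection name, and top-k below are illustrative, and the retrieved images would then be handed to a vision-language model to generate the final answer.

```python
# Multimodal retrieval sketch: CLIP embeddings + Chroma vector store.
# Paths, collection name, and k are placeholders; answer generation is left to a VLM afterwards.
from pathlib import Path

import chromadb
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep the index
collection = client.create_collection(name="doc_images", metadata={"hnsw:space": "cosine"})

# 1) Embed and index every image in a folder.
image_paths = sorted(Path("docs/images").glob("*.png"))  # placeholder folder
for path in image_paths:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = torch.nn.functional.normalize(emb, dim=-1)[0].tolist()
    collection.add(ids=[str(path)], embeddings=[emb], metadatas=[{"path": str(path)}])

# 2) Embed a text query in the same space and retrieve the closest images.
query = "a bar chart of quarterly revenue"
text_inputs = processor(text=[query], return_tensors="pt", padding=True)
with torch.no_grad():
    q_emb = model.get_text_features(**text_inputs)
q_emb = torch.nn.functional.normalize(q_emb, dim=-1)[0].tolist()

results = collection.query(query_embeddings=[q_emb], n_results=3)
print(results["ids"][0])  # paths of the most relevant images for the query
```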

Evaluation & Testing:

  • Benchmark on VQAv2 or MMBench datasets
  • Measure hallucination using POPE
  • Calculate CLIP Score for image-text alignment (see the sketch after this list)
  • Create custom test sets for domain-specific applications
  • Implement safety checks for visual content (NSFW, bias)
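
For the CLIP Score item, a minimal sketch is to take the cosine similarity between CLIP image and text embeddings, clipped at zero; published variants rescale this value differently (2.5 in Hessel et al.'s CLIPScore, 100 in some libraries), so the scaling shown here is one assumption among several.

```python
# CLIPScore sketch: cosine similarity between CLIP image and text embeddings.
# Implementations differ in scaling (2.5 in Hessel et al. 2021, 100 in some libraries);
# the raw cosine is reported alongside the 2.5-scaled variant.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_score(image_path: str, caption: str) -> float:
    """Return max(cosine(image, caption), 0) in CLIP embedding space."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    img = torch.nn.functional.normalize(outputs.image_embeds, dim=-1)
    txt = torch.nn.functional.normalize(outputs.text_embeds, dim=-1)
    cos = (img * txt).sum(dim=-1).item()
    return max(cos, 0.0)


score = clip_score("example.jpg", "a bar chart of quarterly revenue")  # placeholder inputs
print(f"cosine: {score:.3f}  scaled (x2.5): {2.5 * score:.3f}")
```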

References


  1. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020

  2. Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … & Simonyan, K. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS 2022. arXiv:2204.14198

  3. Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597

  4. Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. NeurIPS 2023. arXiv:2304.08485

  5. Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W. X., & Wen, J. R. (2023). Evaluating Object Hallucination in Large Vision-Language Models. EMNLP 2023. arXiv:2305.10355