Week 12: LLM APIs & Deployment

Introduction

This week focuses on integrating LLMs into production-grade applications using APIs, SDKs, and frameworks like LangChain. You’ll learn how to call LLM APIs (OpenAI, Anthropic, Cohere, Hugging Face) through LangChain abstractions, manage streaming/batching, handle rate limits and costs, and deploy backends using Flask, FastAPI, or Streamlit. You’ll also explore serving your own models with Ollama, vLLM, or Text Generation Inference.

Goals for the Week

  • Integrate LLM APIs into web apps and backends using LangChain (sync/async, streaming, batching)
  • Handle rate limits (TPM/RPM), implement retry logic, and optimize costs (see the retry sketch after this list)
  • Deploy LLM services using serverless, containers, or managed inference
  • Serve open-source models with Ollama, TGI, or vLLM through LangChain
  • Implement best practices for security, monitoring, and observability with LangSmith
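
The rate-limit and retry goal above can be handled at two levels in LangChain. A minimal sketch, assuming langchain-openai is installed and OPENAI_API_KEY is set in the environment:

# Sketch of retry handling for rate limits (429s) and transient errors.
from langchain_openai import ChatOpenAI

# Client-level retries: the underlying OpenAI client retries transient errors,
# including rate-limit responses, with exponential backoff.
llm = ChatOpenAI(model="gpt-3.5-turbo", max_retries=5, timeout=30)

# Runnable-level retries: wrap any chain so failed calls are re-attempted.
resilient_llm = llm.with_retry(stop_after_attempt=3)

response = resilient_llm.invoke("Summarize what a TPM/RPM rate limit is.")
print(response.content)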

Learning Guide

Core Readings

Web Frameworks

  • Streamlit: Quick demos, internal tools, data apps with LangChain integration
  • FastAPI: Production REST APIs, async/await, high-performance backends
  • LangServe: Deploy LangChain chains as REST APIs with automatic OpenAPI docs (see the sketch after this list)
  • Full-Stack (Flask + React): Complete web applications
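
As a concrete reference for the LangServe item above, here is a minimal sketch, assuming langserve, langchain-openai, and uvicorn are installed (the /joke path and prompt are arbitrary choices):

# Serve a LangChain chain as a REST API with LangServe.
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langserve import add_routes

app = FastAPI(title="LangServe demo")
chain = ChatPromptTemplate.from_template("Tell me a joke about {topic}") | ChatOpenAI()

# Exposes /joke/invoke, /joke/stream, and /joke/batch, plus a playground and OpenAPI docs
add_routes(app, chain, path="/joke")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)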

Model Serving

  • Ollama: Local development, simple CLI/API, OpenAI-compatible
  • HuggingFace TGI: Production serving, tensor parallelism, quantization, flash attention
  • vLLM: High-throughput with PagedAttention, continuous batching, OpenAI-compatible
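
Because vLLM (and recent TGI versions) expose an OpenAI-compatible endpoint, LangChain's ChatOpenAI can point at a local server. A sketch, assuming a vLLM server is already running on its default port and using a Mistral model purely as an example:

# Talk to a local vLLM (or TGI) OpenAI-compatible server via LangChain.
# The server is started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
from langchain_openai import ChatOpenAI

local_llm = ChatOpenAI(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    base_url="http://localhost:8000/v1",         # vLLM's default OpenAI-compatible route
    api_key="not-needed",                        # placeholder; local servers usually ignore it
    temperature=0.2,
)

print(local_llm.invoke("Explain continuous batching in one sentence.").content)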

Provider SDKs & LangChain Integrations

| Provider | SDK | LangChain Package | Key Features |
|---|---|---|---|
| OpenAI | openai | langchain-openai | Chat completions, streaming, function calling |
| Anthropic | anthropic | langchain-anthropic | Extended context, constitutional AI |
| Cohere | cohere | langchain-cohere | Enterprise search, embeddings, reranking |
| Hugging Face | huggingface_hub | langchain-huggingface | 100k+ models, inference API, versioning |
| Ollama | API | langchain-community | Local models, OpenAI-compatible |
| Google AI | google-generativeai | langchain-google-genai | Multimodal, grounding with search |
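
One payoff of the LangChain packages above is that providers are swappable behind the same chain. A sketch, assuming OpenAI and Anthropic API keys are set and using example model names that may need updating:

# Run the same chain against two providers, swapping only the chat model class.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
parser = StrOutputParser()

for model in (ChatOpenAI(model="gpt-3.5-turbo"),
              ChatAnthropic(model="claude-3-haiku-20240307")):
    chain = prompt | model | parser
    print(type(model).__name__, "->",
          chain.invoke({"text": "LangChain abstracts provider SDKs."}))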

Examples

1. Streamlit Chatbot App

Build an interactive chatbot with streaming, session history, and token tracking using LangChain.

import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

st.title("💬 LLM Chatbot")

# Initialize LangChain ChatOpenAI
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    streaming=True,
    openai_api_key=st.secrets["OPENAI_API_KEY"]
)

if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Handle user input
if prompt := st.chat_input("What would you like to know?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    
    # Convert to LangChain message format
    lc_messages = [
        HumanMessage(content=m["content"]) if m["role"] == "user" 
        else AIMessage(content=m["content"])
        for m in st.session_state.messages
    ]
    
    # Stream response
    with st.chat_message("assistant"):
        response = st.write_stream(llm.stream(lc_messages))
    
    st.session_state.messages.append({"role": "assistant", "content": response})

Bonus: Add model selection, a temperature slider, and cost estimation with LangChain callbacks (sketched below)
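
A sketch of the bonus items, assuming the llm and Streamlit app above; get_openai_callback reports tokens and cost for non-streaming OpenAI calls, and the widget labels and model list are arbitrary choices:

# Sidebar controls plus token/cost tracking with LangChain's OpenAI callback.
from langchain_community.callbacks import get_openai_callback

# Feed these into ChatOpenAI(...) when constructing the model
model_name = st.sidebar.selectbox("Model", ["gpt-3.5-turbo", "gpt-4o-mini"])
temperature = st.sidebar.slider("Temperature", 0.0, 2.0, 0.7)

with get_openai_callback() as cb:
    reply = llm.invoke("Say hello in five words.")
    st.sidebar.metric("Tokens used", cb.total_tokens)
    st.sidebar.metric("Estimated cost (USD)", round(cb.total_cost, 6))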

2. FastAPI Production Backend

Create a REST API with /generate, /chat, and /embeddings endpoints using LangChain, with async/await, Pydantic validation, rate limiting, and error handling. (The code below covers /chat and /generate; the /embeddings endpoint is sketched after it.)

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
import os

app = FastAPI(title="LLM API")

# Initialize LangChain LLM
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)
parser = StrOutputParser()

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.7

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        # Create LangChain chain
        messages = [
            SystemMessage(content=request.system_prompt),
            HumanMessage(content=request.message)
        ]
        
        # Use LangChain with custom temperature
        custom_llm = ChatOpenAI(
            model=request.model,
            temperature=request.temperature,
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
        
        chain = custom_llm | parser
        response = await chain.ainvoke(messages)
        
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 100):
    try:
        # Bind max_tokens so the query parameter is actually applied to the call
        chain = llm.bind(max_tokens=max_tokens) | parser
        response = await chain.ainvoke([HumanMessage(content=prompt)])
        return {"text": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
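
The /embeddings endpoint mentioned in the task description isn't shown above. A sketch that extends the same app, assuming langchain-openai's OpenAIEmbeddings and an example embedding model name:

# /embeddings endpoint for the FastAPI app above.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

class EmbeddingRequest(BaseModel):
    texts: list[str]

@app.post("/embeddings")
async def embed(request: EmbeddingRequest):
    try:
        vectors = await embeddings.aembed_documents(request.texts)
        return {"embeddings": vectors, "dimensions": len(vectors[0])}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))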

Bonus: Add API key authentication and LangSmith tracing for monitoring (sketched below)
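
A sketch of both bonus items, extending the app above; the X-API-Key header name and APP_API_KEY variable are arbitrary choices, and LangSmith tracing is switched on purely through environment variables:

# Simple API-key auth plus LangSmith tracing for the FastAPI app above.
import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

# LangSmith tracing needs no code changes, only environment variables:
#   export LANGCHAIN_TRACING_V2=true
#   export LANGCHAIN_API_KEY=<your LangSmith key>
#   export LANGCHAIN_PROJECT=llm-api   # optional project name

api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    if key != os.getenv("APP_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return key

@app.post("/chat/secure", dependencies=[Depends(verify_api_key)])
async def secure_chat(request: ChatRequest):
    return await chat(request)  # reuse the handler defined above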

3. Self-Hosted Model Serving

Deploy Llama 3, Mistral, or Phi-3 locally with Ollama, connect to it via LangChain, and build a Streamlit or Gradio frontend (a Gradio sketch follows the examples below).

# Ollama setup (shell)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3
ollama serve   # starts the API server on http://localhost:11434 (skip if it is already running)

# Using LangChain with Ollama (Python)
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize Ollama through LangChain
llm = Ollama(model="llama3", base_url="http://localhost:11434")

# Create a simple chain
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()

# Use the chain
response = chain.invoke({"question": "Why is the sky blue?"})
print(response)
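
Streaming works the same way against the local model, since the chain is a standard LangChain Runnable; a quick sketch:

# Stream tokens from the local model as they are generated
for chunk in chain.stream({"question": "Why is the sky blue?"}):
    print(chunk, end="", flush=True)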

Alternative with ChatOllama for chat models:

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

chat_model = ChatOllama(model="llama3")
response = chat_model.invoke([HumanMessage(content="Why is the sky blue?")])
print(response.content)
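
For the Streamlit/Gradio frontend part of the task, a minimal Gradio sketch (assumes pip install gradio; chat history is ignored here for brevity):

# Gradio chat frontend for the local Ollama model.
import gradio as gr
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

chat_model = ChatOllama(model="llama3")

def respond(message, history):
    # A fuller app would replay `history` as alternating Human/AI messages
    return chat_model.invoke([HumanMessage(content=message)]).content

gr.ChatInterface(respond, title="Local Llama 3").launch()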

Bonus: Deploy to cloud (AWS EC2, GCP, Modal) with LangServe for production serving

Deployment Platforms

| Platform | Best For | Pros | Cons |
|---|---|---|---|
| Hugging Face Spaces | Quick demos, ML prototypes | Free tier, GPU support | Limited customization |
| Streamlit Cloud | Data apps, internal tools | Free for public apps | Streamlit-only |
| Vercel | Next.js apps, serverless | Edge network, excellent DX | Cold starts, time limits |
| Railway | Full-stack apps, databases | Simple config, persistent storage | Pricing can scale |
| Modal | GPU workloads, ML serving | Pay-per-use, fast cold starts | Newer platform |
| AWS / GCP / Azure | Enterprise, full control | Scalable, comprehensive | Complex setup, cost |
| Render | Web services, Docker | Auto-deploy from Git | Less flexible than AWS |
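
For the container-based options in the table, a minimal Dockerfile sketch for the FastAPI backend from Example 2, assuming the code lives in main.py with a requirements.txt alongside it:

# Dockerfile (sketch) for the FastAPI LLM backend
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# Pass the key at runtime: docker run -e OPENAI_API_KEY=... -p 8000:8000 <image>
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]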