Week 12: LLM APIs & Deployment

Introduction

This week focuses on integrating LLMs into production-grade applications using APIs, SDKs, and frameworks like LangChain. You’ll learn how to call LLM APIs (OpenAI, Anthropic, Cohere, Hugging Face) through LangChain abstractions, manage streaming/batching, handle rate limits and costs, and deploy backends using Flask, FastAPI, or Streamlit. You’ll also explore serving your own models with Ollama, vLLM, or Text Generation Inference.

Goals for the Week

  • Integrate LLM APIs into web apps and backends using LangChain (sync/async, streaming, batching)
  • Handle rate limits (TPM/RPM), implement retry logic, and optimize costs (see the retry sketch after this list)
  • Deploy LLM services using serverless, containers, or managed inference
  • Serve open-source models with Ollama, TGI, or vLLM through LangChain
  • Implement best practices for security, monitoring, and observability with LangSmith
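
The rate-limit and retry goal above can be handled at two levels in LangChain. A minimal sketch, assuming langchain-openai is installed and OPENAI_API_KEY is set in the environment:

# Sketch of retry handling for rate limits (429s) and transient errors.
from langchain_openai import ChatOpenAI

# Client-level retries: the underlying OpenAI client retries transient errors,
# including rate-limit responses, with exponential backoff.
llm = ChatOpenAI(model="gpt-3.5-turbo", max_retries=5, timeout=30)

# Runnable-level retries: wrap any chain so failed calls are re-attempted.
resilient_llm = llm.with_retry(stop_after_attempt=3)

response = resilient_llm.invoke("Summarize what a TPM/RPM rate limit is.")
print(response.content)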

Learning Guide

Core Readings

Web Frameworks

  • Streamlit: Quick demos, internal tools, data apps with LangChain integration
  • FastAPI: Production REST APIs, async/await, high-performance backends
  • LangServe: Deploy LangChain chains as REST APIs with automatic OpenAPI docs (see the sketch after this list)
  • Full-Stack (Flask + React): Complete web applications
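
As a concrete reference for the LangServe item above, here is a minimal sketch, assuming langserve, langchain-openai, and uvicorn are installed (the /joke path and prompt are arbitrary choices):

# Serve a LangChain chain as a REST API with LangServe.
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langserve import add_routes

app = FastAPI(title="LangServe demo")
chain = ChatPromptTemplate.from_template("Tell me a joke about {topic}") | ChatOpenAI()

# Exposes /joke/invoke, /joke/stream, and /joke/batch, plus a playground and OpenAPI docs
add_routes(app, chain, path="/joke")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)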

Model Serving

  • Ollama: Local development, simple CLI/API, OpenAI-compatible
  • HuggingFace TGI: Production serving, tensor parallelism, quantization, flash attention
  • vLLM: High-throughput with PagedAttention, continuous batching, OpenAI-compatible
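
Because vLLM (and recent TGI versions) expose an OpenAI-compatible endpoint, LangChain's ChatOpenAI can point at a local server. A sketch, assuming a vLLM server is already running on its default port and using a Mistral model purely as an example:

# Talk to a local vLLM (or TGI) OpenAI-compatible server via LangChain.
# The server is started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
from langchain_openai import ChatOpenAI

local_llm = ChatOpenAI(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model
    base_url="http://localhost:8000/v1",         # vLLM's default OpenAI-compatible route
    api_key="not-needed",                        # placeholder; local servers usually ignore it
    temperature=0.2,
)

print(local_llm.invoke("Explain continuous batching in one sentence.").content)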

Provider SDKs & LangChain Integrations

| Provider | SDK | LangChain Package | Key Features |
|---|---|---|---|
| OpenAI | openai | langchain-openai | Chat completions, streaming, function calling |
| Anthropic | anthropic | langchain-anthropic | Extended context, constitutional AI |
| Cohere | cohere | langchain-cohere | Enterprise search, embeddings, reranking |
| Hugging Face | huggingface_hub | langchain-huggingface | 100k+ models, inference API, versioning |
| Ollama | API | langchain-community | Local models, OpenAI-compatible |
| Google AI | google-generativeai | langchain-google-genai | Multimodal, grounding with search |
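
One payoff of the LangChain packages above is that providers are swappable behind the same chain. A sketch, assuming OpenAI and Anthropic API keys are set and using example model names that may need updating:

# Run the same chain against two providers, swapping only the chat model class.
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
parser = StrOutputParser()

for model in (ChatOpenAI(model="gpt-3.5-turbo"),
              ChatAnthropic(model="claude-3-haiku-20240307")):
    chain = prompt | model | parser
    print(type(model).__name__, "->",
          chain.invoke({"text": "LangChain abstracts provider SDKs."}))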

Examples

1. Streamlit Chatbot App

Build an interactive chatbot with streaming, session history, and token tracking using LangChain.

import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

st.title("💬 LLM Chatbot")

# Initialize LangChain ChatOpenAI
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    streaming=True,
    openai_api_key=st.secrets["OPENAI_API_KEY"]
)

if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Handle user input
if prompt := st.chat_input("What would you like to know?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    
    # Convert to LangChain message format
    lc_messages = [
        HumanMessage(content=m["content"]) if m["role"] == "user" 
        else AIMessage(content=m["content"])
        for m in st.session_state.messages
    ]
    
    # Stream response
    with st.chat_message("assistant"):
        response = st.write_stream(llm.stream(lc_messages))
    
    st.session_state.messages.append({"role": "assistant", "content": response})

Bonus: Add model selection, a temperature slider, and cost estimation with LangChain callbacks (sketched below)
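
A sketch of the bonus items, assuming the llm and Streamlit app above; get_openai_callback reports tokens and cost for non-streaming OpenAI calls, and the widget labels and model list are arbitrary choices:

# Sidebar controls plus token/cost tracking with LangChain's OpenAI callback.
from langchain_community.callbacks import get_openai_callback

# Feed these into ChatOpenAI(...) when constructing the model
model_name = st.sidebar.selectbox("Model", ["gpt-3.5-turbo", "gpt-4o-mini"])
temperature = st.sidebar.slider("Temperature", 0.0, 2.0, 0.7)

with get_openai_callback() as cb:
    reply = llm.invoke("Say hello in five words.")
    st.sidebar.metric("Tokens used", cb.total_tokens)
    st.sidebar.metric("Estimated cost (USD)", round(cb.total_cost, 6))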

2. FastAPI Production Backend

Create a REST API with /generate, /chat, and /embeddings endpoints using LangChain, with async/await, Pydantic validation, rate limiting, and error handling. (The code below covers /chat and /generate; the /embeddings endpoint is sketched after it.)

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
import os

app = FastAPI(title="LLM API")

# Initialize LangChain LLM
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)
parser = StrOutputParser()

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.7

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        # Create LangChain chain
        messages = [
            SystemMessage(content=request.system_prompt),
            HumanMessage(content=request.message)
        ]
        
        # Use LangChain with custom temperature
        custom_llm = ChatOpenAI(
            model=request.model,
            temperature=request.temperature,
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
        
        chain = custom_llm | parser
        response = await chain.ainvoke(messages)
        
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 100):
    try:
        # Bind max_tokens so the query parameter is actually applied to the call
        chain = llm.bind(max_tokens=max_tokens) | parser
        response = await chain.ainvoke([HumanMessage(content=prompt)])
        return {"text": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
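
The /embeddings endpoint mentioned in the task description isn't shown above. A sketch that extends the same app, assuming langchain-openai's OpenAIEmbeddings and an example embedding model name:

# /embeddings endpoint for the FastAPI app above.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

class EmbeddingRequest(BaseModel):
    texts: list[str]

@app.post("/embeddings")
async def embed(request: EmbeddingRequest):
    try:
        vectors = await embeddings.aembed_documents(request.texts)
        return {"embeddings": vectors, "dimensions": len(vectors[0])}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))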

Bonus: Add API key authentication and LangSmith tracing for monitoring (sketched below)
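
A sketch of both bonus items, extending the app above; the X-API-Key header name and APP_API_KEY variable are arbitrary choices, and LangSmith tracing is switched on purely through environment variables:

# Simple API-key auth plus LangSmith tracing for the FastAPI app above.
import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

# LangSmith tracing needs no code changes, only environment variables:
#   export LANGCHAIN_TRACING_V2=true
#   export LANGCHAIN_API_KEY=<your LangSmith key>
#   export LANGCHAIN_PROJECT=llm-api   # optional project name

api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(key: str = Security(api_key_header)):
    if key != os.getenv("APP_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return key

@app.post("/chat/secure", dependencies=[Depends(verify_api_key)])
async def secure_chat(request: ChatRequest):
    return await chat(request)  # reuse the handler defined above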

3. Self-Hosted Model Serving

Deploy Llama 3, Mistral, or Phi-3 locally with Ollama, connect to it via LangChain, and build a Streamlit or Gradio frontend (a Gradio sketch follows the examples below).

# Ollama setup (shell)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3
ollama serve   # starts the API server on http://localhost:11434 (skip if it is already running)

# Using LangChain with Ollama (Python)
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize Ollama through LangChain
llm = Ollama(model="llama3", base_url="http://localhost:11434")

# Create a simple chain
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()

# Use the chain
response = chain.invoke({"question": "Why is the sky blue?"})
print(response)
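
Streaming works the same way against the local model, since the chain is a standard LangChain Runnable; a quick sketch:

# Stream tokens from the local model as they are generated
for chunk in chain.stream({"question": "Why is the sky blue?"}):
    print(chunk, end="", flush=True)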

Alternative with ChatOllama for chat models:

from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

chat_model = ChatOllama(model="llama3")
response = chat_model.invoke([HumanMessage(content="Why is the sky blue?")])
print(response.content)
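
For the Streamlit/Gradio frontend part of the task, a minimal Gradio sketch (assumes pip install gradio; chat history is ignored here for brevity):

# Gradio chat frontend for the local Ollama model.
import gradio as gr
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

chat_model = ChatOllama(model="llama3")

def respond(message, history):
    # A fuller app would replay `history` as alternating Human/AI messages
    return chat_model.invoke([HumanMessage(content=message)]).content

gr.ChatInterface(respond, title="Local Llama 3").launch()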

Bonus: Deploy to cloud (AWS EC2, GCP, Modal) with LangServe for production serving

Deployment Platforms

| Platform | Best For | Pros | Cons |
|---|---|---|---|
| Hugging Face Spaces | Quick demos, ML prototypes | Free tier, GPU support | Limited customization |
| Streamlit Cloud | Data apps, internal tools | Free for public apps | Streamlit-only |
| Vercel | Next.js apps, serverless | Edge network, excellent DX | Cold starts, time limits |
| Railway | Full-stack apps, databases | Simple config, persistent storage | Pricing can scale |
| Modal | GPU workloads, ML serving | Pay-per-use, fast cold starts | Newer platform |
| AWS / GCP / Azure | Enterprise, full control | Scalable, comprehensive | Complex setup, cost |
| Render | Web services, Docker | Auto-deploy from Git | Less flexible than AWS |
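
For the container-based options in the table, a minimal Dockerfile sketch for the FastAPI backend from Example 2, assuming the code lives in main.py with a requirements.txt alongside it:

# Dockerfile (sketch) for the FastAPI LLM backend
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# Pass the key at runtime: docker run -e OPENAI_API_KEY=... -p 8000:8000 <image>
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]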