Introduction
This week focuses on integrating LLMs into production-grade applications using APIs, SDKs, and frameworks like LangChain. You’ll learn how to call LLM APIs (OpenAI, Anthropic, Cohere, Hugging Face) through LangChain abstractions, manage streaming/batching, handle rate limits and costs, and deploy backends using Flask, FastAPI, or Streamlit. You’ll also explore serving your own models with Ollama, vLLM, or Text Generation Inference.
Goals for the Week
- Integrate LLM APIs into web apps and backends using LangChain (sync/async, streaming, batching)
- Handle rate limits (TPM/RPM), implement retry logic, and optimize costs (see the retry/batching sketch after this list)
- Deploy LLM services using serverless, containers, or managed inference
- Serve open-source models with Ollama, TGI, or vLLM through LangChain
- Implement best practices for security, monitoring, and observability with LangSmith
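The sketch below shows one way to cover the retry and batching goals with LangChain alone, assuming langchain-openai is installed and OPENAI_API_KEY is set; the retry count and concurrency cap are illustrative, not recommendations.

```python
# Minimal sketch: built-in retries plus client-side batching to stay under TPM/RPM limits
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    max_retries=5,   # retry with backoff on rate-limit and transient server errors
    timeout=30,      # fail fast instead of hanging on a stuck request
)

prompts = [
    "Summarize LangChain in one sentence.",
    "Name three LLM providers.",
]

# .batch() fans the requests out concurrently; max_concurrency caps parallel calls
# so a burst of traffic does not blow through your requests-per-minute quota.
responses = llm.batch(prompts, config={"max_concurrency": 2})
for r in responses:
    print(r.content)
```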
Learning Guide
Core Readings
- 10 ways to serve LLMs: Comprehensive guide covering deployment from cloud APIs to self-hosted solutions
- HuggingFace Inference Providers: Overview of Inference API, Endpoints, and deployment options
- LangChain Documentation: Framework for building LLM applications with chains, agents, and integrations
Web Frameworks
- Streamlit: Quick demos, internal tools, data apps with LangChain integration
- FastAPI: Production REST APIs, async/await, high-performance backends
- LangServe: Deploy LangChain chains as REST APIs with automatic OpenAPI docs (see the sketch after this list)
- Full-Stack (Flask + React): Complete web applications
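As referenced above, here is a minimal LangServe sketch, assuming langserve, langchain-openai, and uvicorn are installed; the chain and file name are placeholders.

```python
# add_routes() exposes any LangChain runnable as /invoke, /batch, and /stream endpoints,
# with auto-generated OpenAPI docs and an interactive playground at /chain/playground
from fastapi import FastAPI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langserve import add_routes

app = FastAPI(title="LangServe demo")
chain = ChatPromptTemplate.from_template("Tell me a fact about {topic}") | ChatOpenAI()

add_routes(app, chain, path="/chain")

# Run with: uvicorn server:app --port 8000
```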
Model Serving
- Ollama: Local development, simple CLI/API, OpenAI-compatible
- HuggingFace TGI: Production serving, tensor parallelism, quantization, flash attention
- vLLM: High-throughput with PagedAttention, continuous batching, OpenAI-compatible (see the sketch after this list)
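A minimal sketch of the vLLM route, assuming a GPU machine with vllm installed; the model name and port are examples. Because vLLM exposes an OpenAI-compatible endpoint, the standard LangChain OpenAI client can talk to it by overriding base_url.

```python
# Launch vLLM's OpenAI-compatible server first (command and model are examples):
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --port 8000
# Then point the standard LangChain OpenAI client at the local endpoint.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible route
    api_key="not-needed",                 # vLLM does not check the key by default
)
print(llm.invoke("Why is the sky blue?").content)
```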
Provider SDKs & LangChain Integrations
| Provider | SDK | LangChain Package | Key Features |
|---|---|---|---|
| OpenAI | openai | langchain-openai | Chat completions, streaming, function calling |
| Anthropic | anthropic | langchain-anthropic | Extended context, constitutional AI |
| Cohere | cohere | langchain-cohere | Enterprise search, embeddings, reranking |
| Hugging Face | huggingface_hub | langchain-huggingface | 100k+ models, inference API, versioning |
| Ollama | API | langchain-community | Local models, OpenAI-compatible |
| Google AI | google-generativeai | langchain-google-genai | Multimodal, grounding with search |
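Because every provider in the table ships a LangChain chat-model class, swapping providers usually means changing one constructor while the rest of the chain stays put. A small sketch, assuming langchain-openai and langchain-anthropic are installed with their API keys set (model names are examples):

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Explain {concept} in one sentence.")
parser = StrOutputParser()

# The surrounding chain is identical; only the model object changes.
for llm in (ChatOpenAI(model="gpt-3.5-turbo"),
            ChatAnthropic(model="claude-3-haiku-20240307")):
    chain = prompt | llm | parser
    print(chain.invoke({"concept": "continuous batching"}))
```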
Examples
1. Streamlit Chatbot App
Build an interactive chatbot with streaming, session history, and token tracking using LangChain.
```python
import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

st.title("💬 LLM Chatbot")

# Initialize LangChain ChatOpenAI
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    streaming=True,
    openai_api_key=st.secrets["OPENAI_API_KEY"]
)

if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Handle user input
if prompt := st.chat_input("What would you like to know?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Convert to LangChain message format
    lc_messages = [
        HumanMessage(content=m["content"]) if m["role"] == "user"
        else AIMessage(content=m["content"])
        for m in st.session_state.messages
    ]

    # Stream response
    with st.chat_message("assistant"):
        response = st.write_stream(llm.stream(lc_messages))
    st.session_state.messages.append({"role": "assistant", "content": response})
```
Bonus: Add model selection, temperature slider, cost estimation with LangChain callbacks
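For the cost-estimation bonus, one option is LangChain's OpenAI callback, which tallies tokens and an estimated dollar cost per call. A hedged sketch (not the app's required design); note that token usage is reported most reliably on non-streaming calls, and the sidebar metrics are just one way to surface it:

```python
import streamlit as st
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_community.callbacks import get_openai_callback

llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=st.secrets["OPENAI_API_KEY"])

with get_openai_callback() as cb:
    # Token usage is captured for this non-streaming call
    answer = llm.invoke([HumanMessage(content="Ping?")]).content

st.sidebar.metric("Prompt tokens", cb.prompt_tokens)
st.sidebar.metric("Completion tokens", cb.completion_tokens)
st.sidebar.metric("Estimated cost (USD)", f"${cb.total_cost:.5f}")
```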
2. FastAPI Production Backend
Create a REST API with /chat, /generate, and /embeddings endpoints using LangChain, featuring async/await, Pydantic validation, rate limiting, and error handling.
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
import os

app = FastAPI(title="LLM API")

# Initialize LangChain LLM
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)
parser = StrOutputParser()

class ChatRequest(BaseModel):
    message: str
    system_prompt: str = "You are a helpful assistant."
    model: str = "gpt-3.5-turbo"
    temperature: float = 0.7

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        messages = [
            SystemMessage(content=request.system_prompt),
            HumanMessage(content=request.message)
        ]
        # Build a per-request LLM so model and temperature can vary per call
        custom_llm = ChatOpenAI(
            model=request.model,
            temperature=request.temperature,
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
        chain = custom_llm | parser
        response = await chain.ainvoke(messages)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 100):
    try:
        # Bind max_tokens so the parameter is actually passed to the model
        chain = llm.bind(max_tokens=max_tokens) | parser
        response = await chain.ainvoke([HumanMessage(content=prompt)])
        return {"text": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
Bonus: Add API key authentication, LangSmith tracing for monitoring
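For the authentication bonus, a common FastAPI pattern is an X-API-Key header checked by a dependency; the sketch below is illustrative (the SERVICE_API_KEY variable name is an assumption, not part of the example above), and LangSmith tracing is switched on purely through environment variables.

```python
import os
from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def require_api_key(api_key: str = Security(api_key_header)):
    # Compare against a server-side secret; reject missing or wrong keys
    if api_key != os.getenv("SERVICE_API_KEY"):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key

# Attach to an endpoint:
# @app.post("/chat", dependencies=[Depends(require_api_key)])

# LangSmith tracing is enabled via environment variables, e.g.:
#   export LANGCHAIN_TRACING_V2=true
#   export LANGCHAIN_API_KEY=<your key>
#   export LANGCHAIN_PROJECT=llm-api
```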
3. Self-Hosted Model Serving
Deploy Llama 3, Mistral, or Phi-3 locally with Ollama, connect to it via LangChain, and build a Streamlit or Gradio frontend.
```bash
# Ollama setup
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &        # start the server if the installer has not already started it
ollama pull llama3
```

```python
# Using LangChain with Ollama
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize Ollama through LangChain
llm = Ollama(model="llama3", base_url="http://localhost:11434")

# Create a simple chain
prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
chain = prompt | llm | StrOutputParser()

# Use the chain
response = chain.invoke({"question": "Why is the sky blue?"})
print(response)
```
Alternative with ChatOllama for chat models:
```python
from langchain_community.chat_models import ChatOllama
from langchain_core.messages import HumanMessage

chat_model = ChatOllama(model="llama3")
response = chat_model.invoke([HumanMessage(content="Why is the sky blue?")])
print(response.content)
```
Bonus: Deploy to cloud (AWS EC2, GCP, Modal) with LangServe for production serving
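For the LangServe part of the bonus, the local Ollama chain can be exposed as a REST API in a few lines; a sketch, assuming langserve and uvicorn are installed and Ollama is running (the file name is a placeholder):

```python
from fastapi import FastAPI
from langchain_community.llms import Ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langserve import add_routes

# Rebuild the same chain as the example above and expose it over HTTP
llm = Ollama(model="llama3", base_url="http://localhost:11434")
chain = (ChatPromptTemplate.from_template("Answer this question: {question}")
         | llm
         | StrOutputParser())

app = FastAPI(title="Local Llama 3 API")
add_routes(app, chain, path="/llama")

# Run with: uvicorn serve_ollama:app --host 0.0.0.0 --port 8080
```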
Deployment Platforms
| Platform | Best For | Pros | Cons |
|---|---|---|---|
| Hugging Face Spaces | Quick demos, ML prototypes | Free tier, GPU support | Limited customization |
| Streamlit Cloud | Data apps, internal tools | Free for public apps | Streamlit-only |
| Vercel | Next.js apps, serverless | Edge network, excellent DX | Cold starts, time limits |
| Railway | Full-stack apps, databases | Simple config, persistent storage | Pricing can scale |
| Modal | GPU workloads, ML serving | Pay-per-use, fast cold starts | Newer platform |
| AWS / GCP / Azure | Enterprise, full control | Scalable, comprehensive | Complex setup, cost |
| Render | Web services, Docker | Auto-deploy from Git | Less flexible than AWS |