Final Project – Full-Stack LLM Application

Introduction

Your final project synthesizes all course concepts into a production-ready LLM application. Working in teams, you’ll progress through structured milestones that mirror real-world development: from ideation and prompt engineering to RAG implementation, agent architecture, evaluation, security, deployment, and scaling. This project emphasizes not just building features, but making informed design decisions, measuring performance, and deploying responsibly.


Learning Objectives

  • Design and scope an LLM application addressing real-world use cases
  • Apply prompt engineering, tool calling, RAG, and multimodal capabilities systematically
  • Make evidence-based decisions on fine-tuning, agent architecture, and model selection
  • Implement comprehensive evaluation, security auditing, and performance monitoring
  • Deploy scalable, cost-efficient systems with proper documentation and observability

Project Milestones

Milestone 1: Form Team & Ideate

Week 1

  • Form team of 2 students
  • Brainstorm application ideas aligned with course techniques
  • Identify target users and core use case
  • Research existing solutions and identify gaps

Milestone 2: Submit Project Proposal

Week 2

  • Submit 1-2 page proposal including:
    • Problem statement and target users
    • Proposed solution and key features
    • Technical stack (APIs, frameworks, databases)
    • System architecture diagram
    • Success criteria and evaluation metrics
  • Get instructor feedback and approval

Milestone 3: Experiment with LLMs

Week 3

  • Test multiple models (GPT-5, Claude, open-source models from HF) for your use case
  • Compare outputs on representative examples
  • Document model selection rationale (quality, cost, latency)
  • Establish baseline performance metrics

Milestone 4: Design & Test Core Prompts

Week 4

  • Develop prompt templates for core functionality
  • Experiment with instruction, few-shot, and chain-of-thought techniques
  • Create test suite with edge cases
  • Iterate based on output quality and consistency

Milestone 5: Integrate Tool Calling

Week 5

  • Identify necessary external tools (search, calculator, APIs, databases)
  • Implement function calling with proper schemas
  • Add error handling and retry logic
  • Test tool selection accuracy and parameter correctness

Milestone 6: Evaluate & Justify Fine-tuning Decision

Week 6

  • Assess whether prompting alone suffices or fine-tuning is needed
  • If fine-tuning: collect/prepare training data, choose PEFT method (LoRA)
  • Document decision with evidence (cost, performance, maintenance)
  • For non-fine-tuning: optimize prompts and few-shot examples

Milestone 7: Implement RAG Pipeline

Week 7

  • Set up document processing: chunking, embeddings, vector database
  • Implement retrieval with similarity search and optional reranking
  • Integrate retrieved context into generation
  • Evaluate retrieval quality (precision@k, recall) and answer faithfulness

Milestone 8: Add Multimodal Capabilities

Week 8

  • Integrate vision, audio, or other modalities if applicable to your use case
  • Implement image understanding, document parsing, or multimodal search
  • Test cross-modal consistency and relevance
  • Document modality-specific prompt strategies

Milestone 9: Submit MVP (Minimum Viable Product)

Week 9

  • Deliver working prototype with core features
  • Include basic UI (Streamlit/Gradio) or API endpoints
  • Demonstrate end-to-end functionality
  • Present 5-minute demo to class for peer feedback

Milestone 10: Design Agent Architecture

Week 10

  • If using agents: implement ReAct, planner-executor, or multi-agent system
  • Define agent roles, planning logic, and tool orchestration
  • Add memory and state management
  • Test multi-step reasoning and task completion

Milestone 11: Measure Performance

Week 11

  • Implement comprehensive evaluation metrics (accuracy, relevance, coherence, faithfulness)
  • Conduct human evaluation or A/B testing
  • Use tools like TruLens, RAGAS, or custom test suites
  • Document performance across different scenarios and failure modes

Milestone 12: Security Audit

Week 12

  • Conduct red teaming: test for prompt injection, jailbreaks, data leakage
  • Implement input validation, output filtering, rate limiting
  • Review API key management and data privacy
  • Document security measures and known limitations

Milestone 13: Deploy & Test

Week 13

  • Deploy to cloud platform (HuggingFace Spaces, Streamlit Cloud, Docker, etc.)
  • Set up monitoring and logging (LangSmith, custom dashboards)
  • Load test with realistic usage patterns
  • Implement observability for debugging production issues

Milestone 14: Attempt to Scale & Calculate Costs

Week 14

  • Optimize inference: batching, caching, quantization
  • Implement model cascades or FrugalGPT strategies if applicable
  • Calculate projected costs (API calls, compute, storage)
  • Document cost-performance trade-offs and scaling bottlenecks

Milestone 15: Wrap Up & Go Live!

Week 15

  • Finalize documentation: README, architecture diagrams, API docs
  • Prepare presentation covering technical choices and lessons learned
  • Public launch with polished UI
  • Share demo link and repository
  • Present final project to class (10-minute presentation + Q&A)
  • Submit all deliverables

Timeline Overview

Week Due Date Milestone Key Activities Points
1 Jan 20 Form Team & Ideate Team formation, brainstorming, research 10
2 Jan 27 Submit Proposal Write and submit project proposal 10
3 Feb 3 Experiment with LLMs Model comparison and baseline testing 10
4 Feb 10 Design & Test Core Prompts Prompt engineering and iteration 10
5 Feb 17 Integrate Tool Calling Implement function calling and external tools 10
6 Feb 24 Evaluate Fine-tuning Decision Data collection, PEFT exploration, decision documentation 10
7 Mar 3 Implement RAG Pipeline Document processing, embeddings, retrieval, evaluation 10
8 Mar 10 Add Multimodal Capabilities Vision/audio integration (if applicable) +10 (Extra Credit)
9 Mar 17 Submit MVP Working prototype demo and peer feedback 10
10 Mar 24 Design Agent Architecture Agent planning, tool orchestration, memory 10
11 Mar 31 Measure Performance Comprehensive evaluation and testing 10
12 Apr 7 Security Audit Red teaming, input validation, safety testing 10
13 Apr 14 Deploy & Test Production deployment, monitoring, load testing 10
14 Apr 21 Scale & Calculate Costs Optimization, cost analysis, scaling strategies 10
15 Apr 28 Wrap Up & Go Live! Documentation, demo video, final preparation 10
Total 150 (140 + 10 extra credit)

Technical Stack Recommendations

Component Options
LLM APIs OpenAI (GPT-4, GPT-4o), Anthropic (Claude 3.5), Cohere, Google Gemini
Open-Source Models Llama 3.x, Mistral, Gemma (via HuggingFace, Ollama, vLLM)
Frontend Streamlit, Gradio, Flask + React, FastAPI + HTML/JS
Agent Frameworks LangChain, LlamaIndex, AutoGen, LangGraph, SmolAgents
Vector Databases FAISS, ChromaDB, Weaviate, Pinecone, Qdrant
Embeddings OpenAI embeddings, sentence-transformers, Cohere embeddings
Evaluation TruLens, RAGAS, PromptFoo, OpenAI Evals, custom test suites
Monitoring LangSmith, Weights & Biases, custom logging (Prometheus + Grafana)
Deployment HuggingFace Spaces, Streamlit Cloud, Render, Fly.io, Docker + AWS/GCP/Azure
Fine-tuning HuggingFace Trainer, Axolotl, OpenAI fine-tuning API, Cohere fine-tuning

Project Ideas & Examples

Domain Project Idea Real-World Examples
Legal Tech Contract analyzer with clause extraction, risk assessment, and Q&A over legal documents Harvey AI, Casetext CoCounsel, LawGeex, Spellbook
Education Adaptive tutor with multimodal support (diagrams, code), personalized explanations, and progress tracking Khan Academy Khanmigo, Duolingo Max, Cognii, Socratic by Google
Research Tools Scientific paper assistant with PDF parsing, citation analysis, and literature review generation Elicit, Semantic Scholar, Consensus, SciSpace Copilot
Customer Support Multi-turn chatbot with FAQ retrieval, ticket classification, and escalation logic Intercom Fin, Zendesk AI, Ada, Forethought
Creative Writing Story development tool with character consistency, plot outlining, and style adaptation Sudowrite, NovelAI, Jasper, Storyteller by Anthropic
Healthcare Medical literature Q&A with RAG over clinical guidelines (ensure regulatory compliance) Glass Health, Nabla Copilot, Hippocratic AI, Nuance DAX
Finance Investment research assistant with real-time data retrieval and risk analysis Bloomberg GPT, AlphaSense, FinChat, Daloopa
Developer Tools Code review agent with bug detection, refactoring suggestions, and documentation generation GitHub Copilot, Tabnine, Cursor, Codeium, Amazon CodeWhisperer

Evaluation & Testing Strategy

Model Evaluation:

  • Output quality: Accuracy, relevance, coherence, factuality
  • Use task-specific metrics: BLEU/ROUGE (summarization), EM/F1 (QA), classification accuracy
  • Model-based evaluation: GPT-4 as judge for open-ended tasks

RAG Evaluation:

  • Retrieval metrics: Precision@k, Recall@k, MRR
  • Generation metrics: Faithfulness, answer relevance, context precision (RAGAS)
  • End-to-end: Correctness of final answers with source attribution

Agent Evaluation:

  • Task completion rate, tool selection accuracy, reasoning coherence
  • Multi-step correctness, error recovery, state management
  • Use AgentBench or custom task scenarios

Safety & Robustness:

  • Red teaming: Prompt injection, jailbreaks, adversarial inputs
  • Bias testing: Demographic parity, stereotype amplification
  • Hallucination detection: Citation accuracy, fact-checking

Performance Metrics:

  • Latency (p50, p95, p99), throughput (requests/sec)
  • Cost per request, token usage efficiency
  • Error rates and failure analysis

Final Deliverables

Code & Documentation

  • GitHub Repository with clean code structure and version control history
  • README including:
    • Project overview and motivation
    • Architecture diagram
    • Setup instructions (dependencies, API keys, environment variables)
    • Usage examples and sample prompts
    • Known limitations and future work
  • Technical Documentation: Design decisions, model selection rationale, evaluation results
  • API Documentation (if applicable): Endpoint descriptions, request/response schemas

Deployment

  • Live Demo Link: Publicly accessible application (HuggingFace Space, Streamlit Cloud, etc.)
  • Monitoring Dashboard (optional): LangSmith traces, cost tracking, performance metrics

Presentation

  • 10-minute Demo Video covering:
    • Problem statement and user value proposition
    • Live walkthrough of key features
    • Technical architecture and design choices
    • Evaluation results and performance analysis
    • Challenges faced and lessons learned
  • Final Presentation (in-class): 10-minute talk + 5-minute Q&A

Evaluation Report

  • Performance Analysis: Metrics, benchmarks, comparison to baselines
  • Cost Analysis: API costs, compute expenses, projected scaling costs
  • Security & Safety: Red teaming results, mitigation strategies
  • User Feedback (if applicable): Usability testing, A/B test results

Grading Rubric

Criteria Points Description
Milestones & Process 140 Completion of all 14 weekly milestones (10 points each)
Multimodal Capabilities +10 Extra Credit: Vision/audio integration
Technical Implementation & Evaluation 15
- Core Functionality 5 Working application with essential features
- Integration of Course Concepts 4 Effective use of prompting, RAG, tools, agents, or fine-tuning
- Code Quality & Architecture 3 Clean code, modular design, proper error handling
- Comprehensive Metrics & Testing 3 Task-appropriate evaluation metrics, performance analysis, safety testing
Design & Innovation 15
- Problem Scoping & Justification 6 Clear use case with evidence of need
- Technical Decisions & Trade-offs 5 Justified model selection, architecture choices, cost-performance analysis
- Creativity & Novelty 4 Innovative features or approaches
Deployment & Usability 15
- Production Deployment 6 Live, accessible application with monitoring and uptime
- User Experience 5 Intuitive interface, clear instructions, error messages
- Documentation 4 Complete setup guide, usage examples, architecture docs
Presentation & Communication 15
- Demo Quality 6 Clear demonstration of features and value
- Technical Communication 5 Explanation of design choices and trade-offs
- Evaluation Results & Insights 4 Data-driven insights and lessons learned
TOTAL 210 (200 base + 10 extra credit)


Tips for Success

Start Simple, Iterate Fast

  • Begin with basic prompts and a single model
  • Add complexity incrementally after validating core functionality
  • Don’t over-engineer early—focus on user value first

Make Data-Driven Decisions

  • Document all technical choices with evidence (metrics, cost analysis)
  • Compare alternatives quantitatively before committing
  • Track metrics from day one to measure progress

Plan for Failure Modes

  • Test edge cases, adversarial inputs, and error conditions early
  • Implement graceful degradation and informative error messages
  • Budget time for debugging and iteration

Communicate Regularly

  • Schedule weekly team check-ins
  • Document decisions and blockers in GitHub issues
  • Seek instructor feedback at milestone checkpoints

Think Production from Day One

  • Use environment variables for API keys (never commit secrets)
  • Implement logging and monitoring early
  • Write deployment instructions as you build

Leverage Course Resources

  • Reuse code patterns from assignments and readings
  • Consult office hours for technical guidance
  • Share learnings with classmates (collaboration is encouraged!)

Resources

Example Projects

Deployment Guides

Evaluation Tools

Best Practices