Introduction
Your final project synthesizes all course concepts into a production-ready LLM application. Working in teams, you’ll progress through structured milestones that mirror real-world development: from ideation and prompt engineering to RAG implementation, agent architecture, evaluation, security, deployment, and scaling. This project emphasizes not just building features, but making informed design decisions, measuring performance, and deploying responsibly.
Learning Objectives
- Design and scope an LLM application addressing real-world use cases
- Apply prompt engineering, tool calling, RAG, and multimodal capabilities systematically
- Make evidence-based decisions on fine-tuning, agent architecture, and model selection
- Implement comprehensive evaluation, security auditing, and performance monitoring
- Deploy scalable, cost-efficient systems with proper documentation and observability
Project Milestones
Milestone 1: Form Team & Ideate
Week 1
- Form team of 2 students
- Brainstorm application ideas aligned with course techniques
- Identify target users and core use case
- Research existing solutions and identify gaps
Milestone 2: Submit Project Proposal
Week 2
- Submit 1-2 page proposal including:
- Problem statement and target users
- Proposed solution and key features
- Technical stack (APIs, frameworks, databases)
- System architecture diagram
- Success criteria and evaluation metrics
- Get instructor feedback and approval
Milestone 3: Experiment with LLMs
Week 3
- Test multiple models (GPT-5, Claude, open-source models from HF) for your use case
- Compare outputs on representative examples
- Document model selection rationale (quality, cost, latency)
- Establish baseline performance metrics
Milestone 4: Design & Test Core Prompts
Week 4
- Develop prompt templates for core functionality
- Experiment with instruction, few-shot, and chain-of-thought techniques
- Create test suite with edge cases
- Iterate based on output quality and consistency
Milestone 5: Integrate Tool Calling
Week 5
- Identify necessary external tools (search, calculator, APIs, databases)
- Implement function calling with proper schemas
- Add error handling and retry logic
- Test tool selection accuracy and parameter correctness
Milestone 6: Evaluate & Justify Fine-tuning Decision
Week 6
- Assess whether prompting alone suffices or fine-tuning is needed
- If fine-tuning: collect/prepare training data, choose PEFT method (LoRA)
- Document decision with evidence (cost, performance, maintenance)
- For non-fine-tuning: optimize prompts and few-shot examples
Milestone 7: Implement RAG Pipeline
Week 7
- Set up document processing: chunking, embeddings, vector database
- Implement retrieval with similarity search and optional reranking
- Integrate retrieved context into generation
- Evaluate retrieval quality (precision@k, recall) and answer faithfulness
Milestone 8: Add Multimodal Capabilities
Week 8
- Integrate vision, audio, or other modalities if applicable to your use case
- Implement image understanding, document parsing, or multimodal search
- Test cross-modal consistency and relevance
- Document modality-specific prompt strategies
Milestone 9: Submit MVP (Minimum Viable Product)
Week 9
- Deliver working prototype with core features
- Include basic UI (Streamlit/Gradio) or API endpoints
- Demonstrate end-to-end functionality
- Present 5-minute demo to class for peer feedback
Milestone 10: Design Agent Architecture
Week 10
- If using agents: implement ReAct, planner-executor, or multi-agent system
- Define agent roles, planning logic, and tool orchestration
- Add memory and state management
- Test multi-step reasoning and task completion
Milestone 11: Measure Performance
Week 11
- Implement comprehensive evaluation metrics (accuracy, relevance, coherence, faithfulness)
- Conduct human evaluation or A/B testing
- Use tools like TruLens, RAGAS, or custom test suites
- Document performance across different scenarios and failure modes
Milestone 12: Security Audit
Week 12
- Conduct red teaming: test for prompt injection, jailbreaks, data leakage
- Implement input validation, output filtering, rate limiting
- Review API key management and data privacy
- Document security measures and known limitations
Milestone 13: Deploy & Test
Week 13
- Deploy to cloud platform (HuggingFace Spaces, Streamlit Cloud, Docker, etc.)
- Set up monitoring and logging (LangSmith, custom dashboards)
- Load test with realistic usage patterns
- Implement observability for debugging production issues
Milestone 14: Attempt to Scale & Calculate Costs
Week 14
- Optimize inference: batching, caching, quantization
- Implement model cascades or FrugalGPT strategies if applicable
- Calculate projected costs (API calls, compute, storage)
- Document cost-performance trade-offs and scaling bottlenecks
Milestone 15: Wrap Up & Go Live!
Week 15
- Finalize documentation: README, architecture diagrams, API docs
- Prepare presentation covering technical choices and lessons learned
- Public launch with polished UI
- Share demo link and repository
- Present final project to class (10-minute presentation + Q&A)
- Submit all deliverables
Timeline Overview
| Week | Due Date | Milestone | Key Activities | Points |
|---|---|---|---|---|
| 1 | Jan 20 | Form Team & Ideate | Team formation, brainstorming, research | 10 |
| 2 | Jan 27 | Submit Proposal | Write and submit project proposal | 10 |
| 3 | Feb 3 | Experiment with LLMs | Model comparison and baseline testing | 10 |
| 4 | Feb 10 | Design & Test Core Prompts | Prompt engineering and iteration | 10 |
| 5 | Feb 17 | Integrate Tool Calling | Implement function calling and external tools | 10 |
| 6 | Feb 24 | Evaluate Fine-tuning Decision | Data collection, PEFT exploration, decision documentation | 10 |
| 7 | Mar 3 | Implement RAG Pipeline | Document processing, embeddings, retrieval, evaluation | 10 |
| 8 | Mar 10 | Add Multimodal Capabilities | Vision/audio integration (if applicable) | +10 (Extra Credit) |
| 9 | Mar 17 | Submit MVP | Working prototype demo and peer feedback | 10 |
| 10 | Mar 24 | Design Agent Architecture | Agent planning, tool orchestration, memory | 10 |
| 11 | Mar 31 | Measure Performance | Comprehensive evaluation and testing | 10 |
| 12 | Apr 7 | Security Audit | Red teaming, input validation, safety testing | 10 |
| 13 | Apr 14 | Deploy & Test | Production deployment, monitoring, load testing | 10 |
| 14 | Apr 21 | Scale & Calculate Costs | Optimization, cost analysis, scaling strategies | 10 |
| 15 | Apr 28 | Wrap Up & Go Live! | Documentation, demo video, final preparation | 10 |
| Total | 150 (140 + 10 extra credit) |
Technical Stack Recommendations
| Component | Options |
|---|---|
| LLM APIs | OpenAI (GPT-4, GPT-4o), Anthropic (Claude 3.5), Cohere, Google Gemini |
| Open-Source Models | Llama 3.x, Mistral, Gemma (via HuggingFace, Ollama, vLLM) |
| Frontend | Streamlit, Gradio, Flask + React, FastAPI + HTML/JS |
| Agent Frameworks | LangChain, LlamaIndex, AutoGen, LangGraph, SmolAgents |
| Vector Databases | FAISS, ChromaDB, Weaviate, Pinecone, Qdrant |
| Embeddings | OpenAI embeddings, sentence-transformers, Cohere embeddings |
| Evaluation | TruLens, RAGAS, PromptFoo, OpenAI Evals, custom test suites |
| Monitoring | LangSmith, Weights & Biases, custom logging (Prometheus + Grafana) |
| Deployment | HuggingFace Spaces, Streamlit Cloud, Render, Fly.io, Docker + AWS/GCP/Azure |
| Fine-tuning | HuggingFace Trainer, Axolotl, OpenAI fine-tuning API, Cohere fine-tuning |
Project Ideas & Examples
| Domain | Project Idea | Real-World Examples |
|---|---|---|
| Legal Tech | Contract analyzer with clause extraction, risk assessment, and Q&A over legal documents | Harvey AI, Casetext CoCounsel, LawGeex, Spellbook |
| Education | Adaptive tutor with multimodal support (diagrams, code), personalized explanations, and progress tracking | Khan Academy Khanmigo, Duolingo Max, Cognii, Socratic by Google |
| Research Tools | Scientific paper assistant with PDF parsing, citation analysis, and literature review generation | Elicit, Semantic Scholar, Consensus, SciSpace Copilot |
| Customer Support | Multi-turn chatbot with FAQ retrieval, ticket classification, and escalation logic | Intercom Fin, Zendesk AI, Ada, Forethought |
| Creative Writing | Story development tool with character consistency, plot outlining, and style adaptation | Sudowrite, NovelAI, Jasper, Storyteller by Anthropic |
| Healthcare | Medical literature Q&A with RAG over clinical guidelines (ensure regulatory compliance) | Glass Health, Nabla Copilot, Hippocratic AI, Nuance DAX |
| Finance | Investment research assistant with real-time data retrieval and risk analysis | Bloomberg GPT, AlphaSense, FinChat, Daloopa |
| Developer Tools | Code review agent with bug detection, refactoring suggestions, and documentation generation | GitHub Copilot, Tabnine, Cursor, Codeium, Amazon CodeWhisperer |
Evaluation & Testing Strategy
Model Evaluation:
- Output quality: Accuracy, relevance, coherence, factuality
- Use task-specific metrics: BLEU/ROUGE (summarization), EM/F1 (QA), classification accuracy
- Model-based evaluation: GPT-4 as judge for open-ended tasks
RAG Evaluation:
- Retrieval metrics: Precision@k, Recall@k, MRR
- Generation metrics: Faithfulness, answer relevance, context precision (RAGAS)
- End-to-end: Correctness of final answers with source attribution
Agent Evaluation:
- Task completion rate, tool selection accuracy, reasoning coherence
- Multi-step correctness, error recovery, state management
- Use AgentBench or custom task scenarios
Safety & Robustness:
- Red teaming: Prompt injection, jailbreaks, adversarial inputs
- Bias testing: Demographic parity, stereotype amplification
- Hallucination detection: Citation accuracy, fact-checking
Performance Metrics:
- Latency (p50, p95, p99), throughput (requests/sec)
- Cost per request, token usage efficiency
- Error rates and failure analysis
Final Deliverables
Code & Documentation
- GitHub Repository with clean code structure and version control history
- README including:
- Project overview and motivation
- Architecture diagram
- Setup instructions (dependencies, API keys, environment variables)
- Usage examples and sample prompts
- Known limitations and future work
- Technical Documentation: Design decisions, model selection rationale, evaluation results
- API Documentation (if applicable): Endpoint descriptions, request/response schemas
Deployment
- Live Demo Link: Publicly accessible application (HuggingFace Space, Streamlit Cloud, etc.)
- Monitoring Dashboard (optional): LangSmith traces, cost tracking, performance metrics
Presentation
- 10-minute Demo Video covering:
- Problem statement and user value proposition
- Live walkthrough of key features
- Technical architecture and design choices
- Evaluation results and performance analysis
- Challenges faced and lessons learned
- Final Presentation (in-class): 10-minute talk + 5-minute Q&A
Evaluation Report
- Performance Analysis: Metrics, benchmarks, comparison to baselines
- Cost Analysis: API costs, compute expenses, projected scaling costs
- Security & Safety: Red teaming results, mitigation strategies
- User Feedback (if applicable): Usability testing, A/B test results
Grading Rubric
| Criteria | Points | Description |
|---|---|---|
| Milestones & Process | 140 | Completion of all 14 weekly milestones (10 points each) |
| Multimodal Capabilities | +10 | Extra Credit: Vision/audio integration |
| Technical Implementation & Evaluation | 15 | |
| - Core Functionality | 5 | Working application with essential features |
| - Integration of Course Concepts | 4 | Effective use of prompting, RAG, tools, agents, or fine-tuning |
| - Code Quality & Architecture | 3 | Clean code, modular design, proper error handling |
| - Comprehensive Metrics & Testing | 3 | Task-appropriate evaluation metrics, performance analysis, safety testing |
| Design & Innovation | 15 | |
| - Problem Scoping & Justification | 6 | Clear use case with evidence of need |
| - Technical Decisions & Trade-offs | 5 | Justified model selection, architecture choices, cost-performance analysis |
| - Creativity & Novelty | 4 | Innovative features or approaches |
| Deployment & Usability | 15 | |
| - Production Deployment | 6 | Live, accessible application with monitoring and uptime |
| - User Experience | 5 | Intuitive interface, clear instructions, error messages |
| - Documentation | 4 | Complete setup guide, usage examples, architecture docs |
| Presentation & Communication | 15 | |
| - Demo Quality | 6 | Clear demonstration of features and value |
| - Technical Communication | 5 | Explanation of design choices and trade-offs |
| - Evaluation Results & Insights | 4 | Data-driven insights and lessons learned |
| TOTAL | 210 | (200 base + 10 extra credit) |
Tips for Success
Start Simple, Iterate Fast
- Begin with basic prompts and a single model
- Add complexity incrementally after validating core functionality
- Don’t over-engineer early—focus on user value first
Make Data-Driven Decisions
- Document all technical choices with evidence (metrics, cost analysis)
- Compare alternatives quantitatively before committing
- Track metrics from day one to measure progress
Plan for Failure Modes
- Test edge cases, adversarial inputs, and error conditions early
- Implement graceful degradation and informative error messages
- Budget time for debugging and iteration
Communicate Regularly
- Schedule weekly team check-ins
- Document decisions and blockers in GitHub issues
- Seek instructor feedback at milestone checkpoints
Think Production from Day One
- Use environment variables for API keys (never commit secrets)
- Implement logging and monitoring early
- Write deployment instructions as you build
Leverage Course Resources
- Reuse code patterns from assignments and readings
- Consult office hours for technical guidance
- Share learnings with classmates (collaboration is encouraged!)
Resources
Example Projects
- LangChain Templates: Production-ready application templates
- HuggingFace Spaces: Browse deployed LLM applications for inspiration
- LlamaIndex Examples: RAG and agent implementations
Deployment Guides
Evaluation Tools
Best Practices