Introduction
Prompt engineering is the art and science of crafting inputs that elicit the desired output from a large language model. Since most LLMs (e.g., GPT-4, Claude, Cohere, open-source chat models) are not fine-tuned for every use case, prompts serve as the primary interface for shaping model behavior. This week, we’ll cover how to structure prompts, from basic instructions to few-shot examples and chain-of-thought (CoT) reasoning [1], plus practical templates, iteration strategies, and evaluation. Prompt engineering enables task adaptation without retraining the model and is the fastest lever for improving LLM application quality.
Goals for the Week
- Understand foundational prompt formats: instruction [2], few-shot [3], and CoT prompting [1].
- Analyze how prompt phrasing, constraints, and structure affect output quality and consistency.
- Use prompt templates programmatically (string templates, chat message stacks, prompt libraries); a minimal message-stack sketch follows this list.
- Build robust prompts for common tasks: summarization, translation, classification, extraction, reasoning.
- Evaluate and iterate prompts with checklists and lightweight metrics.
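For the template goal above, here is a minimal sketch of a chat message stack, assuming the OpenAI-style role/content message format; the build_messages helper and the example reviews are illustrative, not from a specific library:

def build_messages(system_instructions, examples, user_input):
    # Assemble a chat message stack: system prompt, few-shot demonstrations, then the new input.
    messages = [{"role": "system", "content": system_instructions}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

# Example: a two-shot sentiment classifier assembled as a message stack.
messages = build_messages(
    "Classify the sentiment of each review as positive, neutral, or negative.",
    [("Great battery life!", "positive"), ("Arrived broken.", "negative")],
    "The packaging was fine, nothing special.",
)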
Learning Guide
Readings
- Prompt engineering guide by DAIR.AI: Exhaustive list of prompt techniques with examples
Foundational Research:
- Chain-of-Thought Prompting [1]: Enables complex reasoning in large language models through intermediate reasoning steps
- Few-Shot Learning [3]: In-context learning with demonstrations for task adaptation without parameter updates
- Instruction Following [2]: Training and prompting language models to follow natural language instructions
- Prompt Design [4]: Systematic approaches to iterative prompt development and optimization
- Prompting examples from API providers
Best Practices
Use these practical patterns to improve reliability and control:
- Be clear and explicit
  - Define the task, audience, constraints, and desired tone.
  - Provide the input within delimiters (e.g., triple backticks ```…```).
- Structure the output
  - Request specific formats: JSON, tables, bullet lists, HTML.
  - Specify keys/fields, ordering, and validation rules.
- Set constraints
  - Word/sentence/character limits; required sections; enumerated steps.
  - Control temperature and randomness in API settings.
- Encourage reasoning
  - Ask for step-by-step analysis or “explain before answering”.
  - For math/logic, require explicit intermediate steps or self-checks [1].
- Use few-shot examples
  - Show 1–3 examples with inputs and ideal outputs [3]; a short few-shot template is sketched after this list.
  - Keep examples representative and concise.
- Iterate systematically
  - Identify issues (length, focus, format, correctness) and refine [4].
  - Add missing constraints; change audience; enforce schema.
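As a starting point for the few-shot pattern above, here is a minimal few-shot classification template in the same style as the templates in the next section; the demonstration reviews are made up for illustration:

prompt = f"""
Classify the sentiment of the review as one of: positive, neutral, negative.

Review: ```Great battery life, arrived a day early.```
Sentiment: positive

Review: ```The screen cracked after one week.```
Sentiment: negative

Review: ```{review}```
Sentiment:"""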
Prompt Templates (Copy/Paste)
Summarization (length + focus)
prompt = f"""
You are an assistant that writes concise summaries.
Summarize the text delimited by triple backticks in at most 3 sentences,
focusing on shipping and delivery issues.
Text: ```{text}```
"""
Classification (labels + justification)
prompt = f"""
Classify the sentiment of the review as one of: positive, neutral, negative.
Return JSON with fields: label, confidence, rationale (1–2 sentences).
Review: ```{review}```
"""
Information Extraction (strict JSON)
prompt = f"""
Extract the following fields from the text, return STRICT JSON only:
{{
"product": string,
"brand": string,
"issue": string|null,
"shipping": {{
"mentioned": boolean,
"details": string|null
}}
}}
If unknown, use null.
Text: ```{text}```
"""
Chain-of-Thought (visible reasoning)
prompt = f"""
Solve the problem step by step. Show your reasoning, then provide the final answer
on a new line prefixed with "Answer:".
Problem: ```{problem}```
"""
Chain-of-Thought (concise, hidden reasoning)
prompt = f"""
Think through the problem privately. Then provide only the final answer
on a single line prefixed with "Answer:" without revealing intermediate steps.
Problem: ```{problem}```
"""
Data Validation (self-check)
prompt = f"""
Return JSON per the schema below. After producing the JSON, verify that:
- All required keys are present; no extra keys.
- Values match the expected types.
If the validation fails, correct the JSON and output only the corrected JSON.
Schema: {{
"title": string,
"items": array<string>,
"count": integer
}}
Input: ```{input_text}```
"""
Debugging and Iteration
Common issues and targeted fixes:
- Output too long → Add word/sentence limits; require bullet points.
- Wrong focus → Specify audience and what to emphasize/omit.
- Unstructured output → Enforce JSON/table format with explicit keys.
- Hallucinated facts → Require citations or “only use provided context”.
- Schema drift → Add self-checks and explicit validation instructions.
Iteration loop:
Draft prompt → Run → Inspect output → Identify gaps → Add constraints/examples → Re-run
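As a small, hypothetical illustration of one pass through this loop (the v1/v2 wording below is an example, not prescribed phrasing):

# v1: outputs came back as long paragraphs with extra commentary (too long, unstructured).
prompt_v1 = "Summarize the review: ```{review}```"

# v2: add a length limit, a focus, and an explicit format to close those gaps.
prompt_v2 = (
    "Summarize the review delimited by triple backticks in at most 2 bullet points, "
    "focusing only on product defects. Return a Markdown bullet list and nothing else.\n"
    "Review: ```{review}```"
)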
Evaluation & Guardrails
- Determinism: Use temperature=0 for reproducibility in evaluation.
- Schema checks: Parse/validate JSON; reject invalid outputs.
- Self-consistency: Sample multiple solutions and pick the majority answer (optional) [5]; a minimal sketch follows this list.
- Safety: Instruct the model to refuse harmful content; constrain it to the provided context [6].
- Metrics: Track task accuracy, format compliance rate, length adherence.
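A minimal self-consistency sketch, assuming the legacy openai (<1.0) client used in the Quick Start Code below and the “Answer:” convention from the CoT template; the sample count and answer parsing are illustrative:

from collections import Counter
import openai  # legacy openai<1.0 interface, as in the Quick Start Code below

def self_consistent_answer(prompt, n_samples=5):
    # Sample several chain-of-thought completions and keep the majority final answer.
    answers = []
    for _ in range(n_samples):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # sampling diversity; temperature=0 would return identical samples
        )
        text = response.choices[0].message.content
        if "Answer:" in text:
            # Keep only the text after the final "Answer:" line, per the CoT template above.
            answers.append(text.rsplit("Answer:", 1)[1].strip())
    return Counter(answers).most_common(1)[0][0] if answers else None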
Programming Practice
Core Tasks
- Create and test prompts for summarization, sentiment classification, extraction, and QA using HuggingFace/OpenAI/Cohere APIs.
- Implement zero-shot vs. few-shot vs. CoT variants; compare accuracy and format compliance.
- Add JSON schema validation to extraction prompts; record valid vs. invalid rates (a jsonschema-based sketch follows this list).
- Iterate prompts to fix one concrete issue (e.g., too verbose, missing fields) and document before/after outputs.
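A minimal sketch for the schema-validation task above, assuming the third-party jsonschema package and mirroring the extraction template's fields:

import json
from jsonschema import ValidationError, validate

# Mirrors the extraction template; type lists allow null where the prompt says string|null.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "product": {"type": "string"},
        "brand": {"type": "string"},
        "issue": {"type": ["string", "null"]},
        "shipping": {
            "type": "object",
            "properties": {
                "mentioned": {"type": "boolean"},
                "details": {"type": ["string", "null"]},
            },
            "required": ["mentioned", "details"],
        },
    },
    "required": ["product", "brand", "issue", "shipping"],
    "additionalProperties": False,
}

def is_schema_valid(output_text):
    # Count an output as valid only if it parses as JSON and matches the schema.
    try:
        validate(instance=json.loads(output_text), schema=EXTRACTION_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False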
Quick Start Code (OpenAI Example)
import openai  # legacy openai<1.0 SDK interface; newer SDKs use openai.OpenAI().chat.completions
import json

def test_prompt(prompt, test_cases):
    """Run a prompt template (with {placeholders}) against test cases and record outputs."""
    results = []
    for case in test_cases:
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt.format(**case)}],
            temperature=0  # deterministic outputs for evaluation
        )
        results.append({
            "input": case,
            "output": response.choices[0].message.content,
            "valid_json": is_valid_json(response.choices[0].message.content)
        })
    return results

def is_valid_json(text):
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
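A hypothetical usage example tying test_prompt to the classification template above; note that test_prompt expects a plain template string with {placeholders}, not an already-formatted f-string:

classification_template = """
Classify the sentiment of the review as one of: positive, neutral, negative.
Return JSON with fields: label, confidence, rationale (1–2 sentences).
Review: ```{review}```
"""

test_cases = [
    {"review": "Shipping took three weeks and the box was crushed."},
    {"review": "Exactly as described, works great."},
]

results = test_prompt(classification_template, test_cases)
print(sum(r["valid_json"] for r in results), "of", len(results), "outputs returned valid JSON")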
Assessment Rubric (Flexible - Adapt to Your Needs)
- Task performance: Does the output meet the objective? (0–2)
- Format compliance: Does it match requested structure? (0–2)
- Constraint adherence: Length/keys/tone satisfied? (0–2)
- Clarity: Is the output readable and useful? (0–2)
- Iteration quality: Are refinements targeted and effective? (0–2)
Max: 10 points. Document examples and decisions.
References
1. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35. arXiv:2201.11903.
2. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35. arXiv:2203.02155.
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33. arXiv:2005.14165.
4. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2021). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv:2107.13586.
5. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., … & Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171. (ICLR 2023)
6. Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., … & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.