Week 5: Tool Use with LLMs

Introduction

Large Language Models (LLMs) excel at language generation, but their capabilities are fundamentally limited by their training data cutoff and inability to access real-time information or perform precise computations. Tool use (also called function calling or tool augmentation) enables LLMs to overcome these limitations by interfacing with external systems: search engines, calculators, APIs, databases, and code interpreters. This week focuses on the fundamentals of tool calling: how LLMs are configured to select and invoke external tools, how to design effective tool schemas, and how to implement basic function calling with major providers (OpenAI, Anthropic, Cohere) and open-source models. Note: This week introduces tool calling mechanics—we’ll explore advanced agent architectures, planning, and multi-step reasoning in Week 11: Agents & Planning.

Goals for the Week

  • Understand why tool use is necessary and when to use it vs. direct prompting.
  • Learn the mechanics of function calling: tool schemas, parameter passing, result handling.
  • Implement single-step and basic multi-step tool calling with OpenAI, Anthropic, and open-source models.
  • Design clear tool descriptions and robust parameter schemas.
  • Handle errors, validate outputs, and implement retry logic.
  • Evaluate tool usage: selection accuracy, parameter correctness, response quality.

Learning Guide

Why Tool Use Matters

Limitations of Pure LLMs:

  • Knowledge Cutoff: Training data is frozen; cannot access current events or real-time data.
  • Computational Weakness: Struggle with precise arithmetic, complex calculations, symbolic reasoning.
  • Hallucination Risk: May confidently generate plausible but incorrect information.
  • No External State: Cannot read files, query databases, or interact with APIs.

Benefits of Tool-Augmented LLMs:

  • Grounding: Retrieve factual information from authoritative sources (search, databases, knowledge graphs).
  • Precision: Delegate calculations to deterministic tools (calculators, code interpreters).
  • Actuation: Perform real-world actions (send emails, update databases, control systems).
  • Specialization: Access domain-specific tools (weather APIs, financial data, medical databases).

Core Concepts

Function Calling: The Basics

Function calling is an API-level feature where LLMs output structured function calls (JSON) that your code executes [1].

Basic Flow:

  1. Define Tools: Specify available functions with schemas (name, description, parameters)
  2. Send Query: User asks a question requiring external information
  3. Model Decides: LLM determines which tool(s) to call and with what arguments
  4. Execute Tool: Your code runs the actual function
  5. Return Result: Send tool output back to LLM
  6. Generate Answer: LLM synthesizes final response using tool results
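
Steps 4 and 5 are ordinary application code: look up the tool name the model emitted in a registry of local functions, call it, and serialize the result to send back. A minimal sketch of that dispatch follows; the weather stub and registry are hypothetical placeholders.

import json

def get_weather(location: str, units: str = "celsius") -> dict:
    """Hypothetical stub standing in for a real weather API call."""
    return {"temp": 22, "condition": "sunny", "units": units}

TOOL_REGISTRY = {"get_weather": get_weather}   # tool name -> local function

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Step 4: dispatch the model's tool call to local code.
    Step 5: return a JSON string suitable for the tool-result message."""
    args = json.loads(arguments_json)
    result = TOOL_REGISTRY[name](**args)
    return json.dumps(result)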

Tool Schema Design

Tools are described via structured schemas (function signatures):

{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g., 'London' or 'New York, NY'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature units"
      }
    },
    "required": ["location"]
  }
}

Best practices: Clear descriptions, explicit types, required fields, enums for constraints.
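
Because the parameters block is standard JSON Schema, model-generated arguments can be validated before anything executes. A minimal sketch using the third-party jsonschema package (an assumption; any JSON Schema validator works):

import json

from jsonschema import ValidationError, validate  # pip install jsonschema

WEATHER_PARAMS_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

def parse_weather_args(arguments_json: str) -> dict:
    """Parse the model's argument string and reject it if it violates the schema."""
    args = json.loads(arguments_json)
    try:
        validate(instance=args, schema=WEATHER_PARAMS_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Invalid tool arguments: {err.message}") from err
    return args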

Videos & Interactive Tutorials

Recommended Courses:

API Documentation & Guides

Major Provider APIs:

Open-Source Frameworks:

Key Research Papers

  • ReAct: Synergizing Reasoning and Acting in Language Models [3]: Introduces the ReAct pattern (reasoning + acting) for agentic workflows.
  • Toolformer: Language Models Can Teach Themselves to Use Tools [4]: Shows LLMs can learn when and how to use tools via self-supervised learning.
  • Gorilla: Large Language Model Connected with Massive APIs [5]: Fine-tunes LLMs for API calling with retrieval-augmented generation.
  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [6]: Creates the ToolBench dataset for evaluating tool use across diverse APIs.

Function Calling Patterns

Single-Step Function Calling

Pattern: LLM analyzes query → selects tool → generates arguments → you execute → LLM synthesizes final answer.

Use cases: Simple lookups, single calculations, straightforward API calls.

Example flow:

User: "What's the weather in Tokyo?"
LLM: [calls get_weather(location="Tokyo")]
System: [executes, returns {temp: 22, condition: "sunny"}]
LLM: "The weather in Tokyo is sunny with a temperature of 22°C."

Parallel Function Calling

Pattern: LLM calls multiple tools simultaneously when they don’t depend on each other.

Use cases: Aggregating information from multiple sources, batch operations.

Example:

User: "Compare weather in Tokyo and Paris"
LLM: [calls get_weather(location="Tokyo"), get_weather(location="Paris")]
System: [executes both in parallel]
LLM: [synthesizes comparison]
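
With the OpenAI API, parallel calls arrive as multiple entries in the assistant message's tool_calls list. Since the calls are independent, your code can actually run them concurrently; a sketch, assuming a hypothetical tool_registry dict mapping tool names to local functions:

import json
from concurrent.futures import ThreadPoolExecutor

def execute_parallel_calls(tool_calls, tool_registry):
    """Run independent tool calls concurrently and return one tool-result
    message per call, preserving the tool_call_id pairing."""
    def run_one(call):
        func = tool_registry[call.function.name]
        return call.id, func(**json.loads(call.function.arguments))

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_one, tool_calls))

    return [
        {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
        for call_id, result in results
    ]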

Sequential Function Calling

Pattern: Use output from one tool call as input to another.

Use cases: Multi-step workflows where steps depend on previous results.

Example:

User: "Search for Python creator and tell me their age in 2024"
LLM: [calls search("Python creator")]
System: [returns "Guido van Rossum, born 1956"]
LLM: [calls calculate("2024 - 1956")]
System: [returns 68]
LLM: "Guido van Rossum, Python's creator, was 68 years old in 2024."

Tool Design Best Practices

1. Clear Tool Descriptions

# Good: Specific, explicit constraints
{
    "name": "search_arxiv",
    "description": "Search arXiv for academic papers. Returns title, authors, abstract, and PDF link. Use for recent research papers (physics, CS, math). Limit results to 5.",
    "parameters": {...}
}

# Bad: Vague, ambiguous
{
    "name": "search",
    "description": "Search for stuff",
    "parameters": {...}
}

2. Robust Parameter Schemas

  • Use enums for categorical choices
  • Mark fields as required or optional explicitly
  • Provide examples in descriptions
  • Use type constraints (integer, string, array, etc.)
  • Add validation rules (min/max, regex patterns)
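
For example, a hypothetical search_papers tool combining these constraints (enum, integer bounds, a regex pattern, and a required field):

search_papers_schema = {
    "name": "search_papers",
    "description": "Search an academic index. Example query: 'diffusion models'.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "minLength": 3,
                "description": "Free-text search query, e.g. 'graph neural networks'"
            },
            "category": {"type": "string", "enum": ["cs", "math", "physics"]},
            "max_results": {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
            "year": {
                "type": "string",
                "pattern": "^[0-9]{4}$",
                "description": "Four-digit publication year"
            }
        },
        "required": ["query"]
    }
}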

3. Error Handling & Feedback

def get_weather(location: str) -> dict:
    try:
        result = api.fetch_weather(location)
        return {"success": True, "data": result}
    except LocationNotFound:
        return {
            "success": False, 
            "error": f"Location '{location}' not found. Try 'City, Country' format."
        }
    except APIError as e:
        return {
            "success": False,
            "error": f"Weather API error: {str(e)}. Try again later."
        }

Key: Return structured errors that help the LLM retry with corrections.

4. Tool Composability

Design tools to work together:

  • search → extract → summarize
  • translate → validate → store
  • fetch_data → analyze → visualize
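
A hypothetical sketch of such a composition, with stub implementations standing in for real tools; each stage returns plain data the next can consume:

def search(query: str) -> list[str]:
    """Stub: return candidate URLs for a query."""
    return ["https://example.com/article-1", "https://example.com/article-2"]

def extract(url: str) -> str:
    """Stub: fetch and return the main text of a page."""
    return f"Main text of {url}"

def summarize(text: str) -> str:
    """Stub: condense text to a short summary."""
    return text[:200]

def research_pipeline(topic: str) -> str:
    """Compose search -> extract -> summarize into one workflow."""
    urls = search(topic)
    passages = [extract(url) for url in urls[:3]]
    return summarize("\n\n".join(passages))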

5. Rate Limits & Cost Control

  • Implement retry logic with exponential backoff
  • Set max iterations for agent loops (prevent infinite loops)
  • Track tool usage and costs per session
  • Use caching for repeated identical calls
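
A sketch of the retry and caching points using only the standard library; the cached weather stub is hypothetical and stands in for a real API client:

import functools
import random
import time

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a flaky tool call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:    # out of retries: surface the error
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())

@functools.lru_cache(maxsize=256)             # cache repeated identical calls
def get_weather_cached(location: str) -> str:
    # Hypothetical underlying call; replace with a real weather API request.
    return call_with_backoff(lambda: f"22C and sunny in {location}")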

Implementation Examples

OpenAI Function Calling (Basic)

import json
import openai

# Define tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., 'London' or 'New York, NY'"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "celsius"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Call with tool
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"  # or "required" or {"type": "function", "function": {"name": "get_weather"}}
)

# Check if tool was called
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
  
    # Execute tool (your code)
    result = get_weather(**function_args)
  
    # Send result back to model
    messages = [
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        }
    ]
    final_response = openai.chat.completions.create(
        model="gpt-4",
        messages=messages
    )
    print(final_response.choices[0].message.content)

LangChain ReAct Agent

from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain import hub

# Define tools
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # NOTE: eval() on untrusted input is unsafe; acceptable for a demo only.
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {str(e)}"

def search_wikipedia(query: str) -> str:
    """Search Wikipedia for information."""
    # Implement actual Wikipedia API call
    return f"Wikipedia results for: {query}"

tools = [
    Tool(
        name="Calculator",
        func=calculate,
        description="Useful for mathematical calculations. Input should be a valid Python expression."
    ),
    Tool(
        name="WikipediaSearch",
        func=search_wikipedia,
        description="Search Wikipedia for factual information. Input should be a search query."
    )
]

# Create agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
prompt = hub.pull("hwchase17/react")  # Standard ReAct prompt template
agent = create_react_agent(llm, tools, prompt)

# Execute
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=5  # Prevent infinite loops
)

result = agent_executor.invoke({
    "input": "What is 25% of 300? Also, when was Python created?"
})
print(result["output"])

Anthropic Tool Use (Claude)

import json

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Define tools
tools = [
    {
        "name": "get_stock_price",
        "description": "Get current stock price for a company ticker symbol",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g., AAPL, GOOGL)"
                }
            },
            "required": ["ticker"]
        }
    }
]

# Initial request
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the current price of Apple stock?"}]
)

# Check for tool use
if message.stop_reason == "tool_use":
    tool_use_block = next(block for block in message.content if block.type == "tool_use")
    tool_name = tool_use_block.name
    tool_input = tool_use_block.input
  
    # Execute tool
    result = get_stock_price(tool_input["ticker"])
  
    # Continue conversation
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the current price of Apple stock?"},
            {"role": "assistant", "content": message.content},
            {
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use_block.id,
                        "content": json.dumps(result)
                    }
                ]
            }
        ]
    )
    print(response.content[0].text)

Open-Source Model with SmolAgents (HuggingFace)

from smolagents import CodeAgent, DuckDuckGoSearchTool, PythonInterpreterTool, HfApiModel

# Initialize model and tools
model = HfApiModel(model_id="Qwen/Qwen2.5-72B-Instruct")
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), PythonInterpreterTool()],
    model=model,
    max_steps=10
)

# Run agent
result = agent.run(
    "Search for the latest Python release date, then calculate how many days ago it was."
)
print(result)

Evaluation & Best Practices

Key Metrics

  1. Tool Selection Accuracy: Were the right tools chosen?
  2. Argument Correctness: Were tool parameters valid and appropriate?
  3. Response Quality: Does the final answer correctly use tool results?
  4. Latency: Total time including tool execution
  5. Cost: API calls, token usage, tool invocation costs
  6. Error Handling: Recovery from invalid inputs or tool failures
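
A lightweight way to track the first two metrics is to label the expected call for each test query and compare it against what the model actually produced. Exact argument matching is a simplification (equivalent expressions score as misses), and the test-case and prediction formats below are hypothetical:

test_cases = [
    {"query": "What's the weather in Tokyo?",
     "expected_tool": "get_weather",
     "expected_args": {"location": "Tokyo"}},
    {"query": "Calculate 15% of 450",
     "expected_tool": "calculate",
     "expected_args": {"expression": "0.15 * 450"}},
]

def score(predictions):
    """`predictions` is a list of (tool_name, args_dict) in test-case order."""
    tool_hits = args_hits = 0
    for case, (tool, args) in zip(test_cases, predictions):
        tool_hits += tool == case["expected_tool"]
        args_hits += tool == case["expected_tool"] and args == case["expected_args"]
    n = len(test_cases)
    return {"tool_selection_accuracy": tool_hits / n,
            "argument_accuracy": args_hits / n}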

Common Issues & Solutions

Issue               | Cause                  | Solution
Wrong tool selected | Ambiguous descriptions | Improve tool descriptions, add examples
Invalid parameters  | Schema mismatch        | Stricter validation, better type hints, add constraints
Tool failures       | API errors, timeouts   | Return structured errors with retry suggestions
Hallucinated data   | No validation          | Validate tool outputs before returning to LLM
Cost overruns       | Excessive calls        | Set rate limits, cache repeated calls

Debugging Tips

# Log all tool calls for inspection
import functools

def logged_tool(func):
    @functools.wraps(func)  # preserve the wrapped tool's name and docstring
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with {args}, {kwargs}")
        result = func(*args, **kwargs)
        print(f"Result: {result}")
        return result
    return wrapper

@logged_tool
def get_weather(location: str) -> dict:
    # implementation
    pass

Programming Practice

Core Exercises

1. Single-Step Function Calling

Objective: Implement basic OpenAI/Anthropic function calling.

Tasks:

  • Define 3 tools: get_weather, calculate, search_wikipedia
  • Write tool execution functions with error handling
  • Build request-response loop: query → tool call → execution → final answer
  • Test with 10 diverse queries covering all tools
  • Measure: tool selection accuracy, parameter correctness, response quality

Starter code:

import openai

def get_weather(location: str, units: str = "celsius") -> dict:
    # Implement (mock or real API)
    pass

def calculate(expression: str) -> dict:
    # Implement safe eval
    pass

def search_wikipedia(query: str) -> dict:
    # Implement Wikipedia API call
    pass

# Define tool schemas
tools = [...]  # Fill in schemas

# Test queries
queries = [
    "What's the weather in Tokyo?",
    "Calculate 15% of 450",
    "When was Albert Einstein born?",
    # Add 7 more diverse queries
]

# Implement and evaluate

2. Parallel Function Calling

Objective: Implement parallel tool calls for efficiency.

Tasks:

  • Use OpenAI’s parallel function calling feature
  • Test queries requiring multiple independent tools
  • Compare latency: parallel vs. sequential
  • Handle partial failures (some tools succeed, others fail)

Example queries:

  • “Compare weather in Tokyo and Paris”
  • “Get stock prices for AAPL, GOOGL, and MSFT”

3. Sequential Tool Calling

Objective: Chain tools where output of one feeds into another.

Tasks:

  • Implement 2-3 step workflows
  • Handle intermediate results and error propagation
  • Track reasoning: why each tool was called

Example queries:

  • “Search for Python creator and calculate their age in 2024”
  • “Find the population of France and divide by 10”

4. Custom Tool Design

Objective: Design and integrate a domain-specific tool.

Options:

  • Academic: Search arXiv, fetch paper metadata
  • Financial: Get stock prices, calculate portfolio value
  • Data analysis: Load CSV, compute basic statistics
  • Web: Fetch and parse webpage content

Requirements:

  • Clear tool schema with validation
  • Robust error handling
  • Unit tests for tool function
  • 5 test cases demonstrating value

5. Error Handling

Objective: Handle tool failures gracefully.

Tasks:

  • Implement tools that occasionally fail (simulate API errors, timeouts)
  • Return structured error messages
  • Test LLM’s ability to retry with corrected parameters
  • Measure: recovery success rate

Failure injection:

import random

def unreliable_tool(param: str) -> dict:
    if random.random() < 0.3:  # 30% failure rate
        return {"success": False, "error": "Service temporarily unavailable. Try again."}
    return {"success": True, "result": ...}

Assessment Rubric

Criterion           | Points | Description
Tool Implementation | 0-3    | Correct schemas, error handling, documentation
Function Calling    | 0-3    | Proper integration with LLM API, handles responses
Error Handling      | 0-2    | Graceful failures, informative error messages
Testing             | 0-2    | Diverse test cases, measures accuracy
Code Quality        | 0-2    | Clean code, modular design
Documentation       | 0-1    | Clear explanations, examples

Total: 13 points. Adapt based on project scope.

Additional Resources

Tool Libraries & Examples

Benchmarks

  • ToolBench: 16,000+ real-world APIs for evaluation
  • API-Bank: Tool-augmented LLM benchmark [7]

Note: For agent frameworks (LangGraph, AutoGen, CrewAI), planning strategies, and multi-step reasoning, see Week 11: Agents & Planning.

Web Tools

  • WebArena: Realistic web task environment

Tutorials & Courses

References


  1. OpenAI. (2023). Function Calling and Other API Updates. OpenAI Blog. Retrieved from https://openai.com/blog/function-calling-and-other-api-updates

  2. Anthropic. (2024). Tool Use (Function Calling). Anthropic Documentation. Retrieved from https://docs.anthropic.com/en/docs/build-with-claude/tool-use

  3. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR). arXiv:2210.03629

  4. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., … & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems, 36. arXiv:2302.04761

  5. Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint. arXiv:2305.15334

  6. Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., … & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint. arXiv:2307.16789

  7. Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., & Li, Y. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. arXiv:2304.08244