Week 5: Tool Use with LLMs

Introduction

Large Language Models (LLMs) excel at language generation, but their capabilities are fundamentally limited by their training data cutoff and inability to access real-time information or perform precise computations. Tool use (also called function calling or tool augmentation) enables LLMs to overcome these limitations by interfacing with external systems: search engines, calculators, APIs, databases, and code interpreters. This week focuses on the fundamentals of tool calling: how LLMs are configured to select and invoke external tools, how to design effective tool schemas, and how to implement basic function calling with major providers (OpenAI, Anthropic, Cohere) and open-source models. Note: This week introduces tool calling mechanics—we’ll explore advanced agent architectures, planning, and multi-step reasoning in Week 11: Agents & Planning.

Goals for the Week

  • Understand why tool use is necessary and when to use it vs. direct prompting.
  • Learn the mechanics of function calling: tool schemas, parameter passing, result handling.
  • Implement single-step and basic multi-step tool calling with OpenAI, Anthropic, and open-source models.
  • Design clear tool descriptions and robust parameter schemas.
  • Handle errors, validate outputs, and implement retry logic.
  • Evaluate tool usage: selection accuracy, parameter correctness, response quality.

Learning Guide

Why Tool Use Matters

Limitations of Pure LLMs:

  • Knowledge Cutoff: Training data is frozen; cannot access current events or real-time data.
  • Computational Weakness: Struggle with precise arithmetic, complex calculations, symbolic reasoning.
  • Hallucination Risk: May confidently generate plausible but incorrect information.
  • No External State: Cannot read files, query databases, or interact with APIs.

Benefits of Tool-Augmented LLMs:

  • Grounding: Retrieve factual information from authoritative sources (search, databases, knowledge graphs).
  • Precision: Delegate calculations to deterministic tools (calculators, code interpreters).
  • Actuation: Perform real-world actions (send emails, update databases, control systems).
  • Specialization: Access domain-specific tools (weather APIs, financial data, medical databases).

Core Concepts

Function Calling: The Basics

Function calling is an API-level feature where LLMs output structured function calls (JSON) that your code executes [1].

Basic Flow:

  1. Define Tools: Specify available functions with schemas (name, description, parameters)
  2. Send Query: User asks a question requiring external information
  3. Model Decides: LLM determines which tool(s) to call and with what arguments
  4. Execute Tool: Your code runs the actual function
  5. Return Result: Send tool output back to LLM
  6. Generate Answer: LLM synthesizes final response using tool results
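
Steps 4 and 5 are ordinary application code: look up the tool name the model emitted in a registry of local functions, call it, and serialize the result to send back. A minimal sketch of that dispatch follows; the weather stub and registry are hypothetical placeholders.

import json

def get_weather(location: str, units: str = "celsius") -> dict:
    """Hypothetical stub standing in for a real weather API call."""
    return {"temp": 22, "condition": "sunny", "units": units}

TOOL_REGISTRY = {"get_weather": get_weather}   # tool name -> local function

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Step 4: dispatch the model's tool call to local code.
    Step 5: return a JSON string suitable for the tool-result message."""
    args = json.loads(arguments_json)
    result = TOOL_REGISTRY[name](**args)
    return json.dumps(result)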

Tool Schema Design

Tools are described via structured schemas (function signatures):

{
  "name": "get_weather",
  "description": "Get current weather for a location",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name, e.g., 'London' or 'New York, NY'"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Temperature units"
      }
    },
    "required": ["location"]
  }
}

Best practices: Clear descriptions, explicit types, required fields, enums for constraints.
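
Because the parameters block is standard JSON Schema, model-generated arguments can be validated before anything executes. A minimal sketch using the third-party jsonschema package (an assumption; any JSON Schema validator works):

import json

from jsonschema import ValidationError, validate  # pip install jsonschema

WEATHER_PARAMS_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

def parse_weather_args(arguments_json: str) -> dict:
    """Parse the model's argument string and reject it if it violates the schema."""
    args = json.loads(arguments_json)
    try:
        validate(instance=args, schema=WEATHER_PARAMS_SCHEMA)
    except ValidationError as err:
        raise ValueError(f"Invalid tool arguments: {err.message}") from err
    return args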

Videos & Interactive Tutorials

Recommended Courses:

API Documentation & Guides

Major Provider APIs:

Open-Source Frameworks:

Key Research Papers

  • ReAct: Synergizing Reasoning and Acting in Language Models [3]: Introduces the ReAct pattern (reasoning + acting) for agentic workflows.
  • Toolformer: Language Models Can Teach Themselves to Use Tools [4]: Shows LLMs can learn when and how to use tools via self-supervised learning.
  • Gorilla: Large Language Model Connected with Massive APIs [5]: Fine-tunes LLMs for API calling with retrieval-augmented generation.
  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [6]: Creates the ToolBench dataset for evaluating tool use across diverse APIs.

Function Calling Patterns

Single-Step Function Calling

Pattern: LLM analyzes query → selects tool → generates arguments → you execute → LLM synthesizes final answer.

Use cases: Simple lookups, single calculations, straightforward API calls.

Example flow:

User: "What's the weather in Tokyo?"
LLM: [calls get_weather(location="Tokyo")]
System: [executes, returns {temp: 22, condition: "sunny"}]
LLM: "The weather in Tokyo is sunny with a temperature of 22°C."

Parallel Function Calling

Pattern: LLM calls multiple tools simultaneously when they don’t depend on each other.

Use cases: Aggregating information from multiple sources, batch operations.

Example:

User: "Compare weather in Tokyo and Paris"
LLM: [calls get_weather(location="Tokyo"), get_weather(location="Paris")]
System: [executes both in parallel]
LLM: [synthesizes comparison]
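
With the OpenAI API, parallel calls arrive as multiple entries in the assistant message's tool_calls list. Since the calls are independent, your code can actually run them concurrently; a sketch, assuming a hypothetical tool_registry dict mapping tool names to local functions:

import json
from concurrent.futures import ThreadPoolExecutor

def execute_parallel_calls(tool_calls, tool_registry):
    """Run independent tool calls concurrently and return one tool-result
    message per call, preserving the tool_call_id pairing."""
    def run_one(call):
        func = tool_registry[call.function.name]
        return call.id, func(**json.loads(call.function.arguments))

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_one, tool_calls))

    return [
        {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}
        for call_id, result in results
    ]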

Sequential Function Calling

Pattern: Use output from one tool call as input to another.

Use cases: Multi-step workflows where steps depend on previous results.

Example:

User: "Search for Python creator and tell me their age in 2024"
LLM: [calls search("Python creator")]
System: [returns "Guido van Rossum, born 1956"]
LLM: [calls calculate("2024 - 1956")]
System: [returns 68]
LLM: "Guido van Rossum, Python's creator, was 68 years old in 2024."

Tool Design Best Practices

1. Clear Tool Descriptions

# Good: Specific, explicit constraints
{
    "name": "search_arxiv",
    "description": "Search arXiv for academic papers. Returns title, authors, abstract, and PDF link. Use for recent research papers (physics, CS, math). Limit results to 5.",
    "parameters": {...}
}

# Bad: Vague, ambiguous
{
    "name": "search",
    "description": "Search for stuff",
    "parameters": {...}
}

2. Robust Parameter Schemas

  • Use enums for categorical choices
  • Mark fields as required or optional explicitly
  • Provide examples in descriptions
  • Use type constraints (integer, string, array, etc.)
  • Add validation rules (min/max, regex patterns)
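
For example, a hypothetical search_papers tool combining these constraints (enum, integer bounds, a regex pattern, and a required field):

search_papers_schema = {
    "name": "search_papers",
    "description": "Search an academic index. Example query: 'diffusion models'.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "minLength": 3,
                "description": "Free-text search query, e.g. 'graph neural networks'"
            },
            "category": {"type": "string", "enum": ["cs", "math", "physics"]},
            "max_results": {"type": "integer", "minimum": 1, "maximum": 20, "default": 5},
            "year": {
                "type": "string",
                "pattern": "^[0-9]{4}$",
                "description": "Four-digit publication year"
            }
        },
        "required": ["query"]
    }
}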

3. Error Handling & Feedback

def get_weather(location: str) -> dict:
    try:
        result = api.fetch_weather(location)
        return {"success": True, "data": result}
    except LocationNotFound:
        return {
            "success": False, 
            "error": f"Location '{location}' not found. Try 'City, Country' format."
        }
    except APIError as e:
        return {
            "success": False,
            "error": f"Weather API error: {str(e)}. Try again later."
        }

Key: Return structured errors that help the LLM retry with corrections.

4. Tool Composability

Design tools to work together:

  • search → extract → summarize
  • translate → validate → store
  • fetch_data → analyze → visualize
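
A hypothetical sketch of such a composition, with stub implementations standing in for real tools; each stage returns plain data the next can consume:

def search(query: str) -> list[str]:
    """Stub: return candidate URLs for a query."""
    return ["https://example.com/article-1", "https://example.com/article-2"]

def extract(url: str) -> str:
    """Stub: fetch and return the main text of a page."""
    return f"Main text of {url}"

def summarize(text: str) -> str:
    """Stub: condense text to a short summary."""
    return text[:200]

def research_pipeline(topic: str) -> str:
    """Compose search -> extract -> summarize into one workflow."""
    urls = search(topic)
    passages = [extract(url) for url in urls[:3]]
    return summarize("\n\n".join(passages))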

5. Rate Limits & Cost Control

  • Implement retry logic with exponential backoff
  • Set max iterations for agent loops (prevent infinite loops)
  • Track tool usage and costs per session
  • Use caching for repeated identical calls
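
A sketch of the retry and caching points using only the standard library; the cached weather stub is hypothetical and stands in for a real API client:

import functools
import random
import time

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a flaky tool call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:    # out of retries: surface the error
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random())

@functools.lru_cache(maxsize=256)             # cache repeated identical calls
def get_weather_cached(location: str) -> str:
    # Hypothetical underlying call; replace with a real weather API request.
    return call_with_backoff(lambda: f"22C and sunny in {location}")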

Implementation Examples

OpenAI Function Calling (Basic)

import json
import openai

# Define tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., 'London' or 'New York, NY'"
                    },
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "default": "celsius"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Call with tool
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto"  # or "required" or {"type": "function", "function": {"name": "get_weather"}}
)

# Check if tool was called
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
  
    # Execute tool (your code)
    result = get_weather(**function_args)
  
    # Send result back to model
    messages = [
        {"role": "user", "content": "What's the weather in Paris?"},
        response.choices[0].message,
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result)
        }
    ]
    final_response = openai.chat.completions.create(
        model="gpt-4",
        messages=messages
    )
    print(final_response.choices[0].message.content)

LangChain ReAct Agent

from langchain.agents import create_react_agent, AgentExecutor
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from langchain import hub

# Define tools
def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # NOTE: eval() on untrusted input is unsafe; acceptable for a demo only.
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {str(e)}"

def search_wikipedia(query: str) -> str:
    """Search Wikipedia for information."""
    # Implement actual Wikipedia API call
    return f"Wikipedia results for: {query}"

tools = [
    Tool(
        name="Calculator",
        func=calculate,
        description="Useful for mathematical calculations. Input should be a valid Python expression."
    ),
    Tool(
        name="WikipediaSearch",
        func=search_wikipedia,
        description="Search Wikipedia for factual information. Input should be a search query."
    )
]

# Create agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
prompt = hub.pull("hwchase17/react")  # Standard ReAct prompt template
agent = create_react_agent(llm, tools, prompt)

# Execute
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=5  # Prevent infinite loops
)

result = agent_executor.invoke({
    "input": "What is 25% of 300? Also, when was Python created?"
})
print(result["output"])

Anthropic Tool Use (Claude)

import json

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

# Define tools
tools = [
    {
        "name": "get_stock_price",
        "description": "Get current stock price for a company ticker symbol",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g., AAPL, GOOGL)"
                }
            },
            "required": ["ticker"]
        }
    }
]

# Initial request
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the current price of Apple stock?"}]
)

# Check for tool use
if message.stop_reason == "tool_use":
    tool_use_block = next(block for block in message.content if block.type == "tool_use")
    tool_name = tool_use_block.name
    tool_input = tool_use_block.input
  
    # Execute tool
    result = get_stock_price(tool_input["ticker"])
  
    # Continue conversation
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=[
            {"role": "user", "content": "What's the current price of Apple stock?"},
            {"role": "assistant", "content": message.content},
            {
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_use_block.id,
                        "content": json.dumps(result)
                    }
                ]
            }
        ]
    )
    print(response.content[0].text)

Open-Source Model with SmolAgents (HuggingFace)

from smolagents import CodeAgent, DuckDuckGoSearchTool, PythonInterpreterTool, HfApiModel

# Initialize model and tools
model = HfApiModel(model_id="Qwen/Qwen2.5-72B-Instruct")
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), PythonInterpreterTool()],
    model=model,
    max_steps=10
)

# Run agent
result = agent.run(
    "Search for the latest Python release date, then calculate how many days ago it was."
)
print(result)

Evaluation & Best Practices

Key Metrics

  1. Tool Selection Accuracy: Were the right tools chosen?
  2. Argument Correctness: Were tool parameters valid and appropriate?
  3. Response Quality: Does the final answer correctly use tool results?
  4. Latency: Total time including tool execution
  5. Cost: API calls, token usage, tool invocation costs
  6. Error Handling: Recovery from invalid inputs or tool failures
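
A lightweight way to track the first two metrics is to label the expected call for each test query and compare it against what the model actually produced. Exact argument matching is a simplification (equivalent expressions score as misses), and the test-case and prediction formats below are hypothetical:

test_cases = [
    {"query": "What's the weather in Tokyo?",
     "expected_tool": "get_weather",
     "expected_args": {"location": "Tokyo"}},
    {"query": "Calculate 15% of 450",
     "expected_tool": "calculate",
     "expected_args": {"expression": "0.15 * 450"}},
]

def score(predictions):
    """`predictions` is a list of (tool_name, args_dict) in test-case order."""
    tool_hits = args_hits = 0
    for case, (tool, args) in zip(test_cases, predictions):
        tool_hits += tool == case["expected_tool"]
        args_hits += tool == case["expected_tool"] and args == case["expected_args"]
    n = len(test_cases)
    return {"tool_selection_accuracy": tool_hits / n,
            "argument_accuracy": args_hits / n}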

Common Issues & Solutions

Issue               | Cause                  | Solution
Wrong tool selected | Ambiguous descriptions | Improve tool descriptions, add examples
Invalid parameters  | Schema mismatch        | Stricter validation, better type hints, add constraints
Tool failures       | API errors, timeouts   | Return structured errors with retry suggestions
Hallucinated data   | No validation          | Validate tool outputs before returning to LLM
Cost overruns       | Excessive calls        | Set rate limits, cache repeated calls

Debugging Tips

# Log all tool calls for inspection
import functools

def logged_tool(func):
    @functools.wraps(func)  # preserve the wrapped tool's name and docstring
    def wrapper(*args, **kwargs):
        print(f"Calling {func.__name__} with {args}, {kwargs}")
        result = func(*args, **kwargs)
        print(f"Result: {result}")
        return result
    return wrapper

@logged_tool
def get_weather(location: str) -> dict:
    # implementation
    pass

Programming Practice

Core Exercises

1. Single-Step Function Calling

Objective: Implement basic OpenAI/Anthropic function calling.

Tasks:

  • Define 3 tools: get_weather, calculate, search_wikipedia
  • Write tool execution functions with error handling
  • Build request-response loop: query → tool call → execution → final answer
  • Test with 10 diverse queries covering all tools
  • Measure: tool selection accuracy, parameter correctness, response quality

Starter code:

import openai

def get_weather(location: str, units: str = "celsius") -> dict:
    # Implement (mock or real API)
    pass

def calculate(expression: str) -> dict:
    # Implement safe eval
    pass

def search_wikipedia(query: str) -> dict:
    # Implement Wikipedia API call
    pass

# Define tool schemas
tools = [...]  # Fill in schemas

# Test queries
queries = [
    "What's the weather in Tokyo?",
    "Calculate 15% of 450",
    "When was Albert Einstein born?",
    # Add 7 more diverse queries
]

# Implement and evaluate

2. Parallel Function Calling

Objective: Implement parallel tool calls for efficiency.

Tasks:

  • Use OpenAI’s parallel function calling feature
  • Test queries requiring multiple independent tools
  • Compare latency: parallel vs. sequential
  • Handle partial failures (some tools succeed, others fail)

Example queries:

  • “Compare weather in Tokyo and Paris”
  • “Get stock prices for AAPL, GOOGL, and MSFT”

3. Sequential Tool Calling

Objective: Chain tools where output of one feeds into another.

Tasks:

  • Implement 2-3 step workflows
  • Handle intermediate results and error propagation
  • Track reasoning: why each tool was called

Example queries:

  • “Search for Python creator and calculate their age in 2024”
  • “Find the population of France and divide by 10”

4. Custom Tool Design

Objective: Design and integrate a domain-specific tool.

Options:

  • Academic: Search arXiv, fetch paper metadata
  • Financial: Get stock prices, calculate portfolio value
  • Data analysis: Load CSV, compute basic statistics
  • Web: Fetch and parse webpage content

Requirements:

  • Clear tool schema with validation
  • Robust error handling
  • Unit tests for tool function
  • 5 test cases demonstrating value

5. Error Handling

Objective: Handle tool failures gracefully.

Tasks:

  • Implement tools that occasionally fail (simulate API errors, timeouts)
  • Return structured error messages
  • Test LLM’s ability to retry with corrected parameters
  • Measure: recovery success rate

Failure injection:

import random

def unreliable_tool(param: str) -> dict:
    if random.random() < 0.3:  # 30% failure rate
        return {"success": False, "error": "Service temporarily unavailable. Try again."}
    return {"success": True, "result": ...}

Assessment Rubric

Criterion           | Points | Description
Tool Implementation | 0-3    | Correct schemas, error handling, documentation
Function Calling    | 0-3    | Proper integration with LLM API, handles responses
Error Handling      | 0-2    | Graceful failures, informative error messages
Testing             | 0-2    | Diverse test cases, measures accuracy
Code Quality        | 0-2    | Clean code, modular design
Documentation       | 0-1    | Clear explanations, examples

Total: 13 points. Adapt based on project scope.

Additional Resources

Tool Libraries & Examples

Benchmarks

  • ToolBench: 16,000+ real-world APIs for evaluation
  • API-Bank: Tool-augmented LLM benchmark [7]

Note: For agent frameworks (LangGraph, AutoGen, CrewAI), planning strategies, and multi-step reasoning, see Week 11: Agents & Planning.

Web Tools

  • WebArena: Realistic web task environment

Tutorials & Courses

References


  1. OpenAI. (2023). Function Calling and Other API Updates. OpenAI Blog. Retrieved from https://openai.com/blog/function-calling-and-other-api-updates

  2. Anthropic. (2024). Tool Use (Function Calling). Anthropic Documentation. Retrieved from https://docs.anthropic.com/en/docs/build-with-claude/tool-use

  3. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR). arXiv:2210.03629

  4. Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., … & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems, 36. arXiv:2302.04761

  5. Patil, S. G., Zhang, T., Wang, X., & Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint. arXiv:2305.15334

  6. Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., … & Sun, M. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. arXiv preprint. arXiv:2307.16789

  7. Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., & Li, Y. (2023). API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. arXiv:2304.08244