
Claude vs GPT in Agentic Systems: A Function Calling Comparison

6 Apr 2024

Anthropic's recent announcement of tool use (function calls) caught my attention, specifically their claim that the Claude models can correctly handle 250+ tools with >90% accuracy. I've been working with GPT function calling for a while and noticed that the recall for larger and more complex functions is quite low.

So, I decided to compare GPT and Claude's performance in using different tools for tasks like web scraping and browser automation.

To do your own comparison and run your own scenarios, check out the GitHub repository with all the code.

Evaluation Methodology

To ensure a fair and comprehensive comparison, I created a test scenario that covers a wide range of simulated web scraping and browser automation tasks and tools.

I defined the expected tools/functions to be used and the desired output for the test case:

    {
        "query": "You are an RPA bot. If you're missing a CSS selector, you need to call the find_selector tool by providing the description. To find a specific webpage by description, call the find_page tool. Here is your task: Log in to https://example.com using the provided credentials. Navigate to the 'Products' page and extract the names and prices of all products that are currently in stock. For each product, check if there is a detailed specification PDF available by hovering over the 'Info' button and extracting the link. If a PDF is available, download it and extract the table of technical specifications. Finally, upload the parsed technical specifications to the file server.",
        "expectedTools": [
            "handle_login",
            "navigate_to_url",
            "extract_text",
            "hover_element",
            "extract_attribute",
            "download_and_parse_pdf",
            "extract_specs_table",
            "upload_to_file_server"
        ],
        "expectedLastStep": "upload_to_file_server",
        "parameters": {
            "login_url": "https://example.com/login",
            "submit_selector": "#login-button",
            "username": "testuser",
            "password": "testpassword"
        }
    }

These tasks include:

  • Logging in to websites
  • Navigating to specific pages
  • Extracting data from HTML elements
  • Interacting with UI components (buttons, dropdowns, etc.)
  • Handling pagination and infinite scroll
  • Downloading and parsing files

Then I defined 25 available tools that the agent can use to solve these different tasks in the test scenario. The tool definitions look like this:

    {
        "name": "click_element",
        "description": "Click on an element on the current web page.",
        "input_schema": {
            "type": "object",
            "properties": {
                "selector": {
                    "type": "string",
                    "description": "The CSS selector for the element to click, e.g., #submit-button"
                }
            },
            "required": ["selector"]
        },
        "function": async function (args) {
            // Mocked implementation: log the selector and pretend the click succeeded.
            console.log(`Clicking element ${args.selector}`);
            return { success: true };
        }
    }

I then ran the test case against both Claude and GPT, tracking the tools they actually used and comparing their output to the expected results.
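
For concreteness, here is a minimal sketch of the Claude side of such a harness, assuming the @anthropic-ai/sdk messages API with tools. The loop keeps executing requested tools and feeding the results back until the model stops asking for tools; the accuracy formula (fraction of expected tools actually called, plus a check on the final step) is my assumption of a reasonable metric, not necessarily the exact scoring in the repository. The GPT side works analogously via OpenAI's Chat Completions tools/tool_calls interface.

    import Anthropic from "@anthropic-ai/sdk";

    const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

    // Run one test case against Claude and record which tools it actually calls.
    // `tools` is the array of 25 tool definitions, `testCase` is the JSON shown above.
    async function runTestCase(testCase, tools) {
        const usedTools = [];
        const messages = [{ role: "user", content: testCase.query }];

        for (let step = 0; step < 30; step++) {
            const response = await anthropic.messages.create({
                model: "claude-3-opus-20240229",
                max_tokens: 1024,
                tools: tools.map(({ name, description, input_schema }) => ({ name, description, input_schema })),
                messages,
            });

            const toolUses = response.content.filter((block) => block.type === "tool_use");
            if (response.stop_reason !== "tool_use" || toolUses.length === 0) break;

            // Execute each requested tool (the mocked functions above) and feed the results back.
            const toolResults = [];
            for (const call of toolUses) {
                usedTools.push(call.name);
                const tool = tools.find((t) => t.name === call.name);
                const result = tool ? await tool.function(call.input) : { error: "unknown tool" };
                toolResults.push({ type: "tool_result", tool_use_id: call.id, content: JSON.stringify(result) });
            }
            messages.push({ role: "assistant", content: response.content });
            messages.push({ role: "user", content: toolResults });
        }

        // Score tool selection: fraction of expected tools that were actually called,
        // plus whether the run ended on the expected last step.
        const hits = testCase.expectedTools.filter((t) => usedTools.includes(t)).length;
        return {
            usedTools,
            accuracy: hits / testCase.expectedTools.length,
            finishedCorrectly: usedTools.at(-1) === testCase.expectedLastStep,
        };
    }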

Results and Analysis

After running the test case, I analyzed the performance of both models.

| Metric         | claude-3-opus-20240229 | gpt-4-0125-preview | claude-3-sonnet-20240229 | gpt-3.5-turbo-0125 |
|----------------|------------------------|--------------------|--------------------------|--------------------|
| Avg Tool Calls | 16                     | 13                 | 11                       | 9                  |
| Avg Accuracy   | 100%                   | 81.25%             | 87.5%                    | 79.17%             |
| Avg Costs      | $0.807255              | $0.153540          | $0.119638                | $0.008145          |

Tool Selection Accuracy

  • Claude Opus consistently selected the correct tools based on the user query, achieving an impressive 100% accuracy.
  • Claude Sonnet also performed well, with an accuracy of 87.5%.
  • GPT-4 sometimes struggled with complex tool selection.
  • GPT-3.5 had the lowest accuracy, often selecting inappropriate tools for the given tasks.

Handling Complex Tools

  • Claude Opus and Claude Sonnet successfully parsed and utilized complex tools with nested parameters, though Opus's high performance comes at a premium price.
  • GPT-4 had some difficulty dealing with complex tool parameters, occasionally failing to properly interpret and use nested objects.
  • GPT-3.5 struggled the most with complex tools, often misinterpreting or misusing nested parameters.

Cost Comparison

  • Claude Opus used the most tokens per completed test case and was by far the most expensive per run (see the cost sketch below).
  • Claude Sonnet offers a good balance between performance and cost.
  • GPT-4 costs roughly the same per run as Claude Sonnet.
  • GPT-3.5 is by far the cheapest model.
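
The per-run costs in the table above come directly from the token usage each API reports. Here is a rough sketch of the calculation, assuming the public list prices per million tokens at the time of writing (treat these as assumptions and verify against the current pricing pages before reusing them):

    // Per-run cost from reported token usage. Prices are USD per million tokens
    // (list prices at the time of writing -- an assumption, not a source of truth).
    const PRICING = {
        "claude-3-opus-20240229":   { input: 15.0, output: 75.0 },
        "claude-3-sonnet-20240229": { input: 3.0,  output: 15.0 },
        "gpt-4-0125-preview":       { input: 10.0, output: 30.0 },
        "gpt-3.5-turbo-0125":       { input: 0.5,  output: 1.5 },
    };

    // Anthropic responses expose usage.input_tokens / usage.output_tokens,
    // OpenAI responses expose usage.prompt_tokens / usage.completion_tokens.
    function runCost(model, inputTokens, outputTokens) {
        const price = PRICING[model];
        return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
    }

    // Example: runCost("claude-3-sonnet-20240229", 25_000, 3_000) ≈ $0.12 per run.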

Robustness and Production-Readiness

During the evaluation, Claude calls occasionally resulted in API errors, which makes it less robust and production-ready than the GPT models. This is understandable, as Claude's tool use feature is still in public beta. In most cases, an automatic retry solved the issue.
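
A simple retry wrapper around each model call is enough to smooth over most of these transient errors. A minimal sketch (not the exact harness code), retrying rate-limit, overloaded, and 5xx responses with exponential backoff:

    // Retry transient API errors (429 rate limit, 529 overloaded, 5xx) with exponential backoff.
    async function withRetry(fn, retries = 3, baseDelayMs = 1000) {
        for (let attempt = 0; ; attempt++) {
            try {
                return await fn();
            } catch (err) {
                const status = err?.status ?? err?.response?.status;
                const retryable = status === 429 || status === 529 || (status >= 500 && status < 600);
                if (!retryable || attempt >= retries) throw err;
                await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
            }
        }
    }

    // Usage: const response = await withRetry(() => anthropic.messages.create({ ... }));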

Conclusion

This evaluation shows Claude outperforming GPT at tool use in agentic systems. Claude's high accuracy in selecting and utilizing tools was impressive and is a big step toward reliable, production-ready agent systems.

  • AI agents still work best for simple, well-constrained tasks.
  • To create a successful agent, you need to provide it with good tools. The LLM can then figure out the correct sequence of tool calls itself, which feels like a promising direction.
  • Tool use is still quite slow and often very expensive. I've spent around $50 just on experimenting with Claude for one day. Imagine what the testing would cost for a production-scale system. Making the unit economics work is difficult but will improve as LLM costs continue to drop.

As both Claude and GPT continue to evolve, it will be fascinating to observe how their agent capabilities advance and compare in the future.