Claude 3.5 Sonnet: Anthropic's Upgraded Model for Agentic Coding and Tool Use (Extended Abstract)
The updated Claude 3.5 Sonnet shows improvements in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models, including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. On TAU-bench, an agentic tool use task, it improves performance from 62.6% to 69.2% in the retail domain and from 36.1% to 46.0% in the more challenging airline domain.
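The size of those gains can be checked with simple arithmetic. The sketch below uses only the scores quoted above; the helper name `gains` and the choice to report both absolute and relative change are illustrative assumptions, not something Anthropic published.

```python
# Benchmark scores quoted in the article, as (before, after) percentages.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.1, 46.0),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Hypothetical summary table: benchmark name -> (points gained, relative gain).
report = {name: gains(before, after) for name, (before, after) in scores.items()}

for name, (points, percent) in report.items():
    print(f"{name}: +{points} points ({percent}% relative)")
```

Read this way, the SWE-bench Verified jump is the largest in both absolute and relative terms, and the airline domain improves more than the retail domain despite starting from a lower score.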
The new Claude 3.5 Sonnet model has a number of improvements and is the same price as its predecessor, Anthropic says.
Also, this version of Claude has apparently been told to steer clear of social media, with “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”
There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.
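A toy simulation can illustrate why a screenshot-based view misses brief events. This is only a sketch under stated assumptions: the event names, their timings, and the two-second screenshot interval are all hypothetical and do not describe how Claude actually samples the screen.

```python
def visible_events(events, screenshot_times):
    """Return the names of events that are on screen during at least one screenshot.

    events: list of (name, start_seconds, end_seconds) tuples.
    screenshot_times: list of timestamps (seconds) at which a screenshot is taken.
    """
    seen = []
    for name, start, end in events:
        # An event is captured only if some screenshot falls inside its lifetime.
        if any(start <= t <= end for t in screenshot_times):
            seen.append(name)
    return seen


# Hypothetical timeline of on-screen events, in seconds.
events = [
    ("dialog box", 0.0, 10.0),
    ("toast notification", 2.2, 3.8),  # appears and disappears between screenshots
    ("progress bar", 3.5, 6.5),
]

# Assumed sampling: one screenshot every two seconds.
screenshots = [0.0, 2.0, 4.0, 6.0]

seen = visible_events(events, screenshots)
missed = [name for name, _, _ in events if name not in seen]

print("seen:", seen)
print("missed:", missed)
```

In this made-up timeline the notification lives entirely between two screenshots, so it never appears in any frame. A continuous video stream would not have that blind spot.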
Anthropic cautions that computer use is still experimental and can lead to errors. The company says, “We’re releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.”
Microsoft’s Copilot Vision feature and OpenAI’s desktop app for ChatGPT have shown what their AI tools can do based on seeing your computer’s screen, and Google has similar capabilities in its Gemini app on Android phones. But they haven’t gone to the next step of widely releasing tools ready to click around and perform tasks for you like this. Rabbit promised similar capabilities for its R1, which it has yet to deliver.
A new feature in Anthropic’s Claude model allows it to control a computer by looking at the screen. The new feature, called “computer use,” is now available on the API and allows developers to direct Claude to work on a computer like a human does.
It took a while for people to adjust to the idea of chatbots that seem to have minds of their own. The next adjustment may be trusting artificial intelligence to take over our computers.
“I think we’re going to enter into a new era where a model can use all of the tools that you use as a person to get tasks done,” says Jared Kaplan, chief science officer at Anthropic and an associate professor at Johns Hopkins University.
In the demo Kaplan showed WIRED, Claude was asked to help plan an outing to see the sunrise at the Golden Gate Bridge with a friend. In response to the prompt, Claude opened the Chrome web browser, looked up relevant information on Google, including the ideal viewing spot and the optimal time to be there, then used a calendar app to create an event to share with a friend. (It did not include further instructions, such as what route to take to get there in the least amount of time.)
In a second demo, Claude was asked to build a simple website to promote itself. In a surreal moment, the model entered a text prompt into its own web interface to generate the necessary code. It then used Visual Studio Code, a popular code editor developed by Microsoft, to write a simple website, and opened a text terminal to spin up a simple web server to test the site. The website offered a decent, 1990s-themed landing page for the AI model. After a user asked it to fix a problem on the website, the model returned to the editor and deleted the offending portion of the code.
Mike Krieger, chief product officer at Anthropic, says the company hopes that so-called AI agents will automate routine office tasks and free people up to be more productive in other areas. “What would you do if you got rid of a bunch of hours of copy and pasting or whatever you end up doing?” he says. “I’d go and play more guitar.”
Anthropic is making the agentic abilities available through its application programming interface (API) for its most powerful multimodal large language model, Claude 3.5 Sonnet, from today. The company also announced a new and improved version of a smaller model, Claude 3.5 Haiku, today.