Since Anthropic launched the "Computer Use" feature for Claude in October, the ability of AI agents to operate computers the way humans do has drawn widespread attention. Recently, the Show Lab at the National University of Singapore published a new study that offers a comprehensive look at what the current generation of graphical user interface (GUI) agents can actually be expected to do.
As the first frontier model able to interact with a device through the same interfaces humans use, Claude operates solely from desktop screenshots and acts by issuing keyboard and mouse events. The feature promises to let users automate tasks with simple instructions, without needing access to application APIs.
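For context, the interaction model looks roughly like the following. This is a minimal sketch based on the beta API shape Anthropic documented at launch (the computer_20241022 tool and the computer-use-2024-10-22 beta flag); treat the exact parameter names as assumptions that may have changed since.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Ask Claude to act on a virtual 1024x768 display. The model cannot touch the
# machine itself: it only returns tool_use blocks describing the next action.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
    }],
    messages=[{"role": "user", "content": "Open a browser and search for GUI agents."}],
)

# The caller's loop executes each requested action (screenshot, left_click,
# type, key, ...) locally and sends the result back as a tool_result message.
for block in response.content:
    if block.type == "tool_use":
        print(block.input)  # e.g. {"action": "screenshot"}
```

The key design point is that the model never executes anything itself: the host program performs each requested action and returns a fresh screenshot, closing the observe-act loop.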
Researchers tested Claude across a range of tasks, including web search, workflow completion, office productivity, and video games. In the web search tasks, Claude had to browse and interact with websites, for example searching for and purchasing products or subscribing to a news service. The workflow tasks involved coordinating across multiple applications, such as extracting information from a website and inserting it into a spreadsheet. The office productivity tasks assessed the agent's ability to perform common operations like formatting documents, sending emails, and creating presentations. The video game tasks evaluated its ability to carry out multi-step tasks that require understanding game logic and planning actions.
The testing evaluated the model along three dimensions: planning, execution, and evaluation. First, the model must produce a coherent plan for completing the task. Next, it needs to translate each step of that plan into concrete actions, such as opening a browser, clicking elements, and entering text. Finally, the evaluation component determines whether the model can assess its own progress and success as it works. If the model makes an error, it should be able to correct it; if the task cannot be completed, it should give a reasonable explanation. The researchers built a framework around these three components and had human reviewers rate every test.
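To make the three components concrete, here is an illustrative skeleton of a plan-execute-evaluate agent loop. This is not the study's actual code; every name here (GUIAgent, make_plan, next_action, capture_screen, perform, and so on) is a hypothetical placeholder for whatever the agent backend and host environment provide.

```python
from typing import Protocol

class GUIAgent(Protocol):
    """Hypothetical interface standing in for whatever model drives the GUI."""
    def make_plan(self, task: str) -> list[str]: ...
    def next_action(self, step: str, screenshot: bytes) -> dict: ...
    def evaluate(self, step: str, screenshot: bytes) -> str: ...  # "done" | "retry" | "failed"
    def explain_failure(self, step: str) -> str: ...

def run_task(agent: GUIAgent, task: str, max_tries: int = 10) -> str:
    # capture_screen() and perform() are placeholders for OS-level
    # screenshot and keyboard/mouse helpers, not real library calls.
    plan = agent.make_plan(task)                       # 1. planning: decompose the task
    for step in plan:
        for _ in range(max_tries):
            shot = capture_screen()                    # observe the current GUI state
            perform(agent.next_action(step, shot))     # 2. execution: click, type, scroll...
            verdict = agent.evaluate(step, capture_screen())  # 3. evaluation: did it work?
            if verdict == "done":
                break
            if verdict == "failed":
                return agent.explain_failure(step)     # justify why the task cannot finish
        else:
            return agent.explain_failure(step)         # gave up after max_tries attempts
    return "task completed"
```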
Overall, Claude performed well on complex tasks. It was able to reason about and plan the multiple steps a task required, carry out the corresponding actions, and evaluate its progress at each step. It could also coordinate across applications, for example copying information from a webpage and pasting it into a spreadsheet. In some cases it even rechecked its results at the end of a task to make sure everything matched the objective. The model's reasoning traces suggested a general understanding of how different tools and applications work and an ability to coordinate effectively between them.
However, Claude also made small mistakes that an ordinary user would easily avoid. In one task, for example, the model failed to complete a subscription because it never scrolled the page to find the relevant button. In other cases it failed at very simple, straightforward tasks, such as selecting and replacing text or converting bullet points to a numbered list. Worse, the model often either did not recognize its mistakes or made incorrect assumptions about why it had fallen short of the goal.
The researchers note that the model's misjudgment of its own progress highlights the "insufficiency of the model's self-assessment mechanism" and suggest that "fully addressing this issue may still require improvements to the GUI agent framework, such as an internal rigorous critique module." Judging from these results, GUI agents cannot yet replicate all the basic nuances of human computer use.
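The paper does not spell out what such a critique module would look like; one common pattern, sketched below under that assumption, is a second "critic" pass that reviews each proposed action against the goal before it is executed. All names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Review:
    verdict: str        # "approve" or "revise"
    feedback: str = ""  # the critic's objection, if any

# Hypothetical sketch: a critic vets each proposed action before execution.
# agent, critic, and perform() are placeholders, not a real framework's API.
def critiqued_step(agent, critic, goal: str, screenshot: bytes) -> None:
    proposal = agent.propose_action(goal, screenshot)
    review: Review = critic.review(goal, screenshot, proposal)
    if review.verdict == "approve":
        perform(proposal)
    else:
        # Feed the objection back so the agent can correct itself before acting
        perform(agent.revise_action(goal, screenshot, review.feedback))
```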
The promise of automating tasks from plain text descriptions is highly attractive to businesses. For now, however, the technology is not ready for large-scale deployment. The model's behavior is unstable and can produce unpredictable outcomes, with potentially serious consequences in sensitive applications. And driving interfaces designed for humans is rarely the fastest way to complete a task that an API could handle directly.
There is also still much to learn about the security risks of granting large language models (LLMs) control over mice and keyboards. For example, one study has shown that web agents are highly susceptible to adversarial attacks that humans could easily overlook.
Nevertheless, tools like Claude's "Computer Use" are still valuable. They let product teams explore ideas and iterate on different solutions without spending the time and money to build new features or services just to automate a task. Once a viable solution is found, a team can focus on developing the code and components needed to deliver it efficiently and reliably. Large-scale task automation will still require robust infrastructure, including APIs and microservices that connect securely and operate at scale. As the technology continues to advance, there is good reason to believe GUI agents will play an important role in many more areas.