Claw-SWE-Bench: Standardizing the Evaluation of OpenClaw-style Agent Harnesses for Software Engineering

Researchers introduce Claw-SWE-Bench, a specialized benchmark and adapter protocol designed to bridge the gap between general-purpose autonomous agents and the rigorous evaluation requirements of SWE-bench, enabling fair comparisons of heterogeneous agent harnesses.

The Challenge of Evaluating General-Purpose Agents

General-purpose agents such as OpenClaw are increasingly deployed as autonomous tool users capable of complex reasoning and execution. Measuring their software engineering proficiency with existing frameworks like SWE-bench is difficult, however, because of an architectural mismatch: a generic agent does not inherently follow the strict operational contract that SWE-bench scoring requires. That contract includes maintaining a clean Docker workspace, generating a precise patch, and emitting predictions in a specific format.
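
To make the "specific prediction format" concrete, here is a minimal sketch of a prediction record of the kind the SWE-bench evaluation harness consumes. The three field names follow the commonly documented SWE-bench convention; the instance ID, model name, and patch text are made-up placeholders, and the helper names are hypothetical.

```python
import json


def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one SWE-bench-style prediction record (one per task instance)."""
    return {
        "instance_id": instance_id,        # which benchmark task this answers
        "model_name_or_path": model_name,  # label for the agent/model under test
        "model_patch": patch,              # unified diff produced by the agent
    }


def to_jsonl(predictions: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(p) + "\n" for p in predictions)


# Placeholder values for illustration only.
example = make_prediction(
    "example__repo-123",
    "my-claw",
    "diff --git a/f.py b/f.py\n",
)
jsonl_text = to_jsonl([example])
```

A generic agent that edits files but never emits a record of this shape cannot be scored, which is the gap an adapter has to close.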

Introducing Claw-SWE-Bench

To resolve this mismatch, the authors propose Claw-SWE-Bench, which serves as both a multilingual SWE-bench-style benchmark and an adapter protocol. By implementing the protocol, developers can wrap heterogeneous agent harnesses, referred to as "claws", behind a standardized interface. Different agent architectures can then be evaluated under identical conditions, which removes variables that could otherwise skew performance results.
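
The source does not specify the adapter protocol itself, so the following is only an illustrative sketch of what "wrapping a claw behind a standardized interface" could look like. Every name here (Task, ClawAdapter, solve, NoOpClaw, collect_predictions) is hypothetical and not taken from Claw-SWE-Bench.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Task:
    """Hypothetical task description handed to every claw."""
    instance_id: str
    problem_statement: str
    workspace_dir: str  # path to the checked-out repository


class ClawAdapter(Protocol):
    """Hypothetical interface: any harness exposing these members can be evaluated."""
    name: str

    def solve(self, task: Task) -> str:
        """Return a unified diff (possibly empty) for the given task."""
        ...


class NoOpClaw:
    """Trivial adapter that never changes anything; useful as a baseline."""
    name = "noop-claw"

    def solve(self, task: Task) -> str:
        return ""


def collect_predictions(adapter: ClawAdapter, tasks: List[Task]) -> List[dict]:
    """Run one adapter over all tasks and emit SWE-bench-style records."""
    return [
        {
            "instance_id": t.instance_id,
            "model_name_or_path": adapter.name,
            "model_patch": adapter.solve(t),
        }
        for t in tasks
    ]


tasks = [Task("example__repo-1", "Fix the off-by-one error.", "/tmp/repo")]
records = collect_predictions(NoOpClaw(), tasks)
```

The design point is that the evaluation loop depends only on the interface, so a new harness needs a new adapter class but no change to the scoring side.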

Key Features of the Framework

The Claw-SWE-Bench framework introduces several critical constraints to ensure a fair and reproducible evaluation environment:

  • Fixed Prompting: Standardizes the input to ensure that performance gains are attributed to the agent's capabilities rather than prompt engineering.
  • Runtime Budgeting: Imposes a strict runtime budget on every run, so that agents are compared under the same resource limit.
  • Standardized Workspace: Ensures a consistent environment for tool execution and patch application.
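
The first two constraints above can be sketched as follows, under stated assumptions: the prompt template text, the function names, and the budget values are all hypothetical, and the agent is modeled as an external command. The sketch uses only the standard library's subprocess timeout, which is one simple way to enforce a wall-clock budget, not necessarily how Claw-SWE-Bench does it.

```python
import subprocess
import sys
from typing import List, Optional, Tuple

# Hypothetical fixed template: every claw receives the same wording.
PROMPT_TEMPLATE = "Resolve the following issue in the repository:\n\n{problem_statement}\n"


def render_prompt(problem_statement: str) -> str:
    """Fill the fixed template; only the task text varies between runs."""
    return PROMPT_TEMPLATE.format(problem_statement=problem_statement)


def run_with_budget(
    cmd: List[str], budget_seconds: float, cwd: Optional[str] = None
) -> Tuple[bool, str]:
    """Run an agent command under a wall-clock budget.

    Returns (finished_in_time, stdout). On timeout the child process is
    killed and the run counts as not finished.
    """
    try:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=budget_seconds
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, result.stdout


# Stand-in "agents": one finishes quickly, one exceeds its budget.
fast_ok, fast_out = run_with_budget([sys.executable, "-c", "print('done')"], 30.0)
slow_ok, _ = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(5)"], 0.5
)
```

A standardized workspace would additionally require resetting the repository to a known state before each run, which is omitted here.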

Technical Significance

With a unified adapter protocol, Claw-SWE-Bench lets researchers benchmark claws across diverse coding tasks without rewriting the core SWE-bench scoring logic for every new agent architecture. This supports a more transparent comparison of how different autonomous agents handle real-world software engineering tasks.
