The PACE framework investigates whether scores on expensive, time-consuming agentic benchmarks such as SWE-bench and GAIA can be predicted from cheaper, non-agentic LLM benchmarks. By measuring individual capabilities such as reasoning and code generation, the researchers aim to build an efficient proxy for agentic capability. The goal is to reduce the infrastructure cost and evaluation time that full-scale agent evaluations require.
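As a rough illustration of the proxy idea, the sketch below fits a one-variable least-squares line from a composite of capability scores to an agentic score. This is not the PACE method, which the summary does not specify: the averaging step, the names `fit_line` and `predict_agentic`, and every number are hypothetical placeholders rather than real benchmark results.

```python
# Hypothetical sketch only: NOT the PACE method, and all scores are made-up placeholders.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic (reasoning, code generation) scores for four imaginary models.
capability_scores = [(30.0, 50.0), (45.0, 55.0), (50.0, 70.0), (65.0, 75.0)]
# Synthetic agentic benchmark scores for the same four models.
agentic_scores = [10.0, 15.0, 20.0, 25.0]

# Assumption: collapse the capabilities into one composite by simple averaging.
composite = [sum(scores) / len(scores) for scores in capability_scores]
slope, intercept = fit_line(composite, agentic_scores)

def predict_agentic(reasoning, code_generation):
    """Predict an agentic score from two cheap capability scores."""
    return slope * ((reasoning + code_generation) / 2) + intercept

if __name__ == "__main__":
    print(predict_agentic(80.0, 80.0))
```

In practice a proxy like this would only be useful if its predictions held up on models excluded from the fit, so a held-out comparison against real agentic runs would be the natural next check.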
huggingface/daily-papers