EvoPolicyGym introduces a controlled evaluation framework for Autonomous Policy Evolution, focusing on how agents improve executable policies through iterative feedback. Unlike traditional benchmarks, it uses a fixed interaction budget to isolate policy editing from general software engineering progress. The benchmark is instantiated in compact interactive environments to assess the efficiency of harness-model agents.
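To make the idea of a fixed interaction budget concrete, here is a minimal Python sketch of a budgeted policy-editing loop. It is not EvoPolicyGym's API or its environments: the toy environment, the threshold policy, and the names `run_episode`, `make_policy`, and `evolve` are all hypothetical, chosen only to illustrate the concept.

```python
import random


def make_policy(threshold):
    """Hypothetical executable policy: act 1 when the observation reaches the threshold."""
    return lambda obs: 1 if obs >= threshold else 0


def run_episode(policy, steps, rng):
    """Toy environment (an assumption, not from the benchmark): reward 1 per step
    when the action matches a hidden rule, here 'obs >= 5'."""
    reward = 0
    for _ in range(steps):
        obs = rng.randint(0, 9)
        target = 1 if obs >= 5 else 0
        reward += int(policy(obs) == target)
    return reward


def evolve(budget, steps_per_episode=20, seed=0):
    """Edit the policy iteratively, never spending more than `budget` environment steps."""
    rng = random.Random(seed)
    best_threshold, best_score = None, None
    candidate = 0  # initial policy parameter
    used = 0
    # Each evaluation costs `steps_per_episode` interactions; stop before overspending.
    while used + steps_per_episode <= budget:
        score = run_episode(make_policy(candidate), steps_per_episode, rng)
        used += steps_per_episode
        # Feedback step: keep the edit only if it did at least as well as the best so far.
        if best_score is None or score >= best_score:
            best_threshold, best_score = candidate, score
        # Propose the next edit as a small change to the current best policy.
        candidate = min(9, max(0, best_threshold + rng.choice([-1, 1])))
    return best_threshold, best_score, used


if __name__ == "__main__":
    threshold, score, used = evolve(budget=200)
    print(f"threshold={threshold} score={score} interactions_used={used}")
```

Because every candidate is charged against the same budget, two agents can be compared by the policy quality they reach per interaction rather than by how much unrelated engineering work they do.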