Performance-optimization benchmarks like GSO and SWE evaluate coding agents on real repositories and compare their results against baselines. These leaderboards help track agent progress, but their rankings may partly reflect limitations of the benchmarks themselves.