Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Researchers propose a new framework for evaluating the adversarial robustness of Large Language Models (LLMs) by accounting for the computational cost of attacks, arguing that traditional Attack Success Rate (ASR) metrics fail to capture the true effort required to compromise a model.
The Limitation of Fixed-Budget Evaluations
Current methodologies for evaluating the adversarial robustness of LLMs rely predominantly on ASR measured under fixed query budgets. This approach implicitly assumes that all adversarial attacks are equally costly. In practice, the computational resources required to execute different attack strategies can vary by several orders of magnitude.
By focusing solely on ASR within a fixed budget, evaluations may obscure the actual effort an attacker must expend to successfully "jailbreak" a model. They leave open whether the payoff justifies the computational effort for a potential adversary, which can lead to misleading conclusions about a model's true security posture.
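To make the limitation concrete, here is a minimal Python sketch of a fixed-budget ASR calculation. It is an illustration only, not code from the paper: the `AttackAttempt` record, the `flops_per_query` field, and all numbers are hypothetical and were made up for this example.

```python
from dataclasses import dataclass


@dataclass
class AttackAttempt:
    """One attack run against one prompt (hypothetical record format)."""
    succeeded: bool          # did the attack elicit the target behavior?
    queries_used: int        # queries sent to the target model before stopping
    flops_per_query: float   # assumed compute cost of producing one query


def fixed_budget_asr(attempts, query_budget):
    """Fraction of attempts that succeed within `query_budget` queries."""
    if not attempts:
        raise ValueError("need at least one attempt")
    hits = sum(
        1 for a in attempts
        if a.succeeded and a.queries_used <= query_budget
    )
    return hits / len(attempts)


def total_compute(attempts):
    """Total compute spent across all attempts, in the same unit as flops_per_query."""
    return sum(a.queries_used * a.flops_per_query for a in attempts)


# Made-up outcomes: identical success pattern and query counts,
# but the second attack is assumed to cost 1000x more per query.
outcomes = [(True, 10), (False, 100), (True, 50), (False, 100)]
cheap_attack = [AttackAttempt(s, q, 1e9) for s, q in outcomes]
costly_attack = [AttackAttempt(s, q, 1e12) for s, q in outcomes]

print(fixed_budget_asr(cheap_attack, query_budget=100))
print(fixed_budget_asr(costly_attack, query_budget=100))
print(total_compute(costly_attack) / total_compute(cheap_attack))
```

Under these assumed numbers, both attacks report the same ASR at a budget of 100 queries, even though one consumes a thousand times more compute. That difference is exactly what a query-count budget cannot show.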
Introducing Compute-Aware Robustness
To address this discrepancy, the authors (Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, and Colin Raffel) propose a compute-aware evaluation framework. This approach shifts the focus from a binary success/failure metric within a rigid budget to an analysis of the computational expense of successful adversarial attacks.
Integrating compute costs into the evaluation is intended to give a more accurate picture of a model's resilience. Developers can then judge whether the cost of a successful attack is a sufficient deterrent or whether the model remains vulnerable to cheap, efficient exploits.
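One way to picture a compute-aware view is to report ASR as a function of compute spent, rather than at a single query budget. The Python sketch below is an assumption-laden illustration and should not be read as the authors' method: the function names, the choice of "compute until first success" as the cost measure, and all numbers are hypothetical.

```python
def asr_at_compute(costs_to_success, compute_budget):
    """ASR when each attack run may spend at most `compute_budget`.

    `costs_to_success` holds, per prompt, the compute (e.g. FLOPs) spent
    until the first success, or None if the attack never succeeded.
    """
    if not costs_to_success:
        raise ValueError("need at least one prompt")
    hits = sum(
        1 for c in costs_to_success
        if c is not None and c <= compute_budget
    )
    return hits / len(costs_to_success)


def compute_to_reach(costs_to_success, target_asr):
    """Smallest observed per-prompt cost at which ASR reaches `target_asr`.

    Returns None if the target is never reached in the recorded data.
    """
    observed = sorted(c for c in costs_to_success if c is not None)
    for budget in observed:
        if asr_at_compute(costs_to_success, budget) >= target_asr:
            return budget
    return None


# Made-up data: both attacks eventually succeed on 2 of 4 prompts,
# but attack B is assumed to need about 1000x more compute to get there.
attack_a = [1e10, 5e10, None, None]
attack_b = [1e13, 5e13, None, None]

for name, costs in [("A", attack_a), ("B", attack_b)]:
    print(name, asr_at_compute(costs, 1e11), compute_to_reach(costs, 0.5))
```

With these illustrative numbers, the two attacks look identical once compute is unlimited, yet at a budget of 1e11 only attack A succeeds at all. The compute needed to reach a target ASR is one candidate summary of how much effort an attack demands.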