Comparative Analysis of Quantization Accuracy: Gemma vs. Qwen Models
A community-driven empirical evaluation exploring the impact of various quantization levels on the accuracy of Gemma and Qwen models, specifically focusing on arithmetic precision and factual recall.
Overview of the Benchmark Methodology
Evaluations of quantized Large Language Models (LLMs) often rely on Kullback–Leibler Divergence (KLD). However, KLD values are hard to interpret for practical deployment and do not lend themselves to cross-model comparisons, such as a 9B-parameter model at 4-bit quantization (Q4) against a 4B-parameter model at 8-bit quantization (Q8).
To address this gap, a series of contrived tests was run to measure actual output accuracy across different quantization schemes for the Gemma and Qwen model families.
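For context on the metric mentioned above, the following is a minimal sketch of how KLD between a reference model's next-token distribution and a quantized model's distribution could be computed. The probability values are made up for illustration and do not come from the benchmark; the function name `kl_divergence` and the `eps` floor are assumptions of this sketch.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(P || Q) in nats for two discrete distributions.

    p: probabilities from the reference (unquantized) model.
    q: probabilities from the quantized model over the same tokens.
    eps: small floor (an assumption of this sketch) to avoid dividing by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same tokens")
    # Terms with p_i == 0 contribute nothing to the sum by convention.
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token probabilities, for illustration only.
reference = [0.70, 0.20, 0.10]
quantized = [0.60, 0.25, 0.15]

kld = kl_divergence(reference, quantized)
print(f"KLD = {kld:.4f} nats")
```

A value like this says how far the quantized distribution has drifted from the reference, but not how often the model's final answer changes, which is the gap the accuracy tests aim to fill.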
Test Suite Details
Test 1: Arithmetic Precision
The first benchmark focused on the models' ability to add large integers. The test consisted of 1,000 questions designed to evaluate numerical precision under quantization. To keep the outputs easy to score, a strict prompt constrained each answer to a single number without commas or underscores.
Sample Prompt: "Print only one number as the answer to the following question. Print nothing else, please. Do not use commas or underscores. It is very important. 998604052310776342 + 249349834805792420 = ?"
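A test of this kind could be generated and scored roughly as follows. This is a sketch under stated assumptions, not the original harness: the 18-digit operand length is inferred from the sample prompt, and the exact-match scoring rule and the helper names `make_question`, `is_correct`, and `accuracy` are hypothetical.

```python
import random

# Prompt wording taken from the sample prompt above.
PROMPT_TEMPLATE = (
    "Print only one number as the answer to the following question. "
    "Print nothing else, please. Do not use commas or underscores. "
    "It is very important. {a} + {b} = ?"
)


def make_question(rng, digits=18):
    """Return (prompt, expected_sum) for two random integers.

    digits=18 is an assumption based on the operands in the sample prompt.
    """
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return PROMPT_TEMPLATE.format(a=a, b=b), a + b


def is_correct(reply, expected):
    """Exact match after trimming whitespace (assumed scoring rule)."""
    return reply.strip() == str(expected)


def accuracy(replies, expected_values):
    """Fraction of replies that exactly match their expected sums."""
    if not replies or len(replies) != len(expected_values):
        raise ValueError("need one reply per question")
    hits = sum(is_correct(r, e) for r, e in zip(replies, expected_values))
    return hits / len(replies)


rng = random.Random(0)  # fixed seed so every model sees the same questions
questions = [make_question(rng) for _ in range(1000)]
```

Python integers are arbitrary precision, so the expected sums are exact regardless of operand size. Under the assumed exact-match rule, a reply that includes commas or extra words counts as wrong.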
Test 2: Factual Recall (Presidents)
The second benchmark evaluated the models' knowledge retrieval capabilities through a set of 46 questions regarding presidents, testing how quantization affects the retention of specific factual data.
Limitations of the Analysis
Note: The provided source material is an excerpt and does not include the final result sets or the specific performance percentages for each quantization level. Consequently, the comparative conclusions between the Gemma and Qwen architectures cannot be fully detailed in this report.