Google's Gemma 3 QAT Language Model Can Run Locally on Consumer GPUs

2025-04-30

Google has launched the Gemma 3 QAT series, a set of quantized versions of its open-weight Gemma 3 language models. The models use Quantization-Aware Training (QAT) to maintain high accuracy while the weights are reduced from 16 bits to 4 bits.

All four sizes of the Gemma 3 models now come in QAT versions: 1B, 4B, 12B, and 27B parameters. The quantized versions require roughly 25% of the VRAM needed by their 16-bit counterparts. Google claims that the 27B model can run on a desktop NVIDIA RTX 3090 GPU with 24GB of VRAM, while the 12B model can operate on an NVIDIA RTX 4060 Laptop GPU with 8GB of VRAM. The smaller models can run on mobile phones and other edge devices. By applying quantization-aware training, Google reduced the accuracy loss caused by quantization by up to 54%. According to Google,

While top-tier performance on high-end hardware is ideal for cloud deployments and research, we’ve heard you loud and clear: you want the power of Gemma 3 on the hardware you already have. We’re committed to making powerful AI accessible, which means achieving efficient performance on consumer-grade GPUs found in desktops, laptops, and even phones... Bringing state-of-the-art AI performance to accessible hardware is a key step toward democratizing AI development... We can’t wait to see what you build running Gemma 3 locally!
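As a rough sanity check on the VRAM figures above, a back-of-the-envelope estimate of weight storage alone (ignoring KV cache and activation overhead) lines up with the hardware Google names and with the "13 GB of weights" mentioned by users below. The following Python sketch is illustrative only, not an official sizing guide:

```python
# Back-of-the-envelope weight-memory estimate: parameter count times bits per
# weight. This ignores KV cache, activations, and quantization scale overhead.
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9

for params, name in [(27e9, "27B"), (12e9, "12B")]:
    print(f"{name}: {weight_memory_gb(params, 16):.1f} GB at 16-bit "
          f"-> {weight_memory_gb(params, 4):.1f} GB at 4-bit")
# 27B: 54.0 GB at 16-bit -> 13.5 GB at 4-bit
# 12B: 24.0 GB at 16-bit -> 6.0 GB at 4-bit
```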

Google introduced the Gemma series in 2024, followed by Gemma 2. By incorporating design elements from Google's flagship Gemini LLM, these open-weight models deliver performance competitive with models twice their size. According to Google, the latest iteration, Gemma 3, represents a "top-tier open compact model" thanks to its performance improvements. Gemma 3 also adds vision capabilities to all variants except the 1B model.

While the non-quantized Gemma 3 models show impressive performance for their size, they still demand significant GPU resources. For instance, the non-quantized 12B model requires an RTX 5090 with 32GB of VRAM. To quantize the model weights without sacrificing performance, Google employed QAT. This technique simulates inference-time quantization during training rather than simply quantizing the model post-training.
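The sketch below illustrates the general fake-quantization idea behind QAT: during training, the forward pass sees the rounding error of 4-bit weights, while gradients flow through unchanged via a straight-through estimator. It is a minimal PyTorch illustration of the technique, not Google's actual Gemma 3 training code:

```python
# Minimal sketch of quantization-aware training (QAT): weights are
# "fake-quantized" to int4 in the forward pass, so the model learns to be
# robust to quantization error, while gradients pass through a
# straight-through estimator. Illustrative only.
import torch
import torch.nn as nn


def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric 4-bit quantization of a weight tensor."""
    qmax = 7  # symmetric int4 range is [-8, 7]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    # Straight-through estimator: forward uses quantized values,
    # backward treats quantization as the identity function.
    return w + (w_q - w).detach()


class QATLinear(nn.Module):
    """Linear layer whose weights see int4 quantization error during training."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quantize_int4(self.weight), self.bias)


if __name__ == "__main__":
    layer = QATLinear(64, 32)
    x = torch.randn(4, 64)
    out = layer(x)
    out.sum().backward()  # gradients still reach layer.weight despite rounding
    print(out.shape, layer.weight.grad is not None)
```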

Google developer Omar Sanseviero discussed the use of QAT models in a post on X, noting there’s still room for improvement:

We still recommend experimenting with the models; there is still room for improvement (e.g., we haven't quantized the embeddings, and some people have even gone for 3-bit quantization, which performs better than naive 4-bit).

Users on Hacker News praised the performance of the QAT models:

I had some private "vibe check" questions, and the 4-bit QAT 27B model answered them all correctly. I'm a bit shocked by the information density locked in just 13 GB of weights. If anyone from DeepMind is reading this: Gemma 3 27B is the most impressive open-source model I've used. Great job!

Simon Willison, co-creator of the Django web framework, shared his experiments with these models:

I spent some time accessing my laptop from my phone via Open WebUI and Tailscale, and I think this might be my new favorite general-purpose local model. Ollama seems to use 22GB of RAM at runtime, leaving enough memory for other applications on my 64GB machine.

The Gemma 3 QAT model weights are available on Hugging Face, and the models are supported by several popular LLM frameworks, including Ollama, LM Studio, Gemma.cpp, and llama.cpp.
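For readers who want to try the weights locally, the following is a hypothetical sketch using llama-cpp-python to pull a 4-bit QAT GGUF file from Hugging Face. The repository id and filename pattern are assumptions and should be verified against the actual model pages before running:

```python
# Hypothetical local-inference sketch using llama-cpp-python
# (pip install llama-cpp-python huggingface_hub).
# The repo_id and filename pattern below are assumptions; check the actual
# names on the Gemma 3 QAT GGUF model pages on Hugging Face.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-3-12b-it-qat-q4_0-gguf",  # assumed repo id
    filename="*q4_0.gguf",   # pattern intended to match the 4-bit QAT file
    n_gpu_layers=-1,         # offload all layers to the GPU if VRAM allows
    n_ctx=4096,              # modest context window to keep memory use low
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what QAT does in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```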