Google's Gemma 3 QAT Language Model Can Run Locally on Consumer GPUs

2025-04-30

Google has launched the Gemma 3 QAT series, a set of quantized versions of its open-weight Gemma 3 language models. The models use Quantization-Aware Training (QAT) to maintain high accuracy while the weights are reduced from 16 bits to 4 bits.

All four sizes of the Gemma 3 models now come in QAT versions: 1B, 4B, 12B, and 27B parameters. The quantized versions require roughly 25% of the VRAM needed by their 16-bit counterparts. Google claims that the 27B model can run on a desktop NVIDIA RTX 3090 GPU with 24GB of VRAM, while the 12B model can operate on an NVIDIA RTX 4060 Laptop GPU with 8GB of VRAM. The smaller models can run on mobile phones and other edge devices. By applying quantization-aware training, Google reduced the accuracy loss caused by quantization by up to 54%. According to Google,

While top-tier performance on high-end hardware is ideal for cloud deployments and research, we’ve heard you loud and clear: you want the power of Gemma 3 on the hardware you already have. We’re committed to making powerful AI accessible, which means achieving efficient performance on consumer-grade GPUs found in desktops, laptops, and even phones... Bringing state-of-the-art AI performance to accessible hardware is a key step toward democratizing AI development... We can’t wait to see what you build running Gemma 3 locally!
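As a rough sanity check on the VRAM figures above, a back-of-the-envelope estimate of weight storage alone (ignoring KV cache and activation overhead) lines up with the hardware Google names and with the "13 GB of weights" mentioned by users below. The following Python sketch is illustrative only, not an official sizing guide:

```python
# Back-of-the-envelope weight-memory estimate: parameter count times bits per
# weight. This ignores KV cache, activations, and quantization scale overhead.
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9

for params, name in [(27e9, "27B"), (12e9, "12B")]:
    print(f"{name}: {weight_memory_gb(params, 16):.1f} GB at 16-bit "
          f"-> {weight_memory_gb(params, 4):.1f} GB at 4-bit")
# 27B: 54.0 GB at 16-bit -> 13.5 GB at 4-bit
# 12B: 24.0 GB at 16-bit -> 6.0 GB at 4-bit
```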

Google introduced the Gemma series in 2024, followed by Gemma 2. By incorporating design elements from Google's flagship Gemini LLM, these open-weight models deliver performance competitive with models twice their size. According to Google, the latest iteration, Gemma 3, represents a "top-tier open compact model" thanks to its performance improvements. Gemma 3 also adds vision capabilities to all variants except the 1B model.

While the non-quantized Gemma 3 models show impressive performance for their size, they still demand significant GPU resources. For instance, the non-quantized 12B model requires an RTX 5090 with 32GB of VRAM. To quantize the model weights without sacrificing performance, Google employed QAT. This technique simulates inference-time quantization during training rather than simply quantizing the model post-training.
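The sketch below illustrates the general fake-quantization idea behind QAT: during training, the forward pass sees the rounding error of 4-bit weights, while gradients flow through unchanged via a straight-through estimator. It is a minimal PyTorch illustration of the technique, not Google's actual Gemma 3 training code:

```python
# Minimal sketch of quantization-aware training (QAT): weights are
# "fake-quantized" to int4 in the forward pass, so the model learns to be
# robust to quantization error, while gradients pass through a
# straight-through estimator. Illustrative only.
import torch
import torch.nn as nn


def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric 4-bit quantization of a weight tensor."""
    qmax = 7  # symmetric int4 range is [-8, 7]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -8, 7) * scale
    # Straight-through estimator: forward uses quantized values,
    # backward treats quantization as the identity function.
    return w + (w_q - w).detach()


class QATLinear(nn.Module):
    """Linear layer whose weights see int4 quantization error during training."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, fake_quantize_int4(self.weight), self.bias)


if __name__ == "__main__":
    layer = QATLinear(64, 32)
    x = torch.randn(4, 64)
    out = layer(x)
    out.sum().backward()  # gradients still reach layer.weight despite rounding
    print(out.shape, layer.weight.grad is not None)
```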

Google developer Omar Sanseviero discussed the use of QAT models in a post on X, noting there’s still room for improvement:

We still recommend experimenting with the models; there is still room for improvement (e.g., we haven't quantized the embeddings, and some people have even gone for 3-bit quantization, which performs better than naive 4-bit).

Users on Hacker News praised the performance of the QAT models:

I had some private "vibe check" questions, and the 4-bit QAT 27B model answered them all correctly. I'm a bit shocked by the information density locked in just 13 GB of weights. If anyone from DeepMind is reading this: Gemma 3 27B is the most impressive open-source model I've used. Great job!

Simon Willison, co-creator of the Django web framework, shared his experiments with these models:

I spent some time accessing my laptop from my phone via Open WebUI and Tailscale, and I think this might be my new favorite general-purpose local model. Ollama seems to use 22GB of RAM at runtime, leaving enough memory for other applications on my 64GB machine.

The Gemma 3 QAT model weights are available on Hugging Face, and the models are supported by several popular LLM frameworks, including Ollama, LM Studio, Gemma.cpp, and llama.cpp.
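For readers who want to try the weights locally, the following is a hypothetical sketch using llama-cpp-python to pull a 4-bit QAT GGUF file from Hugging Face. The repository id and filename pattern are assumptions and should be verified against the actual model pages before running:

```python
# Hypothetical local-inference sketch using llama-cpp-python
# (pip install llama-cpp-python huggingface_hub).
# The repo_id and filename pattern below are assumptions; check the actual
# names on the Gemma 3 QAT GGUF model pages on Hugging Face.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="google/gemma-3-12b-it-qat-q4_0-gguf",  # assumed repo id
    filename="*q4_0.gguf",   # pattern intended to match the 4-bit QAT file
    n_gpu_layers=-1,         # offload all layers to the GPU if VRAM allows
    n_ctx=4096,              # modest context window to keep memory use low
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what QAT does in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```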