Google's Latest App Runs AI on Your Phone Offline

2025-06-04

Google has launched a new app that nobody asked for, yet everyone seems eager to try.

AI Edge Gallery quietly went live on May 31, bringing artificial intelligence directly to your smartphone - no cloud, no internet connection, and no handing your data to big tech servers.

The experimental application is available on GitHub under the Apache 2.0 license, which allows virtually unrestricted use for any purpose. It launched first on Android; an iOS version is coming soon.

It can run models like Google's Gemma 3n completely offline, handling everything from image analysis to code generation using only the phone's hardware.

Surprisingly, the performance is quite impressive.

The app currently appears targeted at developers, featuring three main functions: AI Chat for conversations, Image Query for visual analysis, and Prompt Lab for one-time tasks such as text rewriting.

Users can download models from platforms like Hugging Face, though the selection is still limited to a handful of models such as Gemma-3n-E2B and Qwen2.5-1.5B.
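
If you'd rather fetch a model bundle manually - say, to sideload it later - the huggingface_hub library will do it in a couple of lines. A minimal sketch; the repository and file names below are illustrative placeholders, so check the actual model card for the .task release you want:

```python
# Sketch: manually downloading a .task model bundle from Hugging Face.
# repo_id and filename are illustrative placeholders - consult the
# model card for the real LiteRT (.task) release names.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-3n-E2B-it-litert-preview",  # placeholder repo id
    filename="gemma-3n-E2B-it-int4.task",             # placeholder file name
)
print(f"Saved to {path}")
```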

Reddit users immediately questioned the novelty of the app, comparing it with existing solutions like PocketPal.

Some raised security concerns, but the app is hosted on Google's official GitHub account, which undercuts claims of impersonation. No evidence of malware has been found so far.

We tested the application on a Samsung Galaxy S24 Ultra, downloading both the largest and smallest available Gemma 3 models.

Each AI model is a self-contained file containing all its "knowledge" - think of it as downloading a compressed snapshot of everything the model learned during training, rather than a massive database like a local Wikipedia app. The largest Gemma 3 model available in the app is about 4.4 GB, while the smallest is approximately 554 MB.
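
Those file sizes line up with a simple back-of-envelope estimate - parameters times bits per weight - sketched below in Python. The quantization level is our assumption, not something the app discloses:

```python
# Back-of-envelope size estimate: parameters * bits per weight / 8.
# Real bundles also carry a tokenizer and metadata, so actual files
# run slightly larger than the raw weight math suggests.

def model_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate on-disk size in GB for quantized weights."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 1B-parameter model at an assumed 4 bits per weight:
print(f"{model_size_gb(1.0, 4):.2f} GB")  # ~0.50 GB, close to the 554 MB file
```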

Once downloaded, no additional data is needed - the model runs entirely on your device, using the knowledge it acquired before release to answer questions and perform tasks.

Even with slower CPU inference, the experience is reminiscent of GPT-3.5 at launch: the larger model isn't fast, but it's definitely usable.

The smaller Gemma 3 1B model exceeds 20 tokens per second, delivering a smooth experience and reasonably reliable output, provided you double-check its answers.

This becomes particularly important when you're offline, or when you're working with sensitive data you don't want to feed into Google's or OpenAI's training pipelines, which use your data by default unless you opt out.

On the smallest Gemma model, prefill speed - how quickly the model ingests your prompt - exceeds 105 tokens per second on the GPU and reaches 39 tokens per second on the CPU. Token output - the speed at which the model generates its response afterward - averages 10 tokens per second on GPU and 7 on CPU.
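
To put those figures in perspective, here is a rough end-to-end latency estimate for a typical exchange. The speeds are the ones measured above; the prompt and response lengths are arbitrary examples:

```python
# Rough end-to-end latency from the speeds measured above:
# prefill = reading the prompt, decode = generating the answer.

def latency_seconds(prompt_tokens: int, response_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Time to first process the prompt, then stream out the answer."""
    return prompt_tokens / prefill_tps + response_tokens / decode_tps

# A 300-token prompt and a 150-token answer (arbitrary example lengths):
gpu = latency_seconds(300, 150, prefill_tps=105, decode_tps=10)
cpu = latency_seconds(300, 150, prefill_tps=39, decode_tps=7)
print(f"GPU: ~{gpu:.0f} s, CPU: ~{cpu:.0f} s")  # GPU: ~18 s, CPU: ~29 s
```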

Multimodal capabilities performed well in testing.

Curiously, CPU inference on the smaller models appears to produce better results than GPU inference; this may be anecdotal, but it held across several of our tests.

For example, during a visual task, the model using CPU inference accurately guessed my age and my wife's age in the test photo: I'm in my late 30s, and my wife is in her late 20s.

The supposedly superior GPU inference got my age wrong, placing me in my 20s (though I'd rather accept this "information" than reality).

Google's models come with strict moderation, but basic jailbreaking can be achieved with minimal effort.

Unlike centralized services, which can ban users for bypass attempts, local models don't report your prompts: you can experiment with jailbreak techniques without risking your subscription, and ask for information the censored versions wouldn't provide.

Third-party model support is available but comes with some limitations.

The app only accepts .task files, not the widely adopted .safetensors format supported by competitors like Ollama.

This significantly limits the pool of available models. Methods exist to convert .safetensors files to .task, but they aren't for everyone.
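
For the curious, the usual route is to first convert the weights to a LiteRT/TFLite model (Google's ai-edge-torch tooling covers supported architectures), then wrap the result into a .task bundle with MediaPipe's genai bundler. A minimal sketch, assuming the conversion step is already done - the file names and token strings below are placeholders:

```python
# Sketch: wrapping an already-converted TFLite LLM into a .task bundle
# with MediaPipe's genai bundler. File names and tokens are placeholders;
# the weights must first be converted to TFLite (e.g., via ai-edge-torch).
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="model.tflite",        # converted model weights
    tokenizer_model="tokenizer.model",  # matching SentencePiece tokenizer
    start_token="<bos>",                # model-specific start token
    stop_tokens=["<eos>"],              # model-specific stop token(s)
    output_filename="model.task",
    enable_bytes_to_unicode_mapping=False,
)
bundler.create_bundle(config)
```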

Code handling works reasonably well, though specialized models like Codestral would handle programming tasks more efficiently than Gemma 3. Again, a .task version would need to exist, but such a model could be a very effective alternative.

For basic tasks like rewriting, summarizing, and explaining concepts, the models perform excellently without sending data to Samsung or Google servers.

This means users don't need to grant big tech companies access to their input, keyboard, or clipboard, since their own hardware handles all the processing.

A context window of 4096 tokens caps how much text the model can keep in mind at once - roughly 3,000 words of prompt and response combined.