Google Launches Gemma 4: The Licensing Change That Really Matters

The benchmarks are impressive. But that’s not the headline.

Google DeepMind launched Gemma 4 on April 2nd: four open-weight models spanning everything from edge devices to datacenter GPUs, with technical specs that are genuinely solid. The 31B Dense model ranks #3 on the AI Arena leaderboard for open models. The 26B Mixture of Experts model places #6 while activating just 3.8 billion parameters during inference. The unquantized weights of the 31B fit on a single 80GB H100; quantized versions run on consumer GPUs.

But developers who’ve been following the open-weight model space for 18 months already knew Google could ship capable models. What they were watching for was something else: whether Google would finally eliminate the legal friction that made it risky to deploy Gemma in enterprise environments.

It did.


Apache 2.0: The Decision That Changes Everything

Previous generations of Gemma were distributed under a proprietary license that prohibited certain deployment scenarios and reserved Google’s right to terminate access if users didn’t meet its conditions. In practice, this meant enterprise and sovereign deployments treated Gemma as a liability. Legal teams said no. Procurement said wait.

Gemma 4 launches under Apache 2.0. Period.

Hugging Face cofounder Clément Delangue called this “a massive milestone,” and he’s right, though perhaps not for the obvious reasons. Apache 2.0 doesn’t just make Gemma legally safer. It makes it strategically comparable to models from Mistral, Meta, and Chinese labs that have been eroding Google’s enterprise mindshare in the open-model category. A model you can deploy without a legal review is a model that actually gets deployed.

For development teams across Latin America building for regulated industries — fintech, healthcare, government — this matters in ways that benchmark tables won’t capture.


What the Four-Size Strategy Is Really Saying

It’s worth analyzing the lineup carefully:

E2B and E4B — Designed for Android phones, Raspberry Pi, and Jetson Nano hardware. Native audio included. Context window: 128K. These models run up to 4x faster than Gemma 3 on equivalent hardware while using 60% less battery. They’ll also be the foundation of Gemini Nano 4, Google’s next on-device model for Android, coming to consumer devices later this year.

26B MoE (3.8B active) — The latency play. 128 experts, with 3.8 billion parameters active per forward pass. Context window: 256K. This is the model you run when you need fast tokens per second with limited VRAM; the back-of-envelope math below makes the tradeoff concrete. Local code assistants, edge inference, lower-cost cloud deployments.

31B Dense — The quality ceiling. Context window: 256K. Fits unquantized on a single H100; runs quantized on consumer GPUs. Currently the world’s #3 open model on the AI Arena text leaderboard.
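
A quick sanity check makes both memory claims concrete. This sketch assumes 2 bytes per parameter for bf16 weights and roughly 0.5 bytes at 4-bit quantization, and ignores KV cache and activation overhead:

```python
GB = 1e9  # decimal gigabytes, matching how GPU memory is usually quoted

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights; KV cache and activations are extra."""
    return params_billion * 1e9 * bytes_per_param / GB

# 31B Dense in bf16 (2 bytes/param): ~62 GB, which fits an 80 GB H100.
print(weight_memory_gb(31, 2.0))  # 62.0

# The same weights at 4-bit (~0.5 bytes/param): ~15.5 GB, consumer-GPU territory.
print(weight_memory_gb(31, 0.5))  # 15.5

# The 26B MoE must still hold all 26B parameters (~52 GB bf16, ~13 GB at 4-bit),
# but each token only computes through the 3.8B active ones:
print(weight_memory_gb(26, 2.0))  # 52.0
print(26 / 3.8)                   # ~6.8x fewer per-token FLOPs than a dense 26B
```

Note that the MoE’s memory footprint is the whole model; only the compute scales with the active parameters. Quantized, though, its footprint lands around 13 GB, which is how it serves fast tokens on limited VRAM.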

All four models are multimodal — images and video across the family; audio on the two edge variants. Native function calling and JSON structured output are baked in, which matters for agentic workflows.
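
To make that concrete, here is a minimal structured-output sketch against a local Ollama server. Passing a JSON schema as the format field is standard Ollama API behavior; the gemma4:26b-moe tag is a placeholder for whatever the real tag turns out to be:

```python
import json
import requests

# Output must conform to this JSON schema; Ollama's structured-output support
# ("format" as a schema) constrains decoding on the server side.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "unit"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b-moe",  # placeholder tag; the real one may differ
        "messages": [{
            "role": "user",
            "content": "Extract the city and unit from: "
                       "'What's the temperature in Bogotá, in celsius?'",
        }],
        "format": schema,  # constrain the reply to the schema above
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(json.loads(resp.json()["message"]["content"]))
# e.g. {'city': 'Bogotá', 'unit': 'celsius'}
```

Function calling rides on the same mechanism: the schema describes a tool’s arguments, and the constrained output is what your agent loop feeds into the actual function.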

The range is deliberate. Google isn’t launching a model — it’s launching a platform that can run on a Raspberry Pi and in a datacenter H100 under the same Apache 2.0 licensing umbrella. That’s a coherent answer to a question enterprises have been wrestling with: “How do we run AI locally in some contexts and at scale in others, without maintaining two completely separate model families?”


The Competitive Context That Explains the Urgency

The Register put it bluntly in its coverage: this launch is a direct response to open-weight models from Moonshot AI, Alibaba, and Z.AI, many of which already rival frontier proprietary models. Google is offering enterprise customers an alternative: one that won’t absorb sensitive corporate data to train future models, that runs on your own hardware, and that now comes with a license legal teams can actually approve.

For Latin American development teams, the “domestic alternative” framing matters less than sovereignty. A model you can run entirely on your own infrastructure, under Apache 2.0, in over 140 languages, is a model you can build a real data-governance case around. That’s not a minor consideration in regulated verticals.


What to Actually Do with All This

If you’re evaluating open-weight models for production use today:

The E4B is worth testing immediately for any mobile or on-device use case. The combination of native audio, multimodal input, and Apache 2.0 makes it the most commercially viable small model Google has shipped to date.
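
Before touching device hardware, it is worth a desk test of the speed claims. A minimal throughput check against a local Ollama server, using the generation stats Ollama returns on non-streaming requests (the gemma4:e4b tag is a placeholder):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:e4b",  # placeholder tag for the E4B edge model
        "prompt": "Summarize the tradeoffs of on-device inference in three bullets.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
stats = resp.json()

# Ollama returns generation stats (durations in nanoseconds) with the completion.
tokens = stats["eval_count"]
seconds = stats["eval_duration"] / 1e9
print(f"{tokens} tokens in {seconds:.1f}s -> {tokens / seconds:.1f} tok/s")
```

Run the same check against a Gemma 3 tag on the same machine and you have your own version of the 4x number, on your hardware rather than Google’s.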

For local code assistants and RAG pipelines, the 26B MoE is the architecture to watch closely. The inference speed advantage on limited hardware is real — this is where the MoE efficiency argument actually holds up in practice.
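
A sketch of that setup, with the usual caveats: it points the standard OpenAI client at a local vLLM server (which exposes an OpenAI-compatible API on port 8000 by default), the model id is a placeholder, and the retriever is deliberately naive keyword overlap, just enough to show the shape of the pipeline:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; the api_key is required by the client but unused.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

docs = [
    "Refunds are processed within 5 business days of approval.",
    "API keys rotate every 90 days; expired keys return HTTP 401.",
    "The batch endpoint accepts at most 10,000 records per request.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank docs by word overlap with the query.
    In production this would be an embedding index."""
    words = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(words & set(d.lower().split())))[:k]

question = "Why is my API key suddenly returning 401?"
context = "\n".join(retrieve(question))

completion = client.chat.completions.create(
    model="google/gemma-4-26b-moe",  # placeholder model id
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(completion.choices[0].message.content)
```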

For fine-tuning experiments, the 31B Dense has a direct path through Colab, Vertex AI, and Unsloth. Google confirmed day-one support on Hugging Face (Transformers, TRL), vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM, and LM Studio.
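
On the TRL side, the minimal supervised fine-tuning loop looks like this. The Hub id google/gemma-4-31b is a placeholder, and in practice a 31B model on a single GPU would also want LoRA and quantization on top:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Any conversational dataset in the standard "messages" format works here.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="google/gemma-4-31b",  # placeholder Hub id for the 31B Dense
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="gemma4-31b-sft",
        per_device_train_batch_size=1,   # at 31B, lean on accumulation instead
        gradient_accumulation_steps=8,
        bf16=True,
    ),
)
trainer.train()
```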

The Gemma family already surpassed 400 million total downloads and over 100,000 community variants. Gemma 4 under Apache 2.0 is the first version where that community momentum will translate directly into production deployments at scale.

That’s the headline.


Is your team evaluating open-weight models for production? What criteria matter most: performance, cost, licensing, or data sovereignty? Let us know in the comments 👇