Google DeepMind released Gemma 4 on April 2, 2026, and the local AI community immediately started running it on Apple Silicon. The Hacker News thread heated up fast — 139 points, 51 comments, and a brutally honest update from the author: the 26B model killed his Mac mini.
That’s the real story here, and it’s more useful than any setup guide: whether you have 16GB, 24GB, or 32GB of unified memory completely changes which Gemma 4 variant you should use.
We break it down.
What is Gemma 4 and why it matters
Gemma 4 is Google DeepMind’s latest family of open models, released under the Apache 2.0 license — total commercial freedom, no licensing restrictions. It comes in four sizes:
- E2B — edge model, ~3GB, for very limited hardware
- E4B — edge model, ~5GB, the sweet spot for 8–16GB laptops
- 26B MoE — the star model. 26 billion total parameters, but only ~3.8B activate per inference via a Mixture-of-Experts architecture. Near-30B quality at roughly the speed of an 8B model, although all 26B parameters still have to sit in memory.
- 31B Dense — maximum quality, all parameters active. Needs serious hardware.
The 26B MoE is what everyone’s excited about, and rightfully so — near-30B quality at a fraction of the compute cost. On paper. In practice, which variant you can actually run depends heavily on your machine.
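A rough back-of-envelope check on that download size (my own estimate, not an official figure): Q4_K_M works out to roughly 4.85 bits per weight on average, and all 26 billion parameters have to be resident even though only ~3.8B fire per token.
# Weight-only estimate for 26B parameters at ~4.85 bits/weight (typical Q4_K_M average)
awk 'BEGIN { printf "~%.1f GB of weights\n", 26e9 * 4.85 / 8 / 1e9 }'
# Prints ~15.8 GB; higher-precision tensors account for the rest of the ~17GB download,
# and the KV cache sits on top of that once the model is loaded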
The RAM reality: pick your variant
| RAM | Recommended Variant | Download Size | Experience |
|---|---|---|---|
| 16GB | E4B (Q8) | ~5GB | Comfortable, fast |
| 24GB | 8B default (Q4_K_M) | ~9.6GB | Comfortable, ~14GB headroom |
| 24GB | 26B MoE (Q4_K_M) | ~17GB | Risky, swaps under load |
| 32GB+ | 26B MoE (Q4_K_M) | ~17GB | Solid, 8K–16K context window |
| 48GB+ | 31B Dense (Q4) | ~20GB | Maximum quality, generous context |
The honest verdict for 24GB: The 26B MoE technically fits in memory but leaves barely ~7GB for macOS. With concurrent Ollama requests — multiple tabs, code agents, anything running in parallel — the system starts heavy swapping, becomes unresponsive, and can kill processes. The HN thread author tested it all day and dropped down to the 8B default, which leaves ~14GB of headroom and runs without drama.
If you have 24GB, the 8B default is the pragmatic choice. If you have 32GB, that’s when the 26B becomes the target.
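Not sure what you’re working with? Two quick terminal checks, both standard macOS commands: total unified memory, and whether the system is already leaning on swap.
# Total unified memory, in GB
sysctl -n hw.memsize | awk '{ printf "%.0f GB\n", $1 / 1073741824 }'
# Current swap usage; if "used" climbs while a model is loaded, the variant is too big for this machine
sysctl vm.swapusage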
Setup: Ollama on Apple Silicon
Ollama is the simplest path. Since v0.19, it uses Apple’s MLX framework automatically — you don’t need extra configuration.
# Install via Homebrew
brew install --cask ollama
# Pull your variant (pick one)
ollama pull gemma4 # 8B default (~9.6GB) — for 24GB Macs
ollama pull gemma4:26b # 26B MoE (~17GB) — for 32GB+
ollama pull gemma4:e4b # Edge 4B (~5GB) — for 16GB Macs
# Verify GPU acceleration is active
ollama run gemma4 "Hey, what models do you know?"
ollama ps # Should show the CPU/GPU split heavily weighted toward GPU
Important: Make sure you’re on Ollama v0.20.2 or higher. There was a bug in tool call responses with Gemma 4, and you want the fix before you start debugging weird outputs.
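To check which version you’re running and upgrade if needed:
ollama --version               # Should print 0.20.2 or later
brew upgrade --cask ollama     # Grabs the latest release via Homebrew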
Key configuration: keep the model in memory
By default, Ollama unloads models from memory after 5 minutes of inactivity. Reloading a 26B model takes 15–30 seconds — annoying if you’re using it as a dev server. Two fixes:
Environment variables (add to ~/.zshrc):
# Offload as many layers as possible to GPU (unified memory = always want this)
export OLLAMA_NUM_GPU=99
# Keep the model loaded indefinitely (-1 = never unload; set something like "30m" if you prefer a timeout)
export OLLAMA_KEEP_ALIVE=-1
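One caveat worth flagging (an addition of mine, based on how macOS handles environment variables for GUI apps): shell exports only reach a server you start yourself with ollama serve in a terminal. If you use the menu-bar app, set the variables for GUI processes instead, then restart Ollama:
# GUI apps don't read ~/.zshrc; make the settings visible to the menu-bar app
launchctl setenv OLLAMA_NUM_GPU 99
launchctl setenv OLLAMA_KEEP_ALIVE -1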
LaunchAgent for auto-start and preloading:
# Enable Ollama at login via the menu bar icon → Launch at Login
# Then create a preload agent to keep the model warm:
cat << 'EOF' > ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key><string>com.ollama.preload-gemma4</string>
<key>ProgramArguments</key>
<array>
<string>/opt/homebrew/bin/ollama</string>
<string>run</string>
<string>gemma4:latest</string>
<string></string>
</array>
<key>RunAtLoad</key><true/>
<key>StartInterval</key><integer>300</integer>
<key>StandardOutPath</key><string>/tmp/ollama-preload.log</string>
<key>StandardErrorPath</key><string>/tmp/ollama-preload.log</string>
</dict>
</plist>
EOF
launchctl load ~/Library/LaunchAgents/com.ollama.preload-gemma4.plist
This sends an empty prompt every 5 minutes, keeping the model resident in memory. Run ollama ps to confirm — you should see your model with status Forever.
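If you’d rather not maintain a LaunchAgent, the same effect is available per request: Ollama’s API accepts a keep_alive field, so the first call of a session can pin the model. A sketch against the local API, using the article’s model tag:
# Load gemma4 and keep it resident indefinitely (no prompt = load only, -1 = never unload)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "keep_alive": -1
}'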
Custom context window (32GB+ only)
The default context in Ollama is 2048 tokens. With 32GB+ and the 26B variant, you can raise that limit:
cat << 'EOF' > Modelfile
FROM gemma4:26b-q4_K_M
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF
ollama create gemma4-custom -f Modelfile
ollama run gemma4-custom
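A quick sanity check that the new parameters actually took:
ollama show gemma4-custom      # The Parameters section should list num_ctx 8192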
Watch Activity Monitor → Memory Pressure while testing. If it turns yellow, lower the context. With 32GB the 8K sweet spot is solid; with 48GB+ you can comfortably go to 16K.
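If you prefer the terminal over Activity Monitor, macOS ships a memory_pressure utility that reports the same signal (output format varies a bit across macOS versions):
# The last line reports the system-wide free memory percentage; a steep drop means the context is too large
memory_pressure | tail -n 1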
The real story with Gemma 4 for coding agents
This is where you need to be careful. The HN thread is full of people who tried connecting Gemma 4 to OpenCode, LM Studio, or other coding frontends and hit tool call failures.
LM Studio is currently failing 100% of tool calls (a Jinja template bug). Ollama v0.20.2 fixes a specific tool call bug, but early community reports on coding agent performance are mixed — several people switched back to Qwen for code tasks.
The Gemma 4 family is brand new. These are growing pains that will resolve in the coming weeks as implementations stabilize. For now: Gemma 4 excels at text processing, data extraction, JSON output, and general reasoning. For coding agents with complex tool use, wait for the dust to settle in your preferred frontend before committing.
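For the things it already does well, like structured extraction, Ollama’s built-in JSON mode is usable today. A minimal sketch; the prompt is illustrative and the model tag follows the article:
# Constrain the output to valid JSON via the format field
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4",
  "prompt": "Extract name and city as JSON: Ada Lovelace lives in London.",
  "format": "json",
  "stream": false
}'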
Performance reference (llama.cpp on M5 32GB)
A developer in the HN thread running gemma4-26B-A4B-it-GGUF:Q4_K_M via llama.cpp on an M5 Mac with 32GB reported ~38 tokens per second — fast enough to feel genuinely interactive. That’s generation speed; prompt processing is faster.
For reference: the $1,399–$1,599 M4 Pro Mac Mini with 24GB of unified memory running the 8B model gives you a comfortable, quiet dev server setup with zero API costs.
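Part of what makes that setup useful: anything that speaks the OpenAI API can point at Ollama’s compatibility endpoint, so existing tooling only needs a new base URL. A sketch with the article’s model tag:
# OpenAI-compatible chat endpoint, served locally by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Summarize unified memory in one sentence."}]
  }'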
Summary
Pick your variant before running ollama pull:
- 16GB Mac: gemma4:e4b — no drama
- 24GB Mac: gemma4 (8B default) — comfortable, the honest choice
- 32GB+ Mac: gemma4:26b — the star model, actually works
- 48GB+ Mac Studio: gemma4:31b — maximum quality, no compromises
Set OLLAMA_NUM_GPU=99, tweak the keep-alive, add the LaunchAgent if you want the model warm from startup. Update to Ollama v0.20.2+ before testing tool use.
Having a class-26B model running locally with zero API costs and total data privacy is genuinely useful. Just pick the right variant for your hardware.
Sources: Hacker News Thread — Original author’s Gist — Gemma 4 Hardware Guide — Ollama library: gemma4
