Gemini 3.1 Ultra: 2M Token Context, Native Code Execution, and What It Really Means for Devs

Google launched Gemini 3.1 Ultra in April 2026, and after spending time with it, I want to cut through the benchmark noise and focus on what matters for developers who are actually building things.

The 2 million token context window is real — and it’s different

Two million tokens sounds like a marketing number until you realize what fits: a complete mid-size codebase, an entire book, dozens of documentation files, or a long video — all in a single prompt. I’ve seen 1M token windows degrade badly past the halfway mark; Google claims that 3.1 Ultra maintains coherence into the final third of long contexts. That’s the claim worth testing.
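One way to test the coherence claim yourself is a needle-in-a-haystack probe: bury a single fact at a known depth in a long filler context and ask the model to retrieve it. The sketch below only builds the probes and scores an answer; it makes no API call. The filler sentences, the needle, and the helper names `build_probe` and `recalled` are all made up for illustration.

```python
import random

def build_probe(filler_sentences, needle, depth, total_sentences, seed=0):
    """Build a long context with `needle` inserted at fractional `depth`.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    Returns the context string and the sentence index of the needle.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so a failing probe can be reproduced
    haystack = [rng.choice(filler_sentences) for _ in range(total_sentences)]
    position = round(depth * total_sentences)
    haystack.insert(position, needle)
    return " ".join(haystack), position

def recalled(answer, expected):
    """Loose scoring: does the answer contain the expected fact?"""
    return expected.lower() in answer.lower()

# Hypothetical filler and needle; swap in text from your own domain.
FILLER = [
    "The build finished without warnings.",
    "Logs were rotated at midnight.",
    "The cache was warmed before the deploy.",
]
NEEDLE = "The rollback code for release 7 is PELICAN-42."

# Depths in the final third, where the coherence claim matters most.
probes = {
    depth: build_probe(FILLER, NEEDLE, depth, total_sentences=1000)
    for depth in (0.70, 0.85, 0.99)
}
```

To use it, send each context plus a question such as "What is the rollback code for release 7?" to the model and pass the reply to `recalled(reply, "PELICAN-42")`. Scale `total_sentences` up until the context approaches the window you care about. Simple retrieval is a weaker test than multi-step reasoning over a long context, so treat a pass here as necessary, not sufficient.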

The use case this enables isn’t “chat with your codebase” — it’s eliminating the chunking, summarization, and retrieval pipelines that make long-context applications fragile. If the coherence claim holds, that’s an architectural simplification, not just a capability improvement.
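To make the "no chunking pipeline" idea concrete, here is a minimal sketch of the alternative: concatenate the files into one prompt and check that the result fits the window. It assumes a rough four-characters-per-token heuristic, which is not the model's real tokenizer, and the names `pack_files` and `read_tree` are mine, not part of any SDK.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4          # rough heuristic, NOT the model's tokenizer
CONTEXT_BUDGET = 2_000_000   # window size quoted in this article

def estimate_tokens(text):
    """Crude token estimate; replace with a real token counter before relying on it."""
    return len(text) // CHARS_PER_TOKEN + 1

def pack_files(files, budget=CONTEXT_BUDGET, reserve=50_000):
    """Concatenate {path: text} into one prompt without exceeding the budget.

    `reserve` leaves room for instructions and the model's reply.
    Returns (prompt, estimated_tokens_used, skipped_paths).
    """
    parts, used, skipped = [], 0, []
    limit = budget - reserve
    for path, text in sorted(files.items()):
        block = f"=== {path} ===\n{text}\n"
        cost = estimate_tokens(block)
        if used + cost > limit:
            skipped.append(path)  # would overflow the window, so leave it out
            continue
        parts.append(block)
        used += cost
    return "".join(parts), used, skipped

def read_tree(root, suffixes=(".py", ".md")):
    """Load matching files under `root` into a {path: text} dict."""
    return {
        str(p): p.read_text(errors="ignore")
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    }
```

A typical call would be `prompt, used, skipped = pack_files(read_tree("my_project"))`, where `my_project` is a placeholder path. If `skipped` comes back non-empty, the codebase does not fit and you are back to some form of selection or retrieval.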

Native multimodal reasoning — not transcription

Many “multimodal” systems quietly serialize their inputs: they transcribe audio to text, describe images in text, then reason over the combined text. Gemini 3.1 Ultra reasons natively over video frames, audio waveforms, images, and text simultaneously. This matters for tasks like reviewing a screen recording of a bug, analyzing a technical diagram alongside its source code, or building agents that operate in mixed-media environments without losing fidelity in translation.

Native code execution — no plugin

This is the feature I’d highlight most for developers: 3.1 Ultra writes Python, executes it in a sandboxed environment, observes the output, and revises — all natively, without a third-party Code Interpreter plugin. The loop is tighter, the integration is cleaner, and the model makes decisions based on actual runtime behavior rather than predicted output.

For data analysis, automated testing, or any workflow where “write code, run it, adjust” is the central loop, this matters.
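For readers who have not built this loop by hand, here is a rough sketch of what "write code, run it, adjust" looks like when you orchestrate it yourself. It is not how Gemini implements the feature. The `generate` callable is a stand-in for a model call, `stub_generate` fakes one so the example runs offline, and a plain subprocess is not a real sandbox.

```python
import subprocess
import sys

def run_python(code, timeout=5):
    """Run `code` in a separate interpreter process.

    Returns (ok, output). This isolates crashes but is NOT a security sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()

def write_run_revise(generate, task, max_rounds=3):
    """Ask for code, run it, and feed any error back until a run succeeds."""
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        code = generate(task, feedback)
        ok, output = run_python(code)
        if ok:
            return round_number, output
        feedback = output  # the traceback becomes context for the next attempt
    raise RuntimeError(f"no passing attempt after {max_rounds} rounds: {feedback}")

def stub_generate(task, feedback):
    """Hypothetical stand-in for a model: buggy first draft, fixed second draft."""
    if "NameError" in feedback:
        return "total = sum(range(5))\nprint(total)"
    return "print(totl)"  # deliberate typo to trigger a revision

rounds, result = write_run_revise(stub_generate, "sum the integers 0 through 4")
```

With the stub, the first attempt fails with a `NameError` and the second succeeds. The appeal of native execution is that the model runs this loop internally, so you do not have to maintain the glue code or the execution environment.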

The numbers: 94% on GPQA Diamond

GPQA Diamond measures graduate-level reasoning in biology, chemistry, and physics: multi-step problems that require genuine domain understanding, not pattern matching. The headline score is 94%. Benchmark performance and production performance often diverge, though, and 3.1 Ultra is new enough that real-world evaluations are still sparse. The number is notable; treat it as a claim to validate on your own workloads, not a result to celebrate.

Where it’s available

  • Gemini Advanced (gemini.google.com) — consumer-facing, by subscription
  • Google AI Studio — free tier for experimentation, with rate limits
  • Gemini API — for production integration

A note on costs: the 2M token window is powerful, but per-token costs at that scale add up. For exploratory or high-volume workloads, run the numbers against current Gemini API pricing before committing to architectural decisions built around maximum context.
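A quick back-of-the-envelope calculation shows why this matters. The prices below are placeholders I made up to show the arithmetic, not Google's actual rates; substitute the real per-million-token prices before drawing conclusions.

```python
def prompt_cost(input_tokens, output_tokens,
                input_price_per_million, output_price_per_million):
    """Cost of one request in dollars, given per-million-token prices."""
    return (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000

# PLACEHOLDER prices in dollars per million tokens; NOT real Gemini pricing.
INPUT_PRICE = 2.00
OUTPUT_PRICE = 10.00

# One request that fills the whole window vs. one that sends a retrieved slice.
full_context = prompt_cost(2_000_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)
retrieved_slice = prompt_cost(20_000, 2_000, INPUT_PRICE, OUTPUT_PRICE)

# Daily spend at a hypothetical 1,000 requests per day.
daily_full = full_context * 1_000
daily_slice = retrieved_slice * 1_000
```

At these placeholder prices, a full-window request costs about $4.02 against $0.06 for the retrieved slice, which is roughly $4,020 versus $60 a day at 1,000 requests. The real ratio depends on actual pricing and on any caching discounts, but the shape of the tradeoff is the point: dropping the retrieval pipeline simplifies the architecture and moves the cost into every request.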

My take

Gemini 3.1 Ultra is the most complete multimodal model Google has released. Native code execution and the coherence claims at scale are the two things I’m watching most closely; if both hold under production conditions, this changes how I’d approach certain agentic architectures. The context window alone is a genuine engineering advantage over any model capped at 200K tokens.

The question isn’t whether 3.1 Ultra is impressive. It is. The question is whether the coherence claims survive contact with real workloads — and that answer is going to come from the community over the coming weeks, not from Google’s benchmarks.