SIMA 2 vs SIMA 1: DeepMind's Game AI Key Upgrades

Judged by the numbers alone, SIMA 2's jump in task success rate from SIMA 1's 31% to 65% looks like a simple case of "performance doubled." In AI research, though, a leap of this size often signals a qualitative change rather than a quantitative one: a watershed between "lab demonstration" and "approaching usable."

Core Metrics Comparison

Metric	SIMA 1	SIMA 2	Improvement
Cross-game Task Success Rate	31%	65%	+110%
Complex Instruction Understanding	Basic (single-step)	Advanced (multi-step + conditional)	Qualitative leap
Unseen Game Generalization	Moderate	Strong	Significant
Training Data Scale	~5,000 hours	~50,000 hours (incl. self-generated)	+900%
Inference Latency	~200ms/action	~150ms/action	25% lower (faster)
Environment Adaptation	Requires pre-training	Zero-shot adaptation	Revolutionary
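
The percentages in the Improvement column can be reproduced from the two model columns. The sketch below is only a sanity check on the table's arithmetic; the helper name relative_change is illustrative and not taken from any SIMA material.

```python
def relative_change(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    return (new - old) / old * 100.0

# Values copied from the comparison table above.
success_gain = relative_change(31, 65)        # task success rate, in percent
data_gain = relative_change(5_000, 50_000)    # training data, in hours
latency_change = relative_change(200, 150)    # ms per action; negative means faster

print(f"success rate:  {success_gain:+.0f}%")
print(f"training data: {data_gain:+.0f}%")
print(f"latency:       {latency_change:+.0f}%")
```

Note that the latency figure comes out negative: time per action drops by 25%, which is the sense in which inference is "25% better."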

What Does "31% vs 65%" Mean?

SIMA 1's 31%: Out of 10 tasks, ~3 complete successfully. The other ~7 fail (the agent gets lost, picks the wrong tool, or gets stuck).
SIMA 2's 65%: Out of 10 tasks, ~6-7 complete successfully. Only 3-4 require human intervention.

Why This Is a Qualitative Change:

A 31% success rate is at the "occasionally gets lucky" level, while 65% is at the "reliable most of the time" level. That gap marks the threshold between a "tech demo" and a "practical tool."

Architecture Comparison

SIMA 1's Architecture (2024)

Design Philosophy: Specialized vision encoder + behavior policy network

Core Components:

Vision Encoder: Custom-trained CNN processing game screenshots
Language Encoder: BERT-like transformer for instruction understanding
Policy Network: Reinforcement learning model for decision-making

Limitations:

❌ Vision encoder requires large amounts of game-specific training
❌ Limited generalization (new games need retraining)
❌ Weak language understanding (only simple instructions)

SIMA 2's Architecture (2025)

Design Philosophy: Direct integration of Gemini 2.5 Flash Lite

Core Innovation: Replacing specialized encoders with Gemini

Example Difference:

Instruction: "If you see iron ore, mine it; otherwise continue exploring"

SIMA 1: Identifies keywords but doesn't understand conditional logic → May execute incorrectly
SIMA 2: Fully understands conditional logic → Scans for iron ore → Executes correctly based on condition
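
The difference can be illustrated with a toy sketch. This is not how either model works internally; ConditionalInstruction, keyword_only, and evaluate are hypothetical names, and the structured instruction is written by hand here, whereas SIMA 2 would have to derive something equivalent from the raw text.

```python
from dataclasses import dataclass

@dataclass
class ConditionalInstruction:
    condition_object: str  # object that must be visible for the "then" branch
    then_action: str       # action taken when the object is visible
    else_action: str       # action taken otherwise

def keyword_only(text: str) -> str:
    """Toy keyword matcher: fires on the first action word and ignores the condition."""
    for action in ("mine", "explore"):
        if action in text.lower():
            return action
    return "noop"

def evaluate(instr: ConditionalInstruction, visible_objects: set) -> str:
    """Pick the branch based on what is currently visible."""
    if instr.condition_object in visible_objects:
        return instr.then_action
    return instr.else_action

text = "If you see iron ore, mine it; otherwise continue exploring"
instr = ConditionalInstruction("iron ore", "mine", "explore")

print(keyword_only(text))                       # same answer whatever is on screen
print(evaluate(instr, {"tree", "iron ore"}))    # ore visible
print(evaluate(instr, {"tree"}))                # no ore visible
```

The keyword matcher returns the same action regardless of the scene, which is the failure mode described for SIMA 1. The conditional version changes its answer depending on whether iron ore is visible.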

Key Upgrades:

Gemini Integration: Inherent vision + language + reasoning capabilities
Task Planning Module: Can generate multi-step action sequences instead of single actions
Zero-Shot Adaptation: Can play new games directly without additional training

Training Method Comparison

SIMA 1's Training Process

Phase 1: Human demonstration collection (5,000+ hours of gameplay, manually annotated)
Phase 2: Imitation learning (AI watches and copies human players)
Phase 3: Reinforcement learning fine-tuning

Problems: High cost, limited data, poor generalization

SIMA 2's Training Process

Phase 1: Inherit Gemini's knowledge (already knows common objects and behaviors)
Phase 2: A small set of human demonstrations (~1,000 hours, one-fifth of what SIMA 1 used)
Phase 3: Self-improvement loop (AI generates tasks → attempts → learns → generates more tasks)

Analogy:

SIMA 1: The teacher gives 100 example problems, and the student memorizes them
SIMA 2: The teacher gives 10 examples, and the student grasps the principles and creates 1,000 practice problems of their own
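
The shape of the Phase 3 loop (generate tasks → attempt → learn → generate more tasks) can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the starting skill of 0.3, the 20 tasks per round, the 0.01 learning increment, and the 0.95 cap are invented numbers, and a single "skill" probability stands in for an actual model update.

```python
import random

def self_improvement_loop(rounds: int, seed: int = 0) -> list:
    """Toy loop: generate tasks -> attempt them -> learn from successes -> repeat."""
    rng = random.Random(seed)
    skill = 0.3        # hypothetical starting success probability
    experience = []    # self-generated successful attempts kept for learning
    history = []
    for round_index in range(rounds):
        # 1. Generate a batch of new tasks (plain labels in this toy version).
        tasks = [f"task-{round_index}-{i}" for i in range(20)]
        # 2. Attempt each task; success is a coin flip weighted by current skill.
        successes = [task for task in tasks if rng.random() < skill]
        # 3. Keep successful attempts as new training data.
        experience.extend(successes)
        # 4. "Learn": nudge skill upward in proportion to the new data, with a cap.
        skill = min(0.95, skill + 0.01 * len(successes))
        history.append(skill)
    return history

history = self_improvement_loop(rounds=10)
print([round(s, 2) for s in history])
```

The point of the sketch is the feedback structure: better skill yields more successful self-generated data, which in turn raises skill, without additional human demonstrations in the loop.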

Generalization Capability Comparison

SIMA 1's Generalization

Test: Trained on Valheim → Tested on No Man's Sky

Results:

✅ Similar tasks (e.g., "chopping wood" in both games): ~40% success
❌ Different tasks (e.g., "pilot spaceship"): <10% success
❌ Completely new game (ASKA): Fails almost entirely

SIMA 2's Generalization

Test: Trained on Goat Simulator 3 → Tested on ASKA (Viking survival game)

Results:

✅ Success rate of 65% (across game genres)
✅ Zero-shot adaptation (no additional training)
✅ Understands concepts not in Goat Simulator 3 (like "crafting tools")

Key Insight:

SIMA 1: Learned "in Valheim, see this pixel pattern → press this key"

SIMA 2: Learned "trees are harvestable resources → find tool → aim → execute chopping action"

Practical Application Scenarios

What SIMA 1 Can Do

✅ Repetitive tasks in single games (auto-harvest in Valheim)
❌ Limited to trained tasks only
❌ Doesn't work in different games

What SIMA 2 Can Do

✅ Universal tasks across games ("build shelter" in any survival game)
✅ Complex multi-step planning ("Build boat → sail to shore → chop wood → build house")
✅ Handle emergencies (attacked by monsters → flee or fight back)
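
Multi-step planning of the kind listed above can be pictured as running an ordered list of sub-goals and reporting where the sequence breaks. The sketch below is a generic illustration, not SIMA 2's planner; run_plan and the stand-in execute callback are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def run_plan(steps: List[str],
             execute: Callable[[str], bool]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; return (completed steps, first failed step or None)."""
    completed = []
    for step in steps:
        if not execute(step):
            return completed, step   # stop at the first failure
        completed.append(step)
    return completed, None

plan = ["build boat", "sail to shore", "chop wood", "build house"]

# Stand-in executor for the example: pretend "chop wood" fails (say, no axe yet).
completed, failed = run_plan(plan, lambda step: step != "chop wood")
print("completed:", completed)
print("failed at:", failed)
```

An agent that knows which step failed can replan from that point (for example, by crafting the missing tool) rather than restarting the whole task.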

Cost and Efficiency Comparison

Item	SIMA 1	SIMA 2
Model Size	~1B parameters	~20B parameters (Gemini)
Inference Cost	Low	Medium
Training Cost	Medium	High (but more efficient)
Time to Learn New Game	6-8 weeks	1-2 weeks

Efficiency Boost: 3-4x
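
The source does not say which row the 3-4x figure summarizes. Assuming it refers to "Time to Learn New Game," the ratio can be checked against the week ranges in the table:

```python
# Week ranges copied from the table above.
sima1_weeks = (6, 8)
sima2_weeks = (1, 2)

def speedup(old_weeks: float, new_weeks: float) -> float:
    """How many times faster the new figure is than the old one."""
    return old_weeks / new_weeks

# SIMA 2 taking its full 2 weeks versus SIMA 1's 6 and 8 weeks.
slow_case = [speedup(w, sima2_weeks[1]) for w in sima1_weeks]
# SIMA 2 taking only 1 week versus SIMA 1's 6 and 8 weeks.
fast_case = [speedup(w, sima2_weeks[0]) for w in sima1_weeks]

print("SIMA 2 at 2 weeks:", slow_case)
print("SIMA 2 at 1 week: ", fast_case)
```

Under that assumption, 3-4x matches the case where SIMA 2 needs the full 2 weeks. If it needs only 1 week, the same table implies 6-8x.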

Limitations Comparison

SIMA 1's Limitations

❌ Weak generalization
❌ Limited language understanding
❌ Poor planning ability
❌ High data dependency

SIMA 2's Limitations

❌ Success rate is still not high enough (65% success means 35% failure)
❌ Can't handle very fast-paced games (e.g., FPS titles requiring millisecond reactions)
❌ Depends on clear visuals (struggles with blur, fog, and glare)
❌ Cost is still very high (~$0.10-$1.00 per minute)
❌ Ethics and safety issues unresolved
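
To put the per-minute cost in perspective, it can be converted to per-hour and per-action figures. This is a rough sketch, not a published breakdown: it assumes the agent acts back to back at the ~150ms/action latency quoted in the metrics table, and cost_breakdown is an illustrative helper.

```python
MS_PER_MINUTE = 60_000

def cost_breakdown(cost_per_minute: float, ms_per_action: float = 150.0) -> dict:
    """Convert a per-minute cost into per-hour and per-action figures."""
    # Assumption: actions are issued continuously with no idle time between them.
    actions_per_minute = MS_PER_MINUTE / ms_per_action
    return {
        "per_hour": cost_per_minute * 60,
        "actions_per_minute": actions_per_minute,
        "per_action": cost_per_minute / actions_per_minute,
    }

low = cost_breakdown(0.10)    # lower bound quoted above
high = cost_breakdown(1.00)   # upper bound quoted above

print(f"per hour:   ${low['per_hour']:.2f} to ${high['per_hour']:.2f}")
print(f"per action: ${low['per_action']:.5f} to ${high['per_action']:.5f}")
```

At these rates an hour of play costs roughly $6 to $60, which helps explain why cost reduction appears among the short-term goals.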

Future Outlook

SIMA 1's Future

Status: "Retired," no longer actively developed

Legacy: Proved feasibility of "general game AI"

SIMA 2's Future

Short-term (1-2 years):

Expand game coverage (9 → 50+ games)
Improve success rate (65% → 80%+)
Reduce costs (optimize inference)

Mid-term (3-5 years):

Possible limited Beta (API form)
Integration into game dev tools (automated testing)
Explore "AI co-play" scenarios

Long-term (5-10 years):

Transfer to real robots
General computer operation AI
Part of embodied AGI

Summary: Three Key Takeaways

SIMA 2 is not just a "better SIMA 1" - The architecture shifts from a specialized model to general-purpose Gemini, and training shifts from manual annotation to a self-improvement loop. This is a paradigm change.
A 65% success rate is a milestone, not an endpoint - It marks the leap from "lab toy" to "practical tool," but there is still some distance to go before the system is truly reliable.
Focus on long-term value: from games to robots - SIMA 2's true significance isn't "playing games" but exploring the path to general embodied intelligence.
