Technical Comparison

SIMA 2 vs SIMA 1: More Than Just Doubling Performance

Behind the jump from a 31% to a 65% success rate lie three fundamental architectural breakthroughs

On the numbers alone, SIMA 2's task success rate rising from SIMA 1's 31% to 65% looks like "performance doubled." In AI research, though, a leap of this size often signals a qualitative rather than a quantitative change: a watershed between "lab demonstration" and "approaching usable."

Core Metrics Comparison

Metric | SIMA 1 | SIMA 2 | Improvement
Cross-game Task Success Rate | 31% | 65% | +110%
Complex Instruction Understanding | Basic (single-step) | Advanced (multi-step + conditional) | Qualitative leap
Unseen Game Generalization | Moderate | Strong | Significant
Training Data Scale | ~5,000 hours | ~50,000 hours (incl. self-generated) | +900%
Inference Speed | ~200ms/action | ~150ms/action | +25%
Environment Adaptation | Requires pre-training | Zero-shot adaptation | Revolutionary
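
The percentages in the Improvement column follow from simple relative-change arithmetic. The sketch below uses only the figures from the table; the function names are illustrative and not taken from any SIMA tooling. Note that the speed row describes latency, where lower is better: 200 ms to 150 ms is a 25% reduction in time per action, which corresponds to roughly 33% more actions per second.

```python
def relative_increase(old: float, new: float) -> float:
    """Percentage increase from old to new (for metrics where higher is better)."""
    return (new - old) / old * 100.0


def relative_reduction(old: float, new: float) -> float:
    """Percentage reduction from old to new (for metrics where lower is better)."""
    return (old - new) / old * 100.0


success_gain = relative_increase(31, 65)        # task success rate, in percent
data_gain = relative_increase(5_000, 50_000)    # training data, in hours
latency_cut = relative_reduction(200, 150)      # milliseconds per action
# The same speed-up expressed as throughput (actions per second).
throughput_gain = relative_increase(1000 / 200, 1000 / 150)

print(f"Success rate:  +{success_gain:.0f}%")
print(f"Training data: +{data_gain:.0f}%")
print(f"Latency:       -{latency_cut:.0f}%")
print(f"Throughput:    +{throughput_gain:.0f}%")
```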

What Does "31% vs 65%" Mean?

  • SIMA 1's 31%: Out of 10 tasks, about 3 complete successfully. The other 7 fail (the agent gets lost, picks the wrong tool, or gets stuck).
  • SIMA 2's 65%: Out of 10 tasks, about 6-7 complete successfully. Only 3-4 require human intervention.

Why This Is a Qualitative Change:

31% is the "occasionally gets lucky" level; 65% is the "reliable most of the time" level. That gap is the threshold between "tech demo" and "practical tool."
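
To make the gap concrete, the two success rates can be turned into expected counts. This is a rough sketch that assumes every task attempt is independent and has a fixed success probability, a simplification the source does not state. Under that assumption, the expected number of attempts until the first success is the mean of a geometric distribution, 1/p.

```python
def expected_successes(p: float, n_tasks: int) -> float:
    """Expected number of successful tasks out of n_tasks at success rate p."""
    return p * n_tasks


def expected_attempts_until_success(p: float) -> float:
    """Mean of a geometric distribution: average tries needed for one success."""
    return 1.0 / p


for name, p in (("SIMA 1", 0.31), ("SIMA 2", 0.65)):
    wins = expected_successes(p, 10)
    tries = expected_attempts_until_success(p)
    print(f"{name}: ~{wins:.1f} of 10 tasks succeed, ~{tries:.2f} attempts per success")
```

Read this way, SIMA 1 would need a little over three attempts per success on average, while SIMA 2 would need about one and a half.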

Architecture Comparison

SIMA 1's Architecture (2024)

Design Philosophy: Specialized vision encoder + behavior policy network

Core Components:

  • Vision Encoder: Custom-trained CNN processing game screenshots
  • Language Encoder: BERT-like transformer for instruction understanding
  • Policy Network: Reinforcement learning model for decision-making

Limitations:

  • ❌ Vision encoder requires large amounts of game-specific training
  • ❌ Limited generalization (new games need retraining)
  • ❌ Weak language understanding (only simple instructions)

SIMA 2's Architecture (2025)

Design Philosophy: Direct integration of Gemini 2.5 Flash Lite

Core Innovation: Replacing specialized encoders with Gemini

Example Difference:

Instruction: "If you see iron ore, mine it; otherwise continue exploring"

  • SIMA 1: Identifies keywords but doesn't understand conditional logic → May execute incorrectly
  • SIMA 2: Fully understands conditional logic → Scans for iron ore → Executes correctly based on condition
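
The difference can be illustrated with a toy example. This is not how either system processes language (SIMA 2 relies on Gemini, not hand-written rules); it only contrasts reacting to a keyword with honoring the condition. All names here (`keyword_agent`, `conditional_agent`, the `visible` set) are hypothetical.

```python
import re

INSTRUCTION = "If you see iron ore, mine it; otherwise continue exploring"


def keyword_agent(instruction: str, visible: set[str]) -> str:
    """Keyword spotting: reacts to the verb 'mine' and never consults the scene."""
    return "mine" if "mine" in instruction.lower() else "explore"


def conditional_agent(instruction: str, visible: set[str]) -> str:
    """Parses 'If you see X, A; otherwise B' and checks X against the scene."""
    match = re.match(
        r"if you see (?P<obj>.+?), (?P<then>.+?); otherwise (?P<else>.+)",
        instruction,
        flags=re.IGNORECASE,
    )
    if match is None:
        raise ValueError("unsupported instruction format")
    # Pick the branch based on whether the named object is currently visible.
    branch = "then" if match["obj"].lower() in visible else "else"
    return match[branch]


scene = {"tree", "river"}  # no iron ore in view
print(keyword_agent(INSTRUCTION, scene))      # mines even though no ore is visible
print(conditional_agent(INSTRUCTION, scene))  # falls through to the 'otherwise' branch
```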

Key Upgrades:

  1. Gemini Integration: Inherent vision + language + reasoning capabilities
  2. Task Planning Module: Can generate multi-step action sequences instead of single actions
  3. Zero-Shot Adaptation: Can play new games directly without additional training

Training Method Comparison

SIMA 1's Training Process

  1. Phase 1: Human demonstration collection (5,000+ hours of gameplay, manually annotated)
  2. Phase 2: Imitation learning (AI watches and copies human players)
  3. Phase 3: Reinforcement learning fine-tuning

Problems: High cost, limited data, poor generalization

SIMA 2's Training Process

  1. Phase 1: Inherit Gemini's knowledge (already knows common objects and behaviors)
  2. Phase 2: A small set of human demonstrations (~1,000 hours, about one-fifth of SIMA 1's)
  3. Phase 3: Self-improvement loop (AI generates tasks → attempts → learns → generates more tasks)
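
The shape of the Phase 3 loop can be sketched as follows. The source does not describe how tasks are generated or how the model is updated, so everything below is a toy stand-in: tasks are random difficulty values, an attempt succeeds when a scalar `skill` exceeds the difficulty, and "learning" is a small fixed increment per success.

```python
import random


def self_improvement_loop(rounds: int, tasks_per_round: int, seed: int = 0) -> list[float]:
    """Toy generate -> attempt -> learn loop; returns skill after each round."""
    rng = random.Random(seed)
    skill = 0.3  # hypothetical starting success probability
    history = []
    for _ in range(rounds):
        # 1. Generate tasks (here: just difficulty values in [0, 1)).
        tasks = [rng.random() for _ in range(tasks_per_round)]
        # 2. Attempt each task; an attempt succeeds when skill exceeds difficulty.
        successes = [d for d in tasks if d < skill]
        # 3. Learn: every success nudges skill upward, capped at 1.0.
        skill = min(1.0, skill + 0.01 * len(successes))
        history.append(skill)
    return history


for round_no, skill in enumerate(self_improvement_loop(rounds=5, tasks_per_round=20), 1):
    print(f"round {round_no}: skill = {skill:.2f}")
```

The point of the sketch is the feedback structure: each round's successes raise the skill level, so the next round succeeds on more of its generated tasks and produces more training signal.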

Analogy:

  • SIMA 1: Teacher gives 100 example problems, student memorizes
  • SIMA 2: Teacher gives 10 examples, student understands principles and creates 1,000 practice problems

Generalization Capability Comparison

SIMA 1's Generalization

Test: Trained on Valheim → Tested on No Man's Sky

Results:

  • ✅ Similar tasks (both "chopping wood"): ~40% success
  • ❌ Different tasks ("pilot spaceship"): <10% success
  • ❌ Completely new game (ASKA): Fails almost entirely

SIMA 2's Generalization

Test: Trained on Goat Simulator 3 → Tested on ASKA (Viking survival game)

Results:

  • ✅ Success rate 65% (across game genres)
  • ✅ Zero-shot adaptation (no additional training)
  • ✅ Understands concepts not in Goat Simulator 3 (like "crafting tools")

Key Insight:

SIMA 1: Learned "in Valheim, see this pixel pattern → press this key"

SIMA 2: Learned "trees are harvestable resources → find tool → aim → execute chopping action"
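
A toy contrast between the two kinds of learned knowledge, purely for illustration: neither system stores policies as dictionaries, and the sprite name and key binding are invented placeholders. A lookup keyed on game and pixel pattern returns nothing in an unseen game, while a plan keyed on the concept carries over.

```python
# Game-specific policy: keyed on (game, visual pattern). Entries are hypothetical.
PIXEL_POLICY = {("Valheim", "tree_sprite_17"): "press E"}

# Concept-level policy: keyed on what the object affords, independent of the game.
CONCEPT_PLANS = {"harvestable": ["find tool", "aim", "execute chopping action"]}


def pixel_agent(game: str, pattern: str):
    """Returns the memorized key press, or None for an unseen game or pattern."""
    return PIXEL_POLICY.get((game, pattern))


def concept_agent(object_concept: str):
    """Returns the action plan for a concept, regardless of which game it is in."""
    return CONCEPT_PLANS.get(object_concept)


print(pixel_agent("Valheim", "tree_sprite_17"))  # memorized response
print(pixel_agent("ASKA", "tree_sprite_17"))     # None: nothing learned for this game
print(concept_agent("harvestable"))              # same plan in any game
```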

Practical Application Scenarios

What SIMA 1 Can Do

  • ✅ Repetitive tasks in single games (auto-harvest in Valheim)
  • ❌ Limited to trained tasks only
  • ❌ Doesn't work in different games

What SIMA 2 Can Do

  • ✅ Universal tasks across games ("build shelter" in any survival game)
  • ✅ Complex multi-step planning ("Build boat → sail to shore → chop wood → build house")
  • ✅ Handle emergencies (attacked by monsters → flee or fight back)
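
Multi-step planning with interruption handling can be sketched as a queue of steps that an emergency event may preempt. This is a hypothetical illustration of the behavior described above, not SIMA 2's planner; `run_plan` and the tick-indexed `events` mapping are assumptions made for the example.

```python
from collections import deque

PLAN = ["build boat", "sail to shore", "chop wood", "build house"]


def run_plan(steps: list[str], events: dict[int, str]) -> list[str]:
    """Executes steps in order; an 'attack' event at a tick preempts the next step."""
    queue = deque(steps)
    log = []
    tick = 0
    while queue:
        if events.get(tick) == "attack":
            # Handle the emergency first; the pending step stays queued.
            log.append("flee or fight back")
        else:
            log.append(queue.popleft())
        tick += 1
    return log


# A monster attack arrives at tick 2, just before "chop wood".
for action in run_plan(PLAN, {2: "attack"}):
    print(action)
```

The interrupted step is not dropped: after the emergency is handled, the plan resumes where it left off.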

Cost and Efficiency Comparison

Item | SIMA 1 | SIMA 2
Model Size | ~1B parameters | ~20B parameters (Gemini)
Inference Cost | Low | Medium
Training Cost | Medium | High (but more efficient)
Time to Learn New Game | 6-8 weeks | 1-2 weeks

Efficiency Boost: 3-4x
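
The source does not say how the 3-4x figure was computed. One plausible reading, assuming it comes from the "Time to Learn New Game" row, is a comparison against SIMA 2's slower bound of 2 weeks; comparing against its faster bound of 1 week would give a larger ratio. The variable names below are illustrative.

```python
sima1_weeks = (6, 8)  # time to learn a new game, SIMA 1
sima2_weeks = (1, 2)  # time to learn a new game, SIMA 2

# Conservative reading: divide by SIMA 2's slowest time (2 weeks).
conservative = tuple(w / max(sima2_weeks) for w in sima1_weeks)  # (3.0, 4.0)
# Optimistic reading: divide by SIMA 2's fastest time (1 week).
optimistic = tuple(w / min(sima2_weeks) for w in sima1_weeks)    # (6.0, 8.0)

print(f"Conservative speed-up: {conservative[0]:.0f}-{conservative[1]:.0f}x")
print(f"Optimistic speed-up:   {optimistic[0]:.0f}-{optimistic[1]:.0f}x")
```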

Limitations Comparison

SIMA 1's Limitations

  • ❌ Weak generalization
  • ❌ Limited language understanding
  • ❌ Poor planning ability
  • ❌ High data dependency

SIMA 2's Limitations

  • ❌ Success rate still not high enough (65% means 35% failure)
  • ❌ Can't handle very fast-paced games (FPS requiring millisecond reactions)
  • ❌ Depends on clear visuals (struggles with blur, fog, glare)
  • ❌ Cost still very high (~$0.10-$1.00 per minute)
  • ❌ Ethics and safety issues unresolved
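
Two of these limits can be put in numbers. The back-of-the-envelope sketch below uses the ~150 ms/action figure from the metrics table and the per-minute cost estimate above; the 60 fps frame rate is an assumption chosen for illustration.

```python
ACTION_LATENCY_MS = 150            # from the Core Metrics table
FPS = 60                           # assumed frame rate of a fast-paced game
COST_PER_MINUTE_USD = (0.10, 1.00)

frame_ms = 1000 / FPS                              # duration of one frame
frames_per_action = ACTION_LATENCY_MS / frame_ms   # frames that pass per decision
actions_per_second = 1000 / ACTION_LATENCY_MS
cost_per_hour = tuple(c * 60 for c in COST_PER_MINUTE_USD)

print(f"One frame lasts ~{frame_ms:.1f} ms")
print(f"~{frames_per_action:.0f} frames pass per action")
print(f"~{actions_per_second:.1f} actions per second")
print(f"Cost per hour: ${cost_per_hour[0]:.0f}-${cost_per_hour[1]:.0f}")
```

At this latency the agent acts on information that is about nine frames old, which helps explain why games requiring millisecond reactions are out of reach, and an hour of play would cost roughly $6 to $60.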

Future Outlook

SIMA 1's Future

Status: "Retired," no longer actively developed

Legacy: Proved feasibility of "general game AI"

SIMA 2's Future

Short-term (1-2 years):

  • Expand game coverage (9 → 50+ games)
  • Improve success rate (65% → 80%+)
  • Reduce costs (optimize inference)

Mid-term (3-5 years):

  • Possible limited Beta (API form)
  • Integration into game dev tools (automated testing)
  • Explore "AI co-play" scenarios

Long-term (5-10 years):

  • Transfer to real robots
  • General computer operation AI
  • Part of embodied AGI

Summary: Three Key Takeaways

  1. SIMA 2 is not just a "better SIMA 1" - The architecture shifts from a specialized model to general-purpose Gemini, and training shifts from manual annotation to a self-improvement loop. This is a paradigm change.
  2. A 65% success rate is a milestone, not an endpoint - It marks the leap from "lab toy" to "practical tool," but there is still some distance to truly reliable.
  3. Focus on the long-term value, from games to robots - SIMA 2's true significance isn't "playing games" but exploring a path toward general embodied intelligence.

Related Resources