Judged by the numbers alone, SIMA 2's task success rate rising from SIMA 1's 31% to 65% looks like "performance doubled." In AI research, though, a jump of this size often signals a qualitative change rather than a quantitative one: the move from "lab demonstration" to "approaching usable."
Core Metrics Comparison
| Metric | SIMA 1 | SIMA 2 | Improvement |
|---|---|---|---|
| Cross-game Task Success Rate | 31% | 65% | +34 points (+110% relative) |
| Complex Instruction Understanding | Basic (single-step) | Advanced (multi-step + conditional) | Qualitative leap |
| Unseen Game Generalization | Moderate | Strong | Significant |
| Training Data Scale | ~5,000 hours | ~50,000 hours (incl. self-generated) | +900% |
| Inference Speed | ~200 ms/action | ~150 ms/action | 25% lower latency (~33% more actions per second) |
| Environment Adaptation | Requires pre-training | Zero-shot adaptation | Revolutionary |
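The Improvement column mixes several kinds of change, so it can help to recompute the numeric rows. A minimal sketch using only the figures from the table; the helper name `relative_change` is illustrative:

```python
def relative_change(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures taken from the comparison table.
success_gain = relative_change(31, 65)               # task success rate (%)
data_gain = relative_change(5_000, 50_000)           # training hours
latency_change = relative_change(200, 150)           # ms per action (negative = faster)
throughput_gain = relative_change(1 / 200, 1 / 150)  # actions per ms

print(f"Success rate:  {success_gain:+.0f}%")
print(f"Training data: {data_gain:+.0f}%")
print(f"Latency:       {latency_change:+.0f}%")
print(f"Throughput:    {throughput_gain:+.0f}%")
```

Note that a 25% drop in latency corresponds to roughly a 33% gain in actions per second, so the two figures describe the same change from different angles.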
What Does "31% vs 65%" Mean?
- SIMA 1's 31%: Out of 10 tasks, about 3 complete successfully. The other 7 fail (the agent gets lost, picks the wrong tool, or gets stuck).
- SIMA 2's 65%: Out of 10 tasks, about 6-7 complete successfully. Only 3-4 fail or require human intervention.
Why This Is a Qualitative Change:
A 31% success rate is at the "occasionally gets lucky" level; 65% is at the "reliable most of the time" level. This is the threshold between "tech demo" and "practical tool."
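One way to see why the gap matters more than the raw ratio suggests is to look at tasks chained back to back. The sketch below assumes each task succeeds independently at the quoted rate; that independence is a simplifying assumption for illustration, not something reported for either agent:

```python
def chain_success(per_task_rate: float, steps: int) -> float:
    """Probability that `steps` independent tasks all succeed."""
    return per_task_rate ** steps


for steps in (1, 3, 5):
    sima1 = chain_success(0.31, steps)
    sima2 = chain_success(0.65, steps)
    print(f"{steps} task(s) in a row: SIMA 1 {sima1:.1%}, SIMA 2 {sima2:.1%}")
```

Under this assumption, three tasks in a row succeed roughly 3% of the time at a 31% per-task rate and roughly 27% of the time at 65%, so the gap widens quickly as sequences get longer.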
Architecture Comparison
SIMA 1's Architecture (2024)
Design Philosophy: Specialized vision encoder + behavior policy network
Core Components:
- Vision Encoder: Custom-trained CNN processing game screenshots
- Language Encoder: BERT-like transformer for instruction understanding
- Policy Network: Reinforcement learning model for decision-making
Limitations:
- ❌ Vision encoder requires large amounts of game-specific training
- ❌ Limited generalization (new games need retraining)
- ❌ Weak language understanding (only simple instructions)
SIMA 2's Architecture (2025)
Design Philosophy: Direct integration of Gemini 2.5 Flash Lite
Core Innovation: Replacing specialized encoders with Gemini
Example Difference:
Instruction: "If you see iron ore, mine it; otherwise continue exploring"
- SIMA 1: Identifies keywords but doesn't understand conditional logic → May execute incorrectly
- SIMA 2: Fully understands conditional logic → Scans for iron ore → Executes correctly based on condition
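To make the conditional concrete, the instruction can be thought of as a small rule with a condition, a then-branch, and an else-branch. This is purely an illustrative sketch: `ConditionalInstruction`, `choose_action`, and the set-of-labels observation are hypothetical and say nothing about how SIMA 2 represents instructions internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionalInstruction:
    """Hypothetical structured form of 'if X is visible, do A; otherwise do B'."""
    trigger_object: str
    then_action: str
    else_action: str


def choose_action(instruction: ConditionalInstruction, visible_objects: set[str]) -> str:
    """Pick the branch that matches what is currently visible."""
    if instruction.trigger_object in visible_objects:
        return instruction.then_action
    return instruction.else_action


rule = ConditionalInstruction("iron ore", "mine", "explore")
print(choose_action(rule, {"tree", "iron ore"}))  # mine
print(choose_action(rule, {"tree", "rock"}))      # explore
```

A keyword matcher would notice "iron ore" and "mine" in the sentence and might act on them unconditionally; the point of the structured form is that the action depends on the current observation.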
Key Upgrades:
- Gemini Integration: Inherent vision + language + reasoning capabilities
- Task Planning Module: Can generate multi-step action sequences instead of single actions
- Zero-Shot Adaptation: Can play new games directly without additional training
Training Method Comparison
SIMA 1's Training Process
- Phase 1: Human demonstration collection (5,000+ hours of gameplay, manually annotated)
- Phase 2: Imitation learning (AI watches and copies human players)
- Phase 3: Reinforcement learning fine-tuning
Problems: High cost, limited data, poor generalization
SIMA 2's Training Process
- Phase 1: Inherit Gemini's knowledge (already knows common objects and behaviors)
- Phase 2: A small set of human demonstrations (~1,000 hours, one-fifth of SIMA 1's)
- Phase 3: Self-improvement loop (AI generates tasks → attempts → learns → generates more tasks)
Analogy:
- SIMA 1: Teacher gives 100 example problems, student memorizes
- SIMA 2: Teacher gives 10 examples, student understands principles and creates 1,000 practice problems
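The self-improvement loop in Phase 3 (generate tasks, attempt them, learn, generate more) can be sketched as a toy simulation. Everything here is an assumption made for illustration: the starting skill, the update rule, and the function name `self_improvement_loop` are invented, and success is a weighted coin flip rather than real gameplay.

```python
import random


def self_improvement_loop(rounds: int, tasks_per_round: int, seed: int = 0) -> list[dict]:
    """Toy loop: propose tasks, attempt them, keep successes as new training data."""
    rng = random.Random(seed)
    skill = 0.3       # assumed starting success probability (illustrative only)
    experience = []   # successful attempts, reused as training data

    for round_id in range(rounds):
        # 1. Generate tasks (here just labelled placeholders).
        tasks = [f"round{round_id}-task{i}" for i in range(tasks_per_round)]
        # 2. Attempt each task; success is a coin flip weighted by current skill.
        successes = [t for t in tasks if rng.random() < skill]
        # 3. Learn: each kept success nudges skill upward, capped below 1.0.
        experience.extend({"task": t, "round": round_id} for t in successes)
        skill = min(0.95, skill + 0.01 * len(successes))
    return experience


data = self_improvement_loop(rounds=5, tasks_per_round=20)
print(f"Collected {len(data)} self-generated examples")
```

The design point is the feedback: data produced in one round raises the success rate of the next, which is how a small seed of human demonstrations can grow into a much larger training set.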
Generalization Capability Comparison
SIMA 1's Generalization
Test: Trained on Valheim → Test on No Man's Sky
Results:
- ✅ Similar tasks (both "chopping wood"): ~40% success
- ❌ Different tasks ("pilot spaceship"): <10% success
- ❌ Completely new game (ASKA): fails almost entirely
SIMA 2's Generalization
Test: Trained on Goat Simulator 3 → Test on ASKA (Viking survival game)
Results:
- ✅ Success rate 65% (across game genres)
- ✅ Zero-shot adaptation (no additional training)
- ✅ Understands concepts not in Goat Simulator 3 (like "crafting tools")
Key Insight:
SIMA 1: Learned "in Valheim, see this pixel pattern → press this key"
SIMA 2: Learned "trees are harvestable resources → find tool → aim → execute chopping action"
Practical Application Scenarios
What SIMA 1 Can Do
- ✅ Repetitive tasks in single games (auto-harvest in Valheim)
- ❌ Limited to trained tasks only
- ❌ Doesn't work in different games
What SIMA 2 Can Do
- ✅ Universal tasks across games ("build shelter" in any survival game)
- ✅ Complex multi-step planning ("Build boat → sail to shore → chop wood → build house")
- ✅ Handle emergencies (attacked by monsters → flee or fight back)
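Multi-step planning combined with emergency handling can be pictured as a queue of steps that an unexpected event may preempt. This is a hedged sketch, not SIMA 2's planner: `run_plan`, the tick-based event map, and the log format are all hypothetical.

```python
from collections import deque


def run_plan(steps: list[str], events: dict[int, str]) -> list[str]:
    """Execute steps in order; an event at tick t preempts the plan for one tick."""
    queue = deque(steps)
    log = []
    tick = 0
    while queue:
        if tick in events:
            # Emergency: handle it first; the pending step stays queued.
            log.append(f"handle:{events[tick]}")
        else:
            log.append(f"do:{queue.popleft()}")
        tick += 1
    return log


plan = ["build boat", "sail to shore", "chop wood", "build house"]
log = run_plan(plan, events={2: "monster attack"})
print(log)
```

In this run the attack interrupts the plan after "sail to shore", and the agent then resumes with "chop wood" rather than starting over.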
Cost and Efficiency Comparison
| Item | SIMA 1 | SIMA 2 |
|---|---|---|
| Model Size | ~1B parameters | ~20B parameters (Gemini) |
| Inference Cost | Low | Medium |
| Training Cost | Medium | High (but more efficient) |
| Time to Learn New Game | 6-8 weeks | 1-2 weeks |
Efficiency Boost: 3-4x faster to learn a new game (a conservative figure that uses SIMA 2's 2-week upper bound)
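The speedup depends on which ends of the two ranges are compared. A small sketch using only the "Time to Learn New Game" figures from the table; the variable names are illustrative:

```python
sima1_weeks = (6, 8)  # time to learn a new game, from the table
sima2_weeks = (1, 2)

# Conservative: divide by SIMA 2's slowest case (2 weeks).
conservative = tuple(w / sima2_weeks[1] for w in sima1_weeks)
# Optimistic: divide by SIMA 2's fastest case (1 week).
optimistic = tuple(w / sima2_weeks[0] for w in sima1_weeks)

print(f"Conservative speedup: {conservative[0]:.0f}x to {conservative[1]:.0f}x")
print(f"Optimistic speedup:   {optimistic[0]:.0f}x to {optimistic[1]:.0f}x")
```

The quoted 3-4x matches the conservative reading; if SIMA 2 needs only one week, the same figures imply 6-8x.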
Limitations Comparison
SIMA 1's Limitations
- ❌ Weak generalization
- ❌ Limited language understanding
- ❌ Poor planning ability
- ❌ High data dependency
SIMA 2's Limitations
- ❌ Success rate still not high enough (65% means 35% failure)
- ❌ Can't handle very fast-paced games (FPS requiring millisecond reactions)
- ❌ Depends on clear visuals (struggles with blur, fog, glare)
- ❌ Cost still very high (~$0.10-$1.00 per minute)
- ❌ Ethics and safety issues unresolved
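To put the per-minute cost estimate in perspective, it can be scaled to longer sessions. A minimal sketch that assumes the quoted range of roughly $0.10 to $1.00 per minute holds constant over time; `session_cost` is an illustrative helper:

```python
def session_cost(rate_per_minute: float, minutes: float) -> float:
    """Cost in dollars of running the agent for `minutes` at a flat per-minute rate."""
    return rate_per_minute * minutes


low_rate, high_rate = 0.10, 1.00  # dollars per minute, from the estimate above

for label, minutes in (("1 hour", 60), ("8 hours", 480)):
    low = session_cost(low_rate, minutes)
    high = session_cost(high_rate, minutes)
    print(f"{label}: ${low:,.2f} to ${high:,.2f}")
```

At these rates a single hour of play costs somewhere between about $6 and $60, which is why cost reduction appears among the short-term goals.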
Future Outlook
SIMA 1's Future
Status: "Retired," no longer actively developed
Legacy: Proved feasibility of "general game AI"
SIMA 2's Future
Short-term (1-2 years):
- Expand game coverage (9 → 50+ games)
- Improve success rate (65% → 80%+)
- Reduce costs (optimize inference)
Mid-term (3-5 years):
- Possible limited Beta (API form)
- Integration into game dev tools (automated testing)
- Explore "AI co-play" scenarios
Long-term (5-10 years):
- Transfer to real robots
- General computer operation AI
- Part of embodied AGI
Summary: Three Key Takeaways
- SIMA 2 is not just "a better SIMA 1" - The architecture shifted from a specialized model to general-purpose Gemini, and training shifted from manual annotation to a self-improvement loop. This is a paradigm change.
- A 65% success rate is a milestone, not an endpoint - It marks the leap from "lab toy" to "practical tool," but the system is still some distance from truly reliable.
- Focus on the long-term value, from games to robots - SIMA 2's true significance isn't "playing games" but exploring a path toward general embodied intelligence.