Technical Deep Dive

How SIMA 2 Works: From Pixels to Actions

Understanding the three-part architecture that enables
general-purpose game playing through vision and reasoning

Traditional game AI operates on predefined rules and scripts. SIMA 2 takes a fundamentally different approach: it learns by observation, reasons about its environment, and improves through experience—just like a human player would.

The breakthrough isn't just that SIMA 2 can play games. It's how it learns to play them. Unlike game-specific bots hardcoded for a single title, SIMA 2 can learn a wide range of 3D games from visual observation alone, understand natural language instructions, generalize skills across very different games, and improve autonomously without human supervision.

This represents a major leap toward general-purpose embodied AI—systems that can act intelligently in virtual and physical environments, not just process text.

The Three-Part Architecture: Perception, Reasoning, Action

SIMA 2's architecture can be understood as three interconnected systems:

1. Visual Perception System

Input: Raw screen pixels + optional language instructions
Output: Semantic understanding of game state

SIMA 2 doesn't have access to game code, internal state, or special APIs. It sees what you see: pixels on a screen. The visual perception system must:

  • Identify objects and entities: Recognize trees, buildings, characters, UI elements
  • Understand spatial relationships: Determine distances, directions, navigable terrain
  • Track temporal changes: Notice movement, state transitions, cause-and-effect
  • Parse UI information: Read health bars, inventory, quest markers

Key Innovation: SIMA 2 recognizes affordances, understanding not just what objects are but what can be done with them. A log might be fuel in one game, a building material in another, and a throwable weapon in a third.

The perception system is built on vision transformers trained on hundreds of thousands of hours of gameplay footage. Unlike traditional computer vision that looks for specific patterns, SIMA 2's perception is context-aware: it understands that a "tree" in Minecraft functions differently from a "tree" in Valheim, even though the two look similar.
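
To make the pixel-to-embedding step concrete, here is a minimal sketch that encodes individual frames with an off-the-shelf vision transformer. It uses torchvision's ViT-L/16 as a stand-in for the ViT-L/14 encoder described later; the preprocessing and weights are assumptions, not DeepMind's released code.

```python
# Minimal sketch: embed gameplay frames with a vision transformer.
# torchvision's ViT-L/16 stands in for the ViT-L/14 encoder the article
# describes; SIMA 2's actual weights and preprocessing are not public.
# Note: the pretrained weights (~1.2 GB) are downloaded on first use.
import torch
from torchvision.models import vit_l_16, ViT_L_16_Weights

weights = ViT_L_16_Weights.IMAGENET1K_V1
encoder = vit_l_16(weights=weights)
encoder.heads = torch.nn.Identity()   # drop the classifier, keep 1024-d features
encoder.eval()

preprocess = weights.transforms()     # resize/crop/normalize to 224x224

@torch.no_grad()
def embed_frame(frame: torch.Tensor) -> torch.Tensor:
    """frame: (3, H, W) uint8 screen capture -> (1024,) embedding."""
    x = preprocess(frame).unsqueeze(0)    # (1, 3, 224, 224)
    return encoder(x).squeeze(0)          # (1024,)

# Example: a fake 720p frame
frame = torch.randint(0, 256, (3, 720, 1280), dtype=torch.uint8)
print(embed_frame(frame).shape)           # torch.Size([1024])
```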

2. Reasoning Engine (Gemini 2.5 Flash Lite)

Input: Perceived game state + language instruction (if provided)
Output: High-level action plan

The reasoning layer is where SIMA 2's intelligence lives. This is powered by Gemini 2.5 Flash Lite, a lightweight version of Google's multimodal foundation model specifically optimized for low latency, efficient memory usage, and strong visual reasoning capabilities.

What Gemini does:

  1. Interprets visual scenes: "I'm in a forest. There's a cave entrance to my left. My health is low."
  2. Plans multi-step sequences: "To build a shelter, I need: wood → crafting table → walls → roof"
  3. Reasons about consequences: "If I attack this creature, I might die. Better to run."
  4. Handles ambiguity: When instructions are vague ("find food"), Gemini generates contextually appropriate sub-goals

Why Gemini specifically?

  • Multimodal grounding: Can connect visual observations to language concepts
  • Common-sense reasoning: Understands implicit goals (if health is low, seek healing)
  • Generalization: Trained on internet-scale data, can apply knowledge from one context to another
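
To make that pipeline concrete, the sketch below shows the general shape of a reasoning call: the latest frame plus an instruction go in, and a small JSON plan comes out. `call_multimodal_model`, the prompt wording, and the plan schema are all hypothetical placeholders, not the actual Gemini integration.

```python
# Hypothetical sketch of the reasoning step: frame + instruction -> plan.
# call_multimodal_model is a placeholder for a Gemini-style endpoint;
# the prompt wording and plan schema are illustrative assumptions.
import json
from dataclasses import dataclass

@dataclass
class PlanStep:
    action: str            # e.g. "navigate_to", "use_tool"
    target: str            # e.g. "nearest_tree"

PLANNER_PROMPT = """You control a game character.
Instruction: {instruction}
Describe the scene, then return a JSON list of steps,
each with "action" and "target" fields."""

def plan(frame_jpeg: bytes, instruction: str) -> list[PlanStep]:
    prompt = PLANNER_PROMPT.format(instruction=instruction)
    raw = call_multimodal_model(image=frame_jpeg, text=prompt)   # hypothetical API
    return [PlanStep(**step) for step in json.loads(raw)]

def call_multimodal_model(image: bytes, text: str) -> str:
    # Stub so the sketch runs without network access; a real agent would
    # send `image` and `text` to the reasoning model here.
    return json.dumps([
        {"action": "navigate_to", "target": "nearest_tree"},
        {"action": "use_tool", "target": "axe"},
        {"action": "collect", "target": "logs"},
    ])

if __name__ == "__main__":
    print(plan(frame_jpeg=b"", instruction="gather wood"))
```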

3. Action Space & Motor Control

Input: High-level plan from Gemini
Output: Low-level game controls (keyboard/mouse/controller)

The action system translates Gemini's abstract intentions into precise game inputs. This involves a hierarchical action space:

  • High-level: "Navigate to that tree"
  • Mid-level: "Walk forward, turn left, avoid obstacle"
  • Low-level: "Press W key, move mouse 15° left"

SIMA 2 learns reusable skills like "walk to point," "interact with object," and "aim at target" that work across games with similar control schemes. A "jump" in one 3D game transfers to another, even if the exact physics differ slightly.
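
A minimal sketch of that hierarchy is shown below, assuming an illustrative input-event encoding and a rough walking-speed constant (neither is from DeepMind's implementation).

```python
# Sketch of the hierarchical action space: one high-level intent expands
# into mid-level steps and then into concrete input events. The event
# names and the keyboard/mouse encoding are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class InputEvent:
    kind: str     # "key" or "mouse"
    value: str    # key name, or "dx,dy" for a mouse delta
    hold_ms: int  # how long to hold the input

def navigate_to(bearing_deg: float, distance_m: float) -> list[InputEvent]:
    """High-level 'navigate to that tree' -> low-level key/mouse events."""
    events: list[InputEvent] = []
    # Mid-level: turn toward the target in small camera increments.
    step = 15.0
    turns = int(abs(bearing_deg) // step)
    for _ in range(turns):
        dx = step if bearing_deg > 0 else -step
        events.append(InputEvent("mouse", f"{dx},0", hold_ms=0))
    # Mid-level: walk forward for roughly the right duration
    # (assumes ~5 m/s walking speed, i.e. 200 ms per meter).
    walk_ms = int(distance_m * 200)
    events.append(InputEvent("key", "W", hold_ms=walk_ms))
    return events

print(navigate_to(bearing_deg=45.0, distance_m=10.0))
```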

The Self-Improvement Training Loop: Learning Without Labels

Unlike SIMA 1, which required human-labeled training data for every game, SIMA 2 can teach itself. This happens in three phases:

Phase 1: Human Demonstration → Gemini Labels

Problem: Collecting human gameplay labels is expensive (requires annotators to describe every action).
Solution: Use Gemini to automatically generate labels from raw gameplay videos.

Process:

  1. A human plays a game (e.g., Valheim) while SIMA 2 records screen + controls
  2. Gemini watches the video and infers goals: "The player is gathering wood to build a house"
  3. Gemini generates natural language annotations: "Player walked to tree → used axe → collected logs"
  4. These auto-generated labels become training data

Gemini's world knowledge lets it understand intent from context. If it sees someone chopping trees near a cleared area, it can infer "preparing to build."
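
A minimal sketch of this labeling pipeline, with `describe_segment` as a hypothetical stand-in for the Gemini call and an assumed data layout:

```python
# Sketch of Phase 1 auto-labeling: split a recorded session into segments
# and ask a multimodal model to caption each one. describe_segment is a
# hypothetical stand-in for the Gemini call; the data layout is assumed.
from dataclasses import dataclass

@dataclass
class Segment:
    frames: list          # screen captures for this clip
    controls: list        # recorded key/mouse inputs

@dataclass
class LabeledSegment:
    segment: Segment
    annotation: str       # e.g. "Player walked to tree -> used axe -> collected logs"

def describe_segment(segment: Segment) -> str:
    # Placeholder: a real pipeline would send the clip to the reasoning
    # model and get back an inferred goal/annotation.
    return "Player walked to tree -> used axe -> collected logs"

def auto_label(session: list[Segment]) -> list[LabeledSegment]:
    """Turn raw (frames, controls) recordings into language-labeled training data."""
    return [LabeledSegment(seg, describe_segment(seg)) for seg in session]

demo = [Segment(frames=[], controls=[]) for _ in range(3)]
print(len(auto_label(demo)), "labeled segments")
```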

Phase 2: Self-Play with Gemini Feedback

Once SIMA 2 has basic competence from Phase 1, it enters autonomous improvement mode:

  1. SIMA 2 attempts tasks: "Build a shelter in Valheim"
  2. Gemini evaluates success: Watches the attempt, provides feedback
    • "Task completed successfully"
    • "Failed: structure collapsed (walls placed incorrectly)"
  3. SIMA 2 updates policy: Reinforcement learning adjusts behavior based on success/failure
  4. Repeat with increasing difficulty: Tasks get more complex as SIMA 2 improves

Result: SIMA 2 improved from 31% task success rate (SIMA 1) to 65% on held-out games through this self-improvement loop.
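
The loop itself is simple in outline. The sketch below captures its structure with placeholder components: `evaluate_attempt` stands in for the Gemini judge and `Policy.update` for the reinforcement-learning step, neither of which is public.

```python
# Sketch of the Phase 2 self-improvement loop: attempt a task, let a
# judge model score the attempt, and update the policy from that signal.
# evaluate_attempt and Policy.update are hypothetical placeholders.
import random

class Policy:
    def act(self, task: str) -> str:
        return f"trajectory for '{task}'"

    def update(self, trajectory: str, reward: float) -> None:
        pass  # e.g. a policy-gradient update on the recorded trajectory

def evaluate_attempt(task: str, trajectory: str) -> tuple[bool, str]:
    # Placeholder for the Gemini judge: returns a success flag + feedback text.
    ok = random.random() < 0.5
    return ok, "Task completed" if ok else "Failed: structure collapsed"

def self_improve(policy: Policy, tasks: list[str], episodes: int = 100) -> None:
    for _ in range(episodes):
        task = random.choice(tasks)
        trajectory = policy.act(task)
        success, feedback = evaluate_attempt(task, trajectory)
        policy.update(trajectory, reward=1.0 if success else 0.0)

self_improve(Policy(), ["Build a shelter in Valheim", "Gather 10 wood"])
```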

Phase 3: Generalization to New Games

The real test: Can SIMA 2 play games it's never seen?

Approaches:

  • Zero-shot transfer: No training on new game at all
  • Few-shot transfer: Watch 5-10 minutes of gameplay, then play
  • Active learning: Request demonstrations only when stuck

What SIMA 2 learns to transfer:

  • Core mechanics (jumping, attacking, inventory management)
  • Physics intuitions (gravity, collision, momentum)
  • Common UI patterns (health bars, minimaps, dialogue boxes)
  • High-level strategies (resource gathering, base building, exploration)

Example: SIMA 2 trained on Valheim can play ASKA (a similar survival game) at a 40% success rate despite never having seen it before, because it transfers concepts like "find food," "craft shelter," and "avoid enemies."
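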

Technical Deep Dive: Model Architecture

For developers and researchers, here's the full technical stack:

Vision Encoder

  • Architecture: ViT-L/14 (Vision Transformer, Large, 14x14 patches)
  • Input resolution: 224x224 pixels (center crop of 720p gameplay)
  • Frame rate: 10 FPS (sufficient for most games)
  • Features: 1024-dimensional embedding per frame
  • Temporal modeling: 3D convolutions across 16-frame windows
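
As a rough illustration of the temporal modeling step, the sketch below applies a 3D convolution over a 16-frame window of patch features. Only the shapes follow the spec above; the layer layout is an assumption.

```python
# Sketch of the temporal block: a 3D convolution over a 16-frame window of
# ViT patch features (a 16x16 patch grid of 1024 channels for a 224px, /14
# model). The exact layer layout is an assumption, not DeepMind's design.
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        # Mix information across time (3 frames) and space (3x3 patches).
        self.conv = nn.Conv3d(dim, dim, kernel_size=(3, 3, 3), padding=1, groups=8)
        self.norm = nn.GroupNorm(8, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels=1024, frames=16, patches_h=16, patches_w=16)
        return torch.relu(self.norm(self.conv(x))) + x

feats = torch.randn(1, 1024, 16, 16, 16)   # one 16-frame window of patch features
print(TemporalBlock()(feats).shape)        # torch.Size([1, 1024, 16, 16, 16])
```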

Reasoning Module (Gemini 2.5 Flash Lite)

  • Parameters: ~10B (exact number not disclosed by DeepMind)
  • Context window: 32K tokens (visual + language)
  • Quantization: 4-bit for efficient inference
  • Latency: 80-120ms per decision
  • Multimodal fusion: Cross-attention between vision and language tokens

Action Decoder

  • Architecture: Recurrent policy network (LSTM + MLP)
  • Output space:
    • Discrete actions: 18 categories (move, jump, attack, interact, etc.)
    • Continuous parameters: Mouse delta (x, y), camera angles
  • Action sampling: Temperature-scaled softmax for exploration
  • Frequency: 10 Hz (one action per frame)
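
A minimal PyTorch sketch of such a decoder, with sizes mirroring the spec above but wiring that is assumed rather than documented:

```python
# Sketch of the action decoder: an LSTM over fused features followed by a
# discrete action head (temperature-scaled softmax) and a continuous head
# for mouse deltas. Sizes mirror the spec above; the wiring is an assumption.
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    def __init__(self, feat_dim: int = 1024, hidden: int = 512, n_actions: int = 18):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.discrete_head = nn.Linear(hidden, n_actions)   # move, jump, attack, ...
        self.continuous_head = nn.Linear(hidden, 2)         # mouse delta (dx, dy)

    def forward(self, feats, state=None, temperature: float = 1.0):
        # feats: (batch, time, feat_dim) fused vision/language features
        out, state = self.lstm(feats, state)
        logits = self.discrete_head(out) / temperature            # temperature controls exploration
        probs = torch.softmax(logits, dim=-1)
        action = torch.multinomial(probs[:, -1], num_samples=1)   # sample latest step
        mouse = torch.tanh(self.continuous_head(out[:, -1]))      # bounded deltas
        return action, mouse, state

decoder = ActionDecoder()
feats = torch.randn(1, 16, 1024)            # 16 fused frames at 10 Hz
action, mouse, _ = decoder(feats, temperature=0.8)
print(action.item(), mouse)
```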

Training Details

  • Total training compute: ~500,000 TPU-v5 hours
  • Games in training set: 9 commercial titles + 12 research environments
  • Gameplay hours: 400,000+ hours of demonstrations
  • Auto-labeled by Gemini: 350,000 hours
  • Human-labeled: 50,000 hours (validation set)
  • Self-play hours: 1,000,000+ (autonomous practice)

Key Technical Innovations

1. Spatiotemporal Attention for Games

Traditional vision models process each frame independently. SIMA 2 uses 3D attention that tracks object permanence, recognizes patterns across time, and anticipates future states.

Impact: 30% better at tasks requiring memory (e.g., "return to base")
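
One common way to realize this kind of 3D attention is divided space-time attention: attend across patches within each frame, then across frames for each patch. The sketch below shows that pattern; SIMA 2's exact design has not been published, so treat the layer as an assumption.

```python
# Sketch of divided space-time attention over a clip of patch tokens:
# attention within each frame (space) and across frames per patch (time).
# This is one standard realization of "3D attention", not SIMA 2's code.
import torch
import torch.nn as nn

class SpaceTimeAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)                      # attend across patches per frame
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(b, t, p, d)
        m = x.permute(0, 2, 1, 3).reshape(b * p, t, d)  # attend across frames per patch
        m, _ = self.temporal(m, m, m)
        x = x + m.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return x

clip = torch.randn(1, 16, 64, 256)          # 16 frames x 64 patches x 256-d tokens
print(SpaceTimeAttention()(clip).shape)     # torch.Size([1, 16, 64, 256])
```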

2. Language-Conditioned Visual Grounding

When given an instruction like "find the tallest tree," SIMA 2 must understand "tallest" (comparative reasoning), identify the multiple trees in view, estimate their heights from visual cues, and select the tallest one.

Implementation: Cross-modal attention between text tokens and visual patches. Language primes which visual features to attend to.
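
A minimal sketch of that cross-modal step, with assumed dimensions and pre-embedded tokens:

```python
# Sketch of language-conditioned grounding: visual patch tokens attend to
# the instruction's text tokens, so language "primes" which patches matter.
# Dimensions and tokenization are illustrative assumptions.
import torch
import torch.nn as nn

class LanguageGrounding(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, dim) visual tokens for the current frame
        # text:    (batch, n_tokens, dim) embedded instruction, e.g. "find the tallest tree"
        grounded, _ = self.cross_attn(query=patches, key=text, value=text)
        return patches + grounded        # residual: language-modulated visual features

patches = torch.randn(1, 256, 256)       # 16x16 patch grid, 256-d each
text = torch.randn(1, 6, 256)            # 6 instruction tokens
print(LanguageGrounding()(patches, text).shape)   # torch.Size([1, 256, 256])
```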

3. Hierarchical Planning with Backtracking

Complex tasks require breaking down goals:

  • High-level: "Build a house"
  • Mid-level: "Gather 50 wood, 30 stone, craft walls, place foundation..."
  • Low-level: "Walk to tree, use axe, collect wood"

Innovation: SIMA 2 can backtrack and replan when a subtask fails. If no wood nearby, it switches to "explore map" mode rather than getting stuck.
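
The sketch below shows the backtracking pattern in miniature: a goal expands into subtasks, and a failed leaf action triggers a recovery behavior before one retry. The plan library, task names, and toy world state are illustrative assumptions.

```python
# Sketch of hierarchical planning with backtracking: a goal expands into
# subtasks, and when a leaf action fails the agent switches to a recovery
# behavior ("explore map") before retrying, rather than getting stuck.
explored = False

def decompose(goal: str) -> list[str]:
    plans = {
        "build a house": ["gather 50 wood", "gather 30 stone", "craft walls", "place foundation"],
        "gather 50 wood": ["walk to tree", "use axe", "collect wood"],
    }
    return plans.get(goal, [goal])          # leaf tasks decompose to themselves

def execute(task: str) -> bool:
    # Placeholder low-level controller: "walk to tree" only works after exploring.
    global explored
    if task == "explore map":
        explored = True
    return task != "walk to tree" or explored

def run(goal: str, depth: int = 0) -> bool:
    if depth > 5:                           # give up rather than recurse forever
        return False
    for subtask in decompose(goal):
        if subtask == goal:                 # leaf task: act directly
            if execute(subtask):
                continue
            # Backtrack: replan via a recovery behavior, then retry once.
            if run("explore map", depth + 1) and execute(subtask):
                continue
            return False
        if not run(subtask, depth + 1):
            return False
    return True

print(run("build a house"))                 # True: succeeds after exploring
```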

4. Game-Agnostic Action Space

Different games use different controls. SIMA 2 learns a canonical action space that maps to game-specific inputs:

Canonical Action | Minecraft   | Valheim     | No Man's Sky
-----------------|-------------|-------------|--------------------------
Move forward     | W key       | W key       | W key
Jump             | Space       | Space       | Melee button (contextual)
Use tool         | Left click  | Left click  | E key
Interact         | Right click | E key       | E key

Advantage: Skills learned in one game automatically transfer to similar games with minimal adaptation.
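
In code, this amounts to keeping the learned policy in canonical-action terms and pushing game specifics into a small binding table, roughly as sketched below (bindings abridged from the table above; key names are illustrative).

```python
# Sketch of the game-agnostic action layer: learned skills emit canonical
# actions, and a thin per-game binding table translates them into concrete
# inputs. Adding a game means adding bindings, not relearning skills.
BINDINGS = {
    "minecraft": {"move_forward": "W", "jump": "Space",
                  "use_tool": "left_click", "interact": "right_click"},
    "valheim":   {"move_forward": "W", "jump": "Space",
                  "use_tool": "left_click", "interact": "E"},
}

def to_game_input(game: str, canonical_action: str) -> str:
    """Translate a learned canonical action into this game's concrete input."""
    return BINDINGS[game][canonical_action]

# The same learned skill ("interact") maps to different keys per game.
print(to_game_input("minecraft", "interact"))   # right_click
print(to_game_input("valheim", "interact"))     # E
```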

Comparing SIMA 2 to Other AI Game Players

System             | Approach                       | Games             | Generalization
-------------------|--------------------------------|-------------------|---------------
OpenAI VPT         | Imitation learning from videos | Minecraft only    | Single game
DeepMind AlphaStar | Reinforcement learning         | StarCraft II only | Single game
NVIDIA Voyager     | LLM-generated code             | Minecraft only    | Single game
Meta CICERO        | Language + planning            | Diplomacy only    | Single game
SIMA 2             | Vision + LLM + self-play       | 21+ games         | Cross-game

Key differentiator: All prior systems are game-specific. SIMA 2 is the first to demonstrate true generalization across diverse 3D games without per-game training.

Performance Benchmarks

Task Success Rate (New Games, No Training)

  • SIMA 1: 31% average across held-out games
  • SIMA 2: 65% average (+34 percentage points)
  • Best case (Valheim → ASKA): 78%
  • Worst case (Goat Simulator → Minecraft): 42%

Reasoning Capabilities

  • Multi-step planning: 85% success on 5-step task chains
  • Language grounding: 91% accuracy on "find [object]" tasks
  • Error recovery: 68% success rate at recovering from failed actions (vs 23% for SIMA 1)

Efficiency Metrics

  • Sample efficiency: Requires roughly one-tenth the demonstration data of SIMA 1
  • Inference speed: 10 FPS (100 ms/frame) on a consumer GPU
  • Memory footprint: 8GB VRAM (vs 24GB for SIMA 1)

Limitations and Failure Modes

SIMA 2 is impressive, but not perfect. Current limitations include:

1. Struggles with Precise Timing

Games requiring pixel-perfect jumps or frame-perfect combos (e.g., Cuphead, Dark Souls) are still challenging. SIMA 2's 10 FPS perception and 100ms latency make twitch-reflex gameplay difficult.

2. Limited Adversarial Robustness

Against skilled human players in PvP, SIMA 2 loses ~80% of matches. It can't yet match human strategic depth in competitive scenarios.

3. Text-Heavy Games

Games with complex dialogue systems or text-based puzzles (RPGs, visual novels) are outside SIMA 2's current scope. It excels at physical interaction but struggles with reading comprehension.

4. Novel Mechanics

Completely unique game mechanics (e.g., time manipulation in Braid, portal physics in Portal) require more demonstration data. SIMA 2's prior knowledge helps but isn't sufficient for truly novel concepts.

5. Long-Horizon Planning

Tasks requiring 30+ steps (e.g., "beat the entire game") often fail because SIMA 2 loses track of the overall goal or gets stuck in local optima.

What This Means for the Future

SIMA 2's architecture represents a template for general-purpose embodied agents. The same principles apply to:

Physical Robotics

The perception → reasoning → action pipeline transfers directly to robot manipulation. Instead of game visuals, feed it camera data. Instead of keyboard inputs, output motor commands.

DeepMind has already demonstrated this with robot arms driven by SIMA 2's predecessor: the arms learned to pick and place objects in cluttered environments, follow natural language instructions, and generalize to new objects without per-object training.

Virtual Assistants for Complex Software

Imagine SIMA 2-like agents that can:

  • Navigate unfamiliar software interfaces ("edit this video")
  • Debug code by observing error behavior
  • Automate workflows by watching human demonstrations

Autonomous Vehicles

The same visual reasoning and planning needed for games applies to driving:

  • Perceive road conditions (vision)
  • Plan routes (reasoning)
  • Control steering/acceleration (action)

Timeline: DeepMind's internal roadmap suggests 3-5 years before SIMA descendants move from research to real-world applications.

How SIMA 2 Differs from ChatGPT/Claude

A common question: "Isn't this just ChatGPT playing games?"

Capability     | ChatGPT/Claude                    | SIMA 2
---------------|-----------------------------------|----------------------------------
Primary skill  | Language understanding/generation | Visual perception + motor control
Operates in    | Text conversations                | Interactive 3D environments
Learning style | Pre-training + fine-tuning        | Continuous self-improvement
Generalization | Across language tasks             | Across physical tasks
Real-time?     | No (async text)                   | Yes (10 FPS gameplay)

Think of it this way:

  • ChatGPT: A brilliant scholar who reads everything but lives in a library
  • SIMA 2: A skilled athlete who learns by doing, can't write essays but can navigate complex physical spaces

Frequently Asked Questions

Can I run SIMA 2 on my own computer?

Not yet. SIMA 2 is currently in limited research preview with no public release. When/if it becomes available, expect hardware requirements of: 8GB+ VRAM (RTX 3070 or better), 32GB RAM, and a modern CPU (Ryzen 5000 / Intel 11th gen+).

How does SIMA 2 handle games with randomness?

Stochastic environments (random loot, enemy spawns) are handled through the self-improvement loop. SIMA 2 plays thousands of episodes, learning robust strategies that work across different random seeds.

Can it play any game?

No. Requirements include: 3D first/third-person perspective (not 2D platformers), real-time gameplay (not turn-based), and visual feedback (not text-only).

Does it cheat by reading game memory?

Absolutely not. SIMA 2 only has access to screen pixels (what you see), keyboard/mouse inputs (what you control), and optional language instructions. No API access, no game code, no hidden information.

Could this lead to AGI?

SIMA 2 is a step toward embodied AGI—systems that can act intelligently in physical/virtual environments. It's not AGI yet (lacks long-term memory, can't transfer to completely different domains like language or math). But it's closer than pure language models like GPT-4.

Conclusion: Why Technical Architecture Matters

Understanding how SIMA 2 works isn't just academic curiosity. It reveals:

  1. The path to general AI: Not through larger language models, but through multimodal systems that perceive, reason, and act
  2. Real-world applications: This tech will power robots, autonomous systems, and virtual assistants within 5 years
  3. What's still missing: Human-level performance requires better long-term memory, faster reasoning, and deeper world models

For developers, SIMA 2's open problems are opportunities: improve sample efficiency, enhance adversarial robustness, extend to text-heavy games, and scale to longer horizons.

The next breakthrough likely comes from combining SIMA 2's embodied intelligence with large language models' abstract reasoning—creating agents that can both do and think.
