About Oasis AI Minecraft
The first playable, real-time, open-world AI model that generates gameplay on a frame-by-frame basis, developed by Decart AI.
🎮 Project Overview
Oasis AI Minecraft, developed by Decart AI in collaboration with Etched, represents a groundbreaking achievement in AI gaming technology. It’s an interactive video game generated end-to-end by a transformer on a frame-by-frame basis.
Unlike traditional games, Oasis takes in the user's keyboard and mouse input and generates gameplay in real time, internally simulating physics, game rules, and graphics. The model learned to let players move around, jump, pick up items, break blocks, and more, purely by watching gameplay footage.
This revolutionary project combines cutting-edge AI research with advanced hardware optimization, marking the first step towards foundational models that simulate more complex interactive worlds, potentially replacing classic game engines in an AI-driven future.
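For a concrete picture of what "conditioning on keyboard and mouse input" can look like in code, here is a minimal PyTorch sketch that flattens per-frame input state into an embedding a generative model could be conditioned on. The key list, dimensions, and layer layout are illustrative assumptions, not Decart's actual implementation.

```python
# Hypothetical sketch of per-frame action conditioning (PyTorch).
# Key names, dimensions, and the embedding layout are illustrative assumptions,
# not Decart's actual implementation.
import torch
import torch.nn as nn

KEYS = ["forward", "back", "left", "right", "jump", "attack", "use"]

class ActionEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # binary key states + 2D mouse delta -> conditioning embedding
        self.proj = nn.Linear(len(KEYS) + 2, embed_dim)

    def forward(self, keys_pressed: dict, mouse_dx: float, mouse_dy: float) -> torch.Tensor:
        key_vec = torch.tensor([float(keys_pressed.get(k, False)) for k in KEYS])
        mouse_vec = torch.tensor([mouse_dx, mouse_dy])
        action = torch.cat([key_vec, mouse_vec])   # shape: (len(KEYS) + 2,)
        return self.proj(action)                   # shape: (embed_dim,)

# Example: player holds forward and jump while turning slightly right
encoder = ActionEncoder()
cond = encoder({"forward": True, "jump": True}, mouse_dx=4.0, mouse_dy=0.0)
print(cond.shape)  # torch.Size([256])
```

In a sketch like this, the resulting embedding would be fed to the generator alongside the previous frames, so that each new frame reflects what the player just did.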
⚡ Technical Architecture
🔄 Building a New Interactive Architecture
We ran hundreds of architectural and data experiments to identify the best architecture for fast autoregressive interactive video generation. Unlike traditional bidirectional models, our architecture is specifically designed for real-time, frame-by-frame generation with user input conditioning.

Oasis's ViT + DiT architecture, featuring a Transformer-based variational autoencoder and accelerated spatiotemporal attention.
🎯 Key Features
💡 Technical Innovations
🧠 AI Model Technology
Oasis combines diffusion training with transformer models, drawing on advances in large language models (LLMs). The model generates video on a frame-by-frame basis, conditioned on the user's actions at each instant.
The architecture features a Transformer-based variational autoencoder (ViT VAE) that compresses each frame into a smaller latent representation, letting the diffusion model focus on higher-level characteristics, along with an accelerated axial, causal spatiotemporal attention mechanism.
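To make the interleaved spatial and causal temporal attention concrete, the sketch below shows one possible block structure in PyTorch: latent tokens first attend within their own frame, then each spatial position attends causally across earlier frames. Module names, shapes, and dimensions are assumptions for illustration, not the production Oasis architecture.

```python
# Minimal sketch of interleaved spatial + causal temporal attention (PyTorch).
# Shapes, dimensions, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, tokens_per_frame, dim) -- latent tokens from the VAE
        b, t, s, d = x.shape

        # Spatial attention: tokens attend within their own frame.
        xs = self.norm1(x.reshape(b * t, s, d))
        attn, _ = self.spatial_attn(xs, xs, xs)
        x = x + attn.reshape(b, t, s, d)

        # Temporal attention: each spatial position attends to earlier frames only
        # (causal mask), which is what allows autoregressive generation.
        xt = self.norm2(x.permute(0, 2, 1, 3).reshape(b * s, t, d))
        causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        attn, _ = self.temporal_attn(xt, xt, xt, attn_mask=causal_mask)
        x = x + attn.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

# Example: 2 clips, 8 frames, 256 latent tokens per frame, 256-dim features
block = SpatioTemporalBlock()
frames = torch.randn(2, 8, 256, 256)
print(block(frames).shape)  # torch.Size([2, 8, 256, 256])
```

Attending along one axis at a time (spatial within a frame, then temporal across frames) is what makes the attention "axial", and the causal mask along the time axis is what permits frame-by-frame generation.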
🎯 Diffusion Model Innovation
Unlike bidirectional models, Oasis generates frames autoregressively, conditioning each frame on game input. This lets users interact with the world in real time, rather than only rendering complete videos after the fact.
The model employs diffusion-forcing techniques and includes additional temporal attention layers interleaved between spatial attention layers to provide context from previous frames.
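A rough sketch of how such an action-conditioned, autoregressive sampling loop could be wired together is shown below. The `denoise` interface, context handling, and step count are assumptions for illustration and do not describe Decart's actual sampler.

```python
# Sketch of autoregressive, action-conditioned frame generation.
# The model interface (denoise step, context window, latent shape) is assumed
# for illustration and does not reflect Decart's implementation.
import torch

def generate_frame(model, past_latents, action_embedding, num_steps: int = 10):
    """Sample the next latent frame by iterative denoising, conditioned on
    previously generated frames and the player's current action."""
    latent = torch.randn_like(past_latents[-1])          # start from noise
    for step in reversed(range(num_steps)):
        # Each denoising step sees the causal context of earlier frames plus
        # the action embedding for the current instant.
        latent = model.denoise(
            latent, t=step, context=past_latents, action=action_embedding
        )
    return latent

def play_loop(model, vae, first_latent, read_player_action, max_frames: int = 1200):
    """Run the interactive loop: read input, generate a frame, decode, repeat."""
    latents = [first_latent]
    for _ in range(max_frames):
        action = read_player_action()                    # keyboard/mouse -> embedding
        next_latent = generate_frame(model, latents, action)
        frame = vae.decode(next_latent)                  # latent -> RGB frame to display
        yield frame
        latents.append(next_latent)                      # extend the causal context
```

The key property illustrated here is that the player's input for the current instant is available before the next frame is sampled, which is what distinguishes this loop from offline video generation.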
⚙️ Performance & Optimization
🚀 Current Capabilities
💫 Future Optimizations
📊 Sohu Enables 10x More Users
(Performance analysis using the Oasis architecture scaled up to 100B parameters)
- Real-time frame rate: 20 FPS
- Speed-up over current models: 100x
- Users served on Sohu: 10x more
🔮 Future Development
🎯 Current Challenges
🌟 Future Vision
The team is actively working on scaling the model and datasets, alongside developing additional optimization techniques to enable efficient large-scale training.
Beyond gaming, Oasis aims to expand into full interactive multimodal video generation, potentially revolutionizing how we interact with digital content and entertainment platforms.
