About Oasis AI Minecraft

The first playable, real-time, open-world AI model that generates gameplay frame by frame, developed by Decart AI.

🎮 Project Overview

Oasis AI Minecraft, developed by Decart AI in collaboration with Etched, represents a groundbreaking achievement in AI gaming technology. It’s an interactive video game generated end-to-end by a transformer on a frame-by-frame basis.

Unlike traditional games, Oasis takes in user keyboard and mouse input and generates real-time gameplay, internally simulating physics, game rules, and graphics. The model learned to let users move around, jump, pick up items, break blocks, and more, entirely by watching gameplay footage.

This project combines cutting-edge AI research with advanced hardware optimization, marking a first step toward foundation models that simulate more complex interactive worlds and could eventually replace classic game engines.

⚡ Technical Architecture

🔄 Building a New Interactive Architecture

We ran hundreds of architectural and data experiments to identify the best architecture for fast autoregressive interactive video generation. Unlike traditional bidirectional models, our architecture is specifically designed for real-time, frame-by-frame generation with user input conditioning.

Oasis's ViT + DiT architecture, featuring a Transformer-based variational autoencoder and accelerated spatiotemporal attention

🎯 Key Features

  • Frame-by-frame generation conditioned on user input
  • Transformer-based variational autoencoder (ViT VAE)
  • Accelerated axial, causal spatiotemporal attention mechanism
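To make the "frame-by-frame generation conditioned on user input" concrete, here is a minimal autoregressive rollout sketch. The model itself is replaced by a hypothetical toy function (`generate_frame`, a stand-in with made-up dynamics, not Decart's actual model); what matters is the loop structure: each frame is produced from a sliding window of recent frames plus the user's action at that instant.

```python
import numpy as np

def generate_frame(history, action, rng):
    """Stand-in for the diffusion transformer: produces one frame
    conditioned on past frames and the current user action.
    The blending weights here are illustrative, not real."""
    base = np.mean(history, axis=0) if history else np.zeros((8, 8))
    return 0.9 * base + 0.1 * action + 0.01 * rng.standard_normal((8, 8))

def play(actions, context_len=4, seed=0):
    """Autoregressive rollout: each new frame depends on the user's
    input *now*, unlike bidirectional video models that require the
    whole clip to be specified up front."""
    rng = np.random.default_rng(seed)
    frames = []
    for a in actions:
        history = frames[-context_len:]  # sliding context window
        frames.append(generate_frame(history, a, rng))
    return frames

frames = play(actions=[0.0, 1.0, 1.0, 0.0, -1.0])
print(len(frames), frames[0].shape)  # one frame per input step
```

The key property is that `play` never looks ahead: the interaction loop can accept a new action every step, which is what makes the model playable rather than a video generator.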

💡 Technical Innovations

  • Dynamic noise at inference time for increased stability
  • Optimized inference kernels for real-time performance
  • Additional temporal attention layers for frame context
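The "dynamic noise at inference time" idea can be sketched as follows. This is a hedged illustration (the function name `noise_context` and the `sigma` value are assumptions, not from Decart): because a diffusion model is trained to denoise, lightly re-noising the conditioning frames at inference keeps long autoregressive rollouts from drifting on their own accumulated errors.

```python
import numpy as np

def noise_context(context_frames, sigma=0.1, rng=None):
    """Perturb the conditioning frames with small Gaussian noise at
    inference time (illustrative sigma). A denoising-trained model
    tolerates this, and it masks compounding generation artifacts."""
    rng = rng or np.random.default_rng()
    return [f + sigma * rng.standard_normal(f.shape) for f in context_frames]
```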

🧠 AI Model Technology

Oasis utilizes a combination of diffusion training and transformer models, inspired by advances in large language models (LLMs). The model generates video on a frame-by-frame basis, conditioned on user actions at each instant.

The architecture features a Transformer-based variational autoencoder (ViT VAE) to compress each frame and let the diffusion model focus on higher-level characteristics, along with an accelerated axial, causal spatiotemporal attention mechanism.
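The first step of a ViT-style encoder is to split each frame into non-overlapping patches that become tokens, shrinking the sequence the diffusion model must process. The sketch below shows only that patchify step with assumed sizes (the patch size and frame resolution are illustrative, not Oasis's actual configuration).

```python
import numpy as np

def patchify(img, p=16):
    """Split an HxWxC image into non-overlapping p x p patches
    (ViT-style tokenization). H and W must be divisible by p."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)  # (num_tokens, patch_dim)

# Illustrative sizes: a 352x640 frame becomes a short token sequence,
# so the diffusion model works on hundreds of tokens, not ~675k pixels.
frame = np.zeros((352, 640, 3))
tokens = patchify(frame, p=16)
print(tokens.shape)  # (22*40, 16*16*3) = (880, 768)
```

A real ViT VAE would then project these patch tokens through transformer layers into a lower-dimensional latent sequence; the compression ratio is what makes real-time diffusion feasible.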

🎯 Diffusion Model Innovation

Unlike bidirectional models, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time rather than just rendering videos retroactively.

The model employs diffusion-forcing techniques and includes additional temporal attention layers interleaved between spatial attention layers to provide context from previous frames.
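The interleaving of spatial and causal temporal attention can be sketched as one axial block. This is a bare single-head illustration with no learned projections (an assumption made to keep the sketch short, not the production architecture): spatial attention mixes tokens within each frame, then temporal attention lets each spatial position attend only to the same position in earlier frames.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores) @ v

def axial_block(x):
    """One interleaved block over x of shape (T, S, D): T frames,
    S spatial tokens per frame, D channels. Spatial attention runs
    within each frame; causal temporal attention then runs over T
    at each spatial position, seeing past frames only."""
    T, S, D = x.shape
    x = x + attention(x, x, x)                 # spatial: attend over S
    xt = x.swapaxes(0, 1)                      # (S, T, D)
    causal = np.tril(np.ones((T, T), dtype=bool))
    xt = xt + attention(xt, xt, xt, mask=causal)  # temporal: attend over T
    return xt.swapaxes(0, 1)

out = axial_block(np.random.default_rng(0).standard_normal((4, 6, 8)))
print(out.shape)  # (4, 6, 8)
```

Factorizing full spatiotemporal attention into these two axial passes cuts the cost from O((T·S)²) to O(T·S·(T+S)), and the causal mask is what permits frame-by-frame autoregressive generation.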

⚙️ Performance & Optimization

🚀 Current Capabilities

  • Achieves 47ms inference time per frame using Decart’s proprietary inference framework
  • Runs at 360p resolution at 20fps on NVIDIA H100 GPUs
  • Optimized for real-time web browser gameplay with minimal latency
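The two performance figures above are consistent with each other, as a quick check shows: a 47 ms per-frame budget caps throughput at roughly 21 fps, just above the quoted 20 fps.

```python
# 47 ms per frame -> maximum sustainable frame rate
frame_ms = 47
max_fps = 1000 / frame_ms
print(round(max_fps, 1))  # 21.3
```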

💫 Future Optimizations

  • Etched’s Sohu chip will enable 4K resolution gameplay
  • Can serve 10x more users than current hardware at the same price and power consumption
  • Aims to make high-quality AI-generated gaming more accessible and cost-effective

📊 Sohu Enables 10x More Users

(Bar chart: users served, 0–80, on 8x H100 vs. 8x Sohu, showing roughly a 10x increase)

(Performance analysis using Oasis architecture scaled up to 100B params)

  • 20 FPS — real-time frame rate
  • 4K — resolution supported by the Sohu chip
  • 100x — faster than current models
  • 10x — more users served

🔮 Future Development

🎯 Current Challenges

  • Improving model memory for better detail retention across frames
  • Enhancing output clarity and reducing haziness in certain situations
  • Handling edge cases and inputs outside the model’s training distribution

🌟 Future Vision

The team is actively working on scaling the model and datasets, alongside developing additional optimization techniques to enable efficient large-scale training.

Beyond gaming, Oasis aims to expand into full interactive multimodal video generation, potentially revolutionizing how we interact with digital content and entertainment platforms.