About AI Minecraft - Oasis AI Minecraft

The first playable, real-time, open-world AI model that generates gameplay on a frame-by-frame basis, developed by Decart AI.

🎮 Project Overview

Oasis AI Minecraft, developed by Decart AI in collaboration with Etched, represents a groundbreaking achievement in AI gaming technology. It's an interactive video game generated end-to-end by a transformer on a frame-by-frame basis.

Unlike traditional games, Oasis takes in user keyboard and mouse input and generates real-time gameplay, internally simulating physics, game rules, and graphics. By watching gameplay directly, the model learned to let users move around, jump, pick up items, break blocks, and more.

This revolutionary project combines cutting-edge AI research with advanced hardware optimization, marking the first step towards foundational models that simulate more complex interactive worlds, potentially replacing classic game engines in an AI-driven future.

⚡ Technical Architecture

🔄 Building a New Interactive Architecture

We ran hundreds of architectural and data experiments to identify the best architecture for fast autoregressive interactive video generation. Unlike traditional bidirectional models, our architecture is specifically designed for real-time, frame-by-frame generation with user input conditioning.

Figure: Oasis's ViT + DiT architecture, featuring a Transformer-based variational autoencoder and accelerated spatiotemporal attention.

🎯 Key Features

  • Frame-by-frame generation conditioned on user input
  • Transformer-based variational autoencoder (ViT VAE)
  • Accelerated axial, causal spatiotemporal attention mechanism

💡 Technical Innovations

  • Dynamic noise at inference time for increased stability
  • Optimized inference kernels for real-time performance
  • Additional temporal attention layers for frame context

🧠 AI Model Technology

Oasis combines diffusion training with transformer models, an approach inspired by large language models (LLMs). The model generates video frame by frame, conditioning each frame on the user's actions at that instant.

The architecture uses a Transformer-based variational autoencoder (ViT VAE) to compress each image so that the diffusion model can focus on higher-level characteristics, along with an accelerated axial, causal spatiotemporal attention mechanism.

🎯 Diffusion Model Innovation

Unlike bidirectional models, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time rather than just rendering videos retroactively.
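The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not Decart's implementation: `Action`, `generate_stream`, `toy_predictor`, and the `context_len` parameter are hypothetical names, and a frame is reduced to a short list of floats so the example runs with the standard library alone.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # stand-in for a latent frame; a real frame would be a tensor


@dataclass(frozen=True)
class Action:
    """Hypothetical per-frame user input (keys held and mouse movement)."""
    keys: frozenset
    mouse_dx: float = 0.0
    mouse_dy: float = 0.0


def generate_stream(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
    context_len: int = 4,  # assumed context window, not a documented value
) -> List[Frame]:
    """Autoregressive rollout: each new frame depends only on past frames
    and on the action supplied for the current step."""
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for action in actions:
        nxt = predict_next(list(context), action)
        context.append(nxt)  # the generated frame becomes context for later steps
        frames.append(nxt)
    return frames


def toy_predictor(context: Sequence[Frame], action: Action) -> Frame:
    """Placeholder for the model: shifts the last frame by 1 while 'w' is held."""
    step = 1.0 if "w" in action.keys else 0.0
    return [x + step for x in context[-1]]


actions = [Action(frozenset({"w"})), Action(frozenset()), Action(frozenset({"w"}))]
frames = generate_stream(toy_predictor, [0.0, 0.0], actions)
```

Because each step reads only earlier frames and the current action, input can be supplied one frame at a time as the user plays. A bidirectional model would need the whole clip, including future frames, before producing any output.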

The model employs diffusion-forcing techniques and includes additional temporal attention layers interleaved between spatial attention layers to provide context from previous frames.
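One way to picture the interleaved layers is to list which tokens each layer type can attend to. The sketch below assumes a latent video indexed as (frame, row, column) and a strict spatial/temporal alternation; the function names and the alternation order are illustrative assumptions, not details confirmed by the source.

```python
from typing import List, Tuple

Token = Tuple[int, int, int]  # (frame index t, row y, column x)


def spatial_neighbors(tok: Token, height: int, width: int) -> List[Token]:
    """Spatial layer: a token attends to every token in its own frame."""
    t, _, _ = tok
    return [(t, y, x) for y in range(height) for x in range(width)]


def temporal_neighbors(tok: Token) -> List[Token]:
    """Causal temporal layer: a token attends to the same spatial position
    in its own frame and in all earlier frames, never in later ones."""
    t, y, x = tok
    return [(s, y, x) for s in range(t + 1)]


def layer_schedule(num_blocks: int) -> List[str]:
    """Assumed ordering: one temporal layer after each spatial layer."""
    return [kind for _ in range(num_blocks) for kind in ("spatial", "temporal")]


token = (2, 0, 1)  # third frame, first row, second column
same_frame = spatial_neighbors(token, height=2, width=3)
past_frames = temporal_neighbors(token)
schedule = layer_schedule(2)
```

Splitting attention along axes this way means a token compares against `height * width` positions in a spatial layer and at most `t + 1` positions in a temporal layer, rather than against every token in every frame at once. The causal restriction in `temporal_neighbors` is what keeps a frame from depending on frames that have not been generated yet.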

⚙️ Performance & Optimization

🚀 Current Capabilities

  • Achieves 47 ms inference time per frame using Decart's proprietary inference framework
  • Runs at 360p resolution at 20 fps on NVIDIA H100 GPUs
  • Optimized for real-time web browser gameplay with minimal latency
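The latency and frame-rate figures above can be checked against each other with simple arithmetic. The sketch assumes frames are generated strictly one after another with no pipelining, and it ignores network and video-encoding overhead, which the source does not quantify; the helper names are illustrative.

```python
def max_fps(latency_ms: float) -> float:
    """Highest frame rate sustainable when each frame takes latency_ms."""
    return 1000.0 / latency_ms


def frame_budget_ms(fps: float) -> float:
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


inference_ms = 47.0  # per-frame inference time quoted above
target_fps = 20.0    # frame rate quoted above

budget = frame_budget_ms(target_fps)  # 50.0 ms available per frame
headroom = budget - inference_ms      # 3.0 ms left for everything else
ceiling = max_fps(inference_ms)       # roughly 21.3 fps at best
```

Under these assumptions the two figures agree: 47 ms of inference fits inside the 50 ms budget of a 20 fps stream, leaving about 3 ms of slack per frame.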

💫 Future Optimizations

  • Etched's Sohu chip will enable 4K resolution gameplay
  • Sohu can serve 10x more users than current hardware at the same price and power consumption
  • The goal is to make high-quality AI-generated gaming more accessible and cost-effective

📊 Sohu Enables 10x More Users

(Performance analysis using the Oasis architecture scaled up to 100B parameters)

  • Real-time frame rate: 20 FPS
  • Resolution supported by the Sohu chip: 4K
  • Speed relative to current models: 100x faster
  • Users served: 10x more

🔮 Future Development

🎯 Current Challenges

  • Improving model memory for better detail retention across frames
  • Enhancing output clarity and reducing haziness in certain situations
  • Handling edge cases and inputs outside the model's training distribution

🌟 Future Vision

The team is actively working on scaling the model and datasets, alongside developing additional optimization techniques to enable efficient large-scale training.

Beyond gaming, Oasis aims to expand into full interactive multimodal video generation, potentially revolutionizing how we interact with digital content and entertainment platforms.

📚 Documentation & Resources