The first playable, real-time, open-world AI model that generates gameplay frame by frame, developed by Decart AI.
Oasis AI Minecraft, developed by Decart AI in collaboration with Etched, is an interactive video game generated end-to-end by a transformer, one frame at a time. No conventional game engine sits behind it: every frame the player sees is model output.
Unlike traditional games, Oasis takes in user keyboard and mouse input and generates real-time gameplay, internally simulating physics, game rules, and graphics. The model learned to allow users to move around, jump, pick up items, break blocks, and more, all by watching gameplay directly.
The project combines AI research with hardware-level optimization and marks a first step towards foundation models that simulate more complex interactive worlds, potentially replacing classic game engines in an AI-driven future.
We ran hundreds of architectural and data experiments to identify the best architecture for fast autoregressive interactive video generation. Unlike traditional bidirectional models, our architecture is specifically designed for real-time, frame-by-frame generation with user input conditioning.
Figure: Oasis's ViT + DiT architecture, featuring a Transformer-based variational autoencoder and accelerated spatiotemporal attention.
Oasis combines diffusion training with transformer models, an approach inspired by large language models (LLMs). The model generates video frame by frame, with each frame conditioned on the user's actions at that instant.
The architecture features a Transformer-based variational autoencoder (ViT VAE) that compresses each frame into a smaller representation, so the diffusion model can focus on higher-level characteristics. It is paired with an accelerated axial, causal spatiotemporal attention mechanism.
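To make "axial" and "causal" concrete, here is a minimal Python sketch. It is not Oasis's implementation. It assumes axial attention means splitting attention into a spatial pass within each frame and a temporal pass across frames at the same patch position, and that causal means a frame may only attend to itself and earlier frames. All function names and sizes are hypothetical.

```python
import math


def attention_pairs_full(t, h, w):
    """Query-key pairs when every token attends to every other token."""
    n = t * h * w
    return n * n


def attention_pairs_axial(t, h, w):
    """Query-key pairs when attention is split into two cheaper passes.

    Assumption: one spatial pass inside each frame, plus one temporal
    pass over the same patch position across frames.
    """
    spatial = t * (h * w) ** 2
    temporal = (h * w) * t ** 2
    return spatial + temporal


def causal_temporal_attention(q, k, v):
    """Single-head attention over the time axis for one patch position.

    q, k, v are lists of length T; each entry is a list of d floats.
    Frame i only sees frames 0..i, so future frames cannot leak in.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scores against the current and earlier frames only (causal mask).
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(wt * v[j][c] for j, wt in enumerate(weights)) / total
            for c in range(d)
        ])
    return out


if __name__ == "__main__":
    # Hypothetical sizes: 32 frames of 16 x 16 latent patches.
    print(attention_pairs_full(32, 16, 16), attention_pairs_axial(32, 16, 16))
```

The causal mask is what makes real-time play possible: because the output for a frame never depends on later frames, each new frame can be produced as soon as the previous ones exist.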
Unlike bidirectional models, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time rather than just rendering videos retroactively.
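The autoregressive loop can be illustrated with a toy Python sketch. The "model" here is a hypothetical stand-in (`toy_next_frame`) in which a frame is just an (x, y) position; in Oasis the equivalent step would be the learned diffusion transformer. The context length and action names are also assumptions made for illustration.

```python
from collections import deque


def toy_next_frame(context, action):
    """Hypothetical stand-in for the learned model.

    A 'frame' is an (x, y) position. The next frame is computed from
    the most recent frame and the current action only.
    """
    x, y = context[-1]
    moves = {"left": (-1, 0), "right": (1, 0), "jump": (0, 1), "none": (0, 0)}
    dx, dy = moves[action]
    return (x + dx, y + dy)


def rollout(first_frame, actions, next_frame=toy_next_frame, context_len=4):
    """Generate frames one at a time, each conditioned on a user action.

    Only past frames are available when a frame is generated, which is
    what lets the loop run while the user is still providing input.
    """
    context = deque([first_frame], maxlen=context_len)  # sliding window
    frames = [first_frame]
    for action in actions:
        frame = next_frame(list(context), action)
        context.append(frame)  # the new frame becomes context for the next
        frames.append(frame)
    return frames


if __name__ == "__main__":
    print(rollout((0, 0), ["right", "right", "jump"]))
```

A bidirectional video model would need the whole clip, including future frames, before producing any of it, so it could not react to an action the user has not taken yet.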
The model employs diffusion-forcing techniques and includes additional temporal attention layers interleaved between spatial attention layers to provide context from previous frames.
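A rough Python sketch of these two ideas follows. It assumes the usual reading of diffusion forcing, in which each frame in a training sequence gets its own independently sampled noise level, and it assumes that at generation time the context frames are kept clean while only the newest frame is denoised. Oasis's actual noise schedule and layer ordering may differ; all names below are hypothetical.

```python
import random


def noise_levels_for_training(num_frames, max_level, rng):
    """Sample an independent noise level for every frame in a sequence."""
    return [rng.randint(0, max_level) for _ in range(num_frames)]


def noise_levels_for_generation(num_context, max_level, step, num_steps):
    """Noise levels while denoising one new frame.

    Assumption: context frames stay clean (level 0), and the new frame
    moves linearly from max_level at step 0 to 0 at the final step.
    """
    current = max_level * (num_steps - step) // num_steps
    return [0] * num_context + [current]


def interleaved_layers(num_blocks):
    """Layer order with a temporal layer after each spatial layer."""
    layers = []
    for _ in range(num_blocks):
        layers += ["spatial_attention", "temporal_attention"]
    return layers


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the example is repeatable
    print(noise_levels_for_training(8, 1000, rng))
    print(noise_levels_for_generation(3, 1000, step=5, num_steps=10))
    print(interleaved_layers(2))
```

Training with mixed per-frame noise levels means the model has already seen sequences in which some frames are clean and others are noisy, which is the situation it faces when extending a clip by one frame.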
The team is actively working on scaling the model and datasets, alongside developing additional optimization techniques to enable efficient large-scale training.
Beyond gaming, Oasis aims to expand into full interactive multimodal video generation, potentially revolutionizing how we interact with digital content and entertainment platforms.