Learning To Play Minecraft With Video PreTraining (VPT)

The internet contains an enormous amount of publicly available video that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, or a Minecraft player build an intricate house. However, these videos only record what happened, not precisely how it was achieved: it is impossible to know the exact sequence of mouse movements and keypresses that were used. If we want to build large-scale foundation models in these domains as we have in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where "action labels" are simply the next words in a sentence.

We introduce Video PreTraining (VPT), a simple yet effective semi-supervised imitation learning method that makes the most of the unlabeled video data on the internet. We begin by gathering a small dataset from contractors, recording not only their video but also the actions they take, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in a video. Importantly, the IDM can use both past and future information to infer the action at each step. This task is much easier, and thus requires far less data, than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
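
As a rough, non-authoritative sketch of this pipeline, the PyTorch snippet below trains a non-causal IDM on a small labeled dataset, uses it to pseudo-label unlabeled video, and then behaviorally clones a policy on those pseudo-labels. Everything here (IDMNet, PolicyNet, the discretized action space, and all shapes) is an illustrative stand-in, not the released code or architecture.

```python
# Minimal sketch of the VPT pipeline; all names and shapes below are illustrative stand-ins.
import torch
import torch.nn as nn

NUM_ACTIONS = 128          # assumed size of a discretized keyboard+mouse action space
FRAME_SHAPE = (3, 64, 64)  # assumed downsampled frame size

class IDMNet(nn.Module):
    """Inverse dynamics model: non-causal, so it may use past AND future frames of a clip."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(start_dim=2),
                                     nn.Linear(3 * 64 * 64, 256), nn.ReLU())
        self.head = nn.Linear(256, NUM_ACTIONS)

    def forward(self, frames):        # frames: (batch, time, C, H, W)
        feats = self.encoder(frames)  # (batch, time, 256); the real IDM mixes information across time
        return self.head(feats)       # per-frame action logits

class PolicyNet(IDMNet):
    """Behavioral-cloning policy: in the real model this is causal (past frames only)."""

def train_step(model, frames, actions, opt):
    loss = nn.functional.cross_entropy(model(frames).flatten(0, 1), actions.flatten())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# 1) Train the IDM on the small contractor dataset (video plus recorded keypresses/mouse moves).
idm = IDMNet(); idm_opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
contractor_frames = torch.randn(4, 16, *FRAME_SHAPE)            # stand-in batch of clips
contractor_actions = torch.randint(0, NUM_ACTIONS, (4, 16))
train_step(idm, contractor_frames, contractor_actions, idm_opt)

# 2) Use the trained IDM to pseudo-label a much larger corpus of unlabeled internet video.
web_frames = torch.randn(4, 16, *FRAME_SHAPE)
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)

# 3) Behavioral cloning: train the causal policy on the IDM-labeled internet video.
policy = PolicyNet(); policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
train_step(policy, web_frames, pseudo_actions, policy_opt)
```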

VPT Zero-Shot Results

We validated our method in Minecraft because (1) it is one of the most played video games in the world and therefore has a wealth of freely available video data, and (2) it is open-ended, offering a wide range of activities similar to real-world applications such as computer usage. Unlike prior Minecraft AI works that use simplified action spaces, our AI uses the native human interface: the keyboard and mouse at a 20Hz framerate.
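
For concreteness, a single step of this native interface might be represented as something like the following; the field names and values are purely hypothetical and do not come from the released code.

```python
# Hypothetical per-step action at the native human interface (illustrative field names only).
action = {
    "keys": {"forward": 1, "jump": 0, "attack": 1, "sneak": 0},  # binary keypresses this tick
    "mouse_dx": 4.0,    # horizontal camera/mouse movement this tick
    "mouse_dy": -1.5,   # vertical camera/mouse movement this tick
}
# At a 20Hz framerate, the agent emits one such action every 50 ms of game time.
```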

Our behavioral cloning model, the "VPT foundation model", was trained on 70,000 hours of IDM-labeled online video. It can accomplish tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds, or 1,000 consecutive game actions.

Additionally, the model can perform other complex skills that are common in the game such as swimming and hunting animals for food. It also learned the skill of "pillar jumping", a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself.

Fine-Tuning with Behavioral Cloning

Foundation models are designed to have a broad behavior profile and to be capable across a wide range of tasks. It is common to fine-tune foundation models on smaller, more specific datasets to incorporate new knowledge or specialize them to a narrower task distribution. As a case study in how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in brand-new Minecraft worlds and build a house from basic Minecraft materials, with the goal of improving the foundation model's reliability at "early game" skills such as building crafting tables. Indeed, we see a dramatic improvement in the model's ability to reliably execute these early-game skills. The fine-tuned model also learns to craft wooden and stone tools, and we sometimes even see basic shelter construction and the agent searching through villages and raiding chests.
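
A minimal sketch of what this behavioral-cloning fine-tuning could look like, assuming the policy is an ordinary PyTorch module; the stand-in architecture, checkpoint name, and hyperparameters below are assumptions for illustration, not the released training code.

```python
# Illustrative BC fine-tuning sketch; architecture, checkpoint path, and batch are stand-ins.
import torch
import torch.nn as nn

NUM_ACTIONS = 128
policy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, NUM_ACTIONS))  # stand-in policy
# policy.load_state_dict(torch.load("vpt_foundation.pt"))  # in practice, start from foundation weights
opt = torch.optim.Adam(policy.parameters(), lr=2e-5)        # assumed smaller learning rate than pretraining

# Stand-in batch from the contractor house-building episodes (frames plus recorded actions).
frames = torch.randn(8, 3, 64, 64)
actions = torch.randint(0, NUM_ACTIONS, (8,))

loss = nn.functional.cross_entropy(policy(frames), actions)  # same BC objective as pretraining
opt.zero_grad(); loss.backward(); opt.step()
```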

BC fine-tuning results in improved early game behavior

Data Scaling

Perhaps the most important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM (as part of the VPT pipeline) than it is to directly train a BC foundation model from that same small contractor dataset. To test this hypothesis, we train foundation models on increasing amounts of data, ranging from 1 to 70,000 hours. Models trained on less than 2,000 hours of data use contractor data with ground-truth labels; models trained on more than 2,000 hours use internet data labeled by our IDM. We then take each foundation model and fine-tune it on the house-building dataset described in the previous section.
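
To make the setup concrete, the sketch below enumerates one possible sweep; only the 1-hour and 70,000-hour endpoints and the 2,000-hour contractor-data threshold come from the text above, and the intermediate sizes are assumed.

```python
# Hypothetical data-scaling sweep; intermediate hour counts are assumed for illustration.
SWEEP_HOURS = [1, 10, 100, 1_000, 2_000, 20_000, 70_000]

for hours in SWEEP_HOURS:
    # At or below the 2,000-hour contractor threshold we can use ground-truth action labels;
    # beyond that, training data is internet video labeled by the IDM.
    label_source = "contractor ground truth" if hours <= 2_000 else "IDM pseudo-labels"
    print(f"foundation model trained on {hours:>6} h -> labels: {label_source}")
```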

Effect of foundation model training data on fine-tuning

As the amount of foundation model training data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.

Fine-Tuning and Reinforcement Learning

When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL because emulating human behavior is likely far more helpful than taking random actions. We gave our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft that is made all the more difficult when using the native human interface.

Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make this task tractable, we reward agents for each item in the sequence.
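
A minimal sketch of such a shaped reward, assuming a one-time milestone reward for each item on the standard Minecraft path to a diamond pickaxe; the per-item reward value and the reward-once-per-episode rule are assumptions of this sketch rather than the paper's exact scheme.

```python
# Illustrative milestone-based reward shaping for the diamond pickaxe task.
MILESTONES = [
    "log", "planks", "stick", "crafting_table", "wooden_pickaxe",
    "cobblestone", "stone_pickaxe", "furnace", "iron_ore", "iron_ingot",
    "iron_pickaxe", "diamond", "diamond_pickaxe",
]

def milestone_reward(inventory: dict, already_rewarded: set) -> float:
    """Reward the agent the first time each milestone item appears in its inventory."""
    reward = 0.0
    for item in MILESTONES:
        if inventory.get(item, 0) > 0 and item not in already_rewarded:
            already_rewarded.add(item)
            reward += 1.0  # assumed per-milestone reward value
    return reward

# Example: the agent crafts planks after having already been rewarded for collecting a log.
rewarded = {"log"}
print(milestone_reward({"log": 3, "planks": 4}, rewarded))  # -> 1.0 (planks newly rewarded)
```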

RL policies trained from a random initialization (the standard RL method) barely achieve any reward: they rarely learn to collect logs or sticks at all. In stark contrast, fine-tuning from the VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all the items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes proficient humans over 20 minutes (24,000 actions) on average.

Reward over training episodes

Conclusion

VPT opens the door to agents learning to act from the many videos available on the internet. Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.

Please refer to our paper for more details. We are also releasing our contractor data, Minecraft model code, and model weights to support future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year: contestants can use our models to solve many difficult Minecraft tasks. Those interested can check out the competition webpage and compete for a blue-sky prize of $100,000, in addition to a regular prize pool of $20,000.
