An open-source framework that enables multimodal AI models such as GPT-4-Vision to operate a computer. Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
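The view-decide-act cycle described above can be pictured as a simple loop. The following is a minimal sketch only, not the framework's actual code: `Action`, `run_operator`, and the three callables (`capture_screen`, `decide_action`, `perform_action`) are hypothetical names introduced for illustration, and the `max_steps` cap is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema, for illustration)."""
    kind: str                  # "click", "type", or "done"
    x: Optional[int] = None    # screen coordinates for a click
    y: Optional[int] = None
    text: Optional[str] = None # text for a keyboard action


def run_operator(
    objective: str,
    capture_screen: Callable[[], bytes],
    decide_action: Callable[[str, bytes, List[Action]], Action],
    perform_action: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat: look at the screen, ask the model for an action, carry it out."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                # input: what a human would see
        action = decide_action(objective, screenshot, history)
        history.append(action)
        if action.kind == "done":                    # model reports the objective is met
            break
        perform_action(action)                       # output: mouse or keyboard event
    return history


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a real model, screen, or input device.
    scripted = iter([
        Action("click", x=100, y=200),
        Action("type", text="hello"),
        Action("done"),
    ])
    performed: List[Action] = []
    history = run_operator(
        "Say hello",
        capture_screen=lambda: b"",
        decide_action=lambda objective, shot, past: next(scripted),
        perform_action=performed.append,
    )
    print([a.kind for a in history])
```

Passing the three steps in as callables keeps the loop independent of any particular model API or input library; in a real setup, `decide_action` would send the screenshot to a multimodal model and `perform_action` would drive the mouse and keyboard.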