An interactive robot framework accomplishes long-horizon task planning and can easily generalize to new goals or distinct tasks, even during execution. However, most traditional methods require predefined module designs, which makes it hard to generalize to different goals. Recent large language model (LLM) based approaches allow for more open-ended planning but often require heavy prompt engineering or domain-specific pretrained models. To tackle this, we propose a simple framework that achieves interactive task planning with language models. Our system incorporates both high-level planning and low-level function execution via language. We verify the robustness of our system in generating novel high-level instructions for unseen objectives, and its ease of adaptation to different tasks by merely substituting the task guidelines, without the need for additional complex prompt engineering. Furthermore, when the user sends a new request, our system replans precisely based on that request, the task guidelines, and the previously executed steps.
The ITP system integrates advanced language and vision models to facilitate efficient robotic task understanding and execution. Below is a breakdown of the three major components:
1. Visual Scene Grounding: This module transforms visual data into a natural-language representation. We use Grounding-DINO to process the visual input. In the context of our drink-making system, it discerns ingredients and their respective locations in the form of approximate (x, y) coordinates (see the scene-grounding sketch after this list).
2. LLMs for Planning and Execution: We employ GPT-4 as our LLM backbone and use two language agents. The high-level planner takes as input a given prompt, the task guidelines, and a user request, and outputs a step-by-step plan to fulfill the request. The second LLM, provided with information about the scene and the robot's skills, takes each generated step and attempts to execute it (a sketch of this two-agent structure follows the list).
3. Robot Skill Grounding: The language model interfaces with a predefined skill set in Python that controls the robot. These skills are translated into a functional API by parsing their function definitions and associated docstrings, which can be used directly with GPT's function-calling interface (see the skill-grounding sketch below).
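As a concrete illustration of the scene-grounding step, the sketch below runs Grounding-DINO on a workspace image and verbalizes the detections for the downstream LLM. This is a minimal sketch, not our implementation: it assumes the open-source `groundingdino` inference utilities, and the config/checkpoint paths, image file, ingredient prompt, and thresholds are all placeholders.

```python
# A minimal sketch of visual scene grounding with Grounding-DINO.
# Assumes the open-source groundingdino package; paths, image, prompt,
# and thresholds below are placeholders.
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("workspace.jpg")

# Query the detector for the ingredients the drink-making task cares about.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="cup . milk . boba . syrup",
    box_threshold=0.35,
    text_threshold=0.25,
)

# Boxes come back as normalized (cx, cy, w, h); convert the centers to pixel
# coordinates and verbalize them so the execution LLM can read the scene as text.
h, w, _ = image_source.shape
scene_lines = [
    f"{phrase} at approximately (x={cx * w:.0f}, y={cy * h:.0f})"
    for phrase, (cx, cy, _, _) in zip(phrases, boxes.tolist())
]
scene_description = "Detected objects: " + "; ".join(scene_lines)
print(scene_description)
```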
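The two-agent structure of the planning and execution component can be sketched as follows. This assumes the `openai` Python client with a `gpt-4` model; the system prompts, guidelines, and example request are illustrative stand-ins, not the actual prompts from our system.

```python
# A minimal sketch of the planner/executor agent pair via the OpenAI chat API.
# The prompt text here is illustrative, not the system's actual prompts.
from openai import OpenAI

client = OpenAI()

def plan(task_guidelines: str, user_request: str) -> list[str]:
    """High-level planner: turn a user request into a step-by-step plan."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a robot task planner. Follow the task "
                        "guidelines and output one numbered step per line."},
            {"role": "system", "content": task_guidelines},
            {"role": "user", "content": user_request},
        ],
    )
    return [s for s in resp.choices[0].message.content.splitlines() if s.strip()]

def execute_step(scene_description: str, step: str) -> str:
    """Execution agent: carry out one plan step given the observed scene."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You control a robot arm. Use the available skills "
                        "to carry out the given step."},
            {"role": "system", "content": scene_description},
            {"role": "user", "content": step},
        ],
        # In the full system, robot skills would be exposed here via
        # function calling; see the skill-grounding sketch below.
    )
    return resp.choices[0].message.content

scene = "Detected objects: cup at (x=320, y=240); milk at (x=500, y=180)"
for step in plan("Guidelines: make drinks from detected ingredients.",
                 "I'd like a boba milk tea."):
    print(execute_step(scene, step))
```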
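Skill grounding amounts to turning Python signatures and docstrings into function-calling schemas. The sketch below shows one way to do this with `inspect` and the current `tools` interface of the OpenAI client; the `pour` skill, the type mapping, and the dispatch are simplified assumptions rather than our actual API.

```python
# A minimal sketch of robot skill grounding: building a GPT function-calling
# schema from a Python skill's signature and docstring. The `pour` skill and
# the type mapping are simplified assumptions.
import inspect
import json

from openai import OpenAI

def pour(ingredient: str, amount_ml: float) -> None:
    """Pour the given amount of the named ingredient into the cup."""
    print(f"pouring {amount_ml} ml of {ingredient}")  # placeholder robot control

def skill_to_tool(fn) -> dict:
    """Translate a skill's signature and docstring into an OpenAI tool schema."""
    type_map = {str: "string", float: "number", int: "integer", bool: "boolean"}
    params = {
        name: {"type": type_map[p.annotation]}
        for name, p in inspect.signature(fn).parameters.items()
    }
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn),
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Pour 50 ml of milk."}],
    tools=[skill_to_tool(pour)],
)

# Dispatch the model-chosen skill (assumes the model issued a tool call).
call = resp.choices[0].message.tool_calls[0]
if call.function.name == "pour":
    pour(**json.loads(call.function.arguments))
```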
A detailed diagram of our system, including the prompts we used at each stage, is shown below.
@misc{li2023itp,
    title={Interactive Task Planning with Language Models},
    author={Boyi Li and Philipp Wu and Pieter Abbeel and Jitendra Malik},
    year={2023},
}