Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution

IROS 2024

Francesco Argenziano 1, Michele Brienza 1, Vincenzo Suriani 2, Daniele Nardi 1, Domenico Daniele Bloisi 3
1Sapienza University of Rome 2University of Basilicata 3International University of Rome
Architecture
Complete architecture of EMPOWER, from the task description to the execution of the plan in the world. The RGB image is used to extract a graph of the scene, as well as the final plan and the object labels. These labels are then grounded via an NLP pipeline and reprojected onto the point clouds extracted from the robot's depth image. Finally, reference points are computed from these point clouds to facilitate actions in the world. Use case illustrated: order the objects on the table from the highest to the lowest.
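The reprojection step in the caption above (2-D object labels mapped onto depth-derived point clouds, then reduced to reference points) can be sketched with a standard pinhole back-projection. This is a minimal illustration, not the paper's exact procedure: the intrinsics, the pixel-list mask representation, and the centroid-as-reference-point choice are all assumptions.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into a 3-D camera-frame point
    using the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def reference_point(mask_pixels, depth, fx, fy, cx, cy):
    """Compute a reference point for a grounded label: back-project every
    masked pixel that has valid (positive) depth and average the results."""
    pts = [backproject(u, v, depth[v, u], fx, fy, cx, cy)
           for (u, v) in mask_pixels if depth[v, u] > 0]
    return np.mean(pts, axis=0)
```

In practice the mask would come from the grounded label's segmentation, and the resulting 3-D reference point would be handed to the executor as the target of a manipulation action.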

Abstract

Task planning for robots in real-life settings presents significant challenges. These challenges stem from three primary issues: the difficulty in identifying grounded sequences of steps to achieve a goal; the lack of a standardized mapping between high-level actions and low-level commands; and the challenge of maintaining low computational overhead given the limited resources of robotic hardware. We introduce EMPOWER, a framework designed for open-vocabulary online grounding and planning for embodied agents aimed at addressing these issues. By leveraging efficient pre-trained foundation models and a multi-role mechanism, EMPOWER demonstrates notable improvements in grounded planning and execution. Quantitative results highlight the effectiveness of our approach, achieving an average success rate of 0.73 across six different real-life scenarios using a TIAGo robot.

Approach

Our work consists of three components: a planner that generates a plan from a task description using a multi-role architecture; a grounding model that anchors the plan in the environment using NLP techniques; and an executor that runs the plan on a robot by mapping high-level actions to low-level commands. The contributions of our work are the following:

  • We propose a multi-role, single-body hierarchical subtask division that enables complex reasoning about the scene and allows for the creation of plans that can be grounded in the environment;
  • We present a complete pipeline that incorporates FMs for extracting meaningful information from the world, while being resilient to the online constraint given by the limited computational resources of real robots;
  • We provide a flexible solution that, thanks to our novel open-vocabulary framework, grounds and deploys plans across different robotic platforms in the environment;
  • Our approach enables planning for embodied agents in an open-vocabulary setting through NLP-driven semantic comprehension of scene elements.

We first assess the improvement in high-level planning by comparing our multi-role VLM-based architecture to a single-role approach. Subsequently, we evaluate the system's full planning capabilities, from high to low level, by executing tasks with a TIAGo robot in real environments and measuring the success rate.
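The multi-role, single-body mechanism can be sketched as a sequence of role-specific prompts posed to one shared underlying model, with each role's output appended to the context seen by the next. The role names and the `query_vlm` interface below are illustrative assumptions for the sketch, not the paper's actual prompts or API.

```python
from typing import Callable

# Illustrative roles: one shared model body queried under several personas.
ROLES = [
    ("scene describer", "List the objects visible in the image."),
    ("planner", "Write a step-by-step plan that achieves the task."),
    ("verifier", "Check that every step refers only to listed objects."),
]

def multi_role_plan(task: str, query_vlm: Callable[[str, str], str]) -> str:
    """Run each role in sequence, feeding it the task description plus the
    accumulated context produced by all previous roles."""
    context = f"Task: {task}"
    for role, instruction in ROLES:
        reply = query_vlm(f"You are a {role}. {instruction}", context)
        context += f"\n[{role}] {reply}"
    return context
```

Splitting the reasoning into specialist roles lets each prompt stay short and focused, while the single shared body keeps the memory footprint compatible with onboard robotic hardware.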

Experiments

Our tests focus on six representative use cases that we designed to test the planning capabilities of the LLMs on indoor tasks requiring multi-step reasoning and manipulation. These cases are: sort objects by their height; grab a jacket from the coat rack; throw the objects into the correct recycling bins; order the shelf to have two objects per level; order the shelf by the objects' material; and exit the room.

BibTeX

@article{argenziano2024empowerembodiedmultiroleopenvocabulary,
  title={EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution},
  author={Francesco Argenziano and Michele Brienza and Vincenzo Suriani and Daniele Nardi and Domenico D. Bloisi},
  year={2024},
  eprint={2408.17379},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2408.17379},
}