Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting increasing interest due to their improving performance and broad applicability across domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is required. For example, when planning and perception are needed simultaneously, these models often fail because of their difficulty in merging multimodal information. To tackle this problem, fine-tuned models trained on ad-hoc data structures representing the environment are typically used. This solution has limited effectiveness, since it can make the context too complex to process. In this paper, we propose a multi-agent architecture for embodied task planning that operates without requiring specific data structures as input. Instead, it uses a single image of the environment and handles free-form domains by leveraging commonsense knowledge. We also propose PG2S, a novel, fully automatic evaluation procedure designed to better assess the quality of a plan. We validate our approach on the widely recognized ALFRED dataset and compare PG2S against the existing KAS metric to further assess the quality of the generated plans.
Our work is based on the idea of using a multi-agent architecture to solve the task of embodied planning. The agents are designed to work cooperatively, each one fulfilling a specific role, as sketched below.
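As a rough illustration of this design, the following is a minimal sketch of a cooperative pipeline in which agents with different roles each query a language model backend. The `describer`, `planner`, and `refiner` roles, their prompts, and the `query_model` stub are illustrative assumptions for this sketch, not the exact agents used in our architecture.

```python
# Minimal sketch of a cooperative multi-agent planning loop.
# Agent roles, prompts, and the query_model stub are illustrative
# assumptions, not the exact components described in the paper.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    name: str                     # role of the agent, e.g. "planner"
    prompt: str                   # role-specific instruction for each query
    model: Callable[[str], str]   # backend VLM/LLM call (stubbed below)

    def act(self, context: str) -> str:
        # Prepend the role instruction and query the backend model.
        return self.model(f"{self.prompt}\n\n{context}")


def query_model(prompt: str) -> str:
    """Placeholder for a real VLM/LLM call (e.g. an API request)."""
    return f"<model output for: {prompt[:40]}...>"


def plan_from_image(image_caption: str, task: str) -> List[str]:
    # One agent grounds the scene, one drafts the plan, one refines it.
    describer = Agent("describer", "List the objects visible in this scene.", query_model)
    planner = Agent("planner", "Write step-by-step actions for the task.", query_model)
    refiner = Agent("refiner", "Check the plan against the scene and fix errors.", query_model)

    scene = describer.act(image_caption)
    draft = planner.act(f"Scene: {scene}\nTask: {task}")
    final = refiner.act(f"Scene: {scene}\nPlan: {draft}")
    return final.splitlines()


if __name__ == "__main__":
    steps = plan_from_image("kitchen with a mug on the counter", "put the mug in the sink")
    print("\n".join(steps))
```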
The plans are evaluated on the ALFRED dataset. Here we show examples of the plans generated by our method. The environments are scene images of a home in which household tasks are performed, and the task descriptions are given in natural language. The plans generated by our method are compared with the ground truth.
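For illustration only, the sketch below compares a generated plan with a ground-truth plan using a simple fuzzy step-overlap score. This is not the PG2S metric itself; the example plans, the similarity threshold, and the helper functions are assumptions chosen to show the shape of such a comparison on ALFRED-style plans.

```python
# Illustrative plan-vs-ground-truth comparison. This is NOT the PG2S
# metric from the paper, only a simple step-overlap score showing how
# a generated plan can be compared with an ALFRED ground-truth plan.
from difflib import SequenceMatcher
from typing import List


def step_similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two natural-language plan steps."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def plan_overlap(generated: List[str], ground_truth: List[str], threshold: float = 0.6) -> float:
    """Fraction of ground-truth steps matched by some generated step."""
    matched = sum(
        1 for gt in ground_truth
        if any(step_similarity(gt, g) >= threshold for g in generated)
    )
    return matched / len(ground_truth) if ground_truth else 0.0


generated = ["pick up the mug", "walk to the sink", "put the mug in the sink"]
ground_truth = ["pick up the mug from the counter", "go to the sink", "place the mug in the sink"]
print(f"overlap: {plan_overlap(generated, ground_truth):.2f}")
```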
@article{brienza2024multi,
  title={Multi-agent Planning using Visual Language Models},
  author={Brienza, Michele and Argenziano, Francesco and Suriani, Vincenzo and Bloisi, Domenico D. and Nardi, Daniele},
  journal={arXiv preprint arXiv:2408.05478},
  year={2024}
}