Multi-agent Planning using Visual Language Models

ECAI 2024

Michele Brienza 1, Francesco Argenziano 1, Vincenzo Suriani 2, Domenico Daniele Bloisi 3,
Daniele Nardi 1
1Sapienza University of Rome 2University of Basilicata 3International University of Rome
Architecture
Complete and detailed architecture of the proposed method. The task description and the image of the environment are given as input to the agents, which extract meaningful information from the scene. Their output is then processed by the planner agent, which produces the final plan. The plan is then compared with the ground truth and evaluated according to our new metric, which takes semantically meaningful information into account.

Abstract

Large Language Models (LLMs) and Visual Language Models (VLMs) are attracting growing interest due to their increasing performance and their applicability to a variety of domains and tasks. However, LLMs and VLMs can produce erroneous results, especially when a deep understanding of the problem domain is needed. For example, when planning and perception are required simultaneously, these models tend to fail because of their difficulty in merging multi-modal information. To tackle this problem, fine-tuned models are typically employed, trained on ad-hoc data structures representing the environment. This solution has limited effectiveness, since it can make the context too complex to process. In this paper, we propose a multi-agent architecture for embodied task planning that operates without requiring specific data structures as input. Instead, it utilizes a single image of the environment, coping with free-form domains by leveraging commonsense knowledge. We also propose a novel, fully automatic evaluation procedure, PG2S, designed to better describe the quality of a plan. The widely recognized ALFRED dataset has been used to validate our approach. In particular, we compared PG2S against the existing KAS metric to further assess the quality of the obtained plans.

Approach

Our work is based on the idea of using a multi-agent architecture to address embodied task planning. The agents are designed to work cooperatively, each one with a specific role:

  • Semantic Knowledge Miner Agent: it is responsible for extracting meaningful relations among objects from the image of the environment.
  • Grounded Knowledge Miner Agent: it is responsible for grounding the task description in the image of the environment.
  • Planner Agent: it is responsible for generating the plan from the information provided by the previous agents (a minimal pipeline sketch follows this list).
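To make the data flow concrete, below is a minimal, hypothetical sketch of the three-agent pipeline. The `query_vlm` callable and the prompt wording are placeholders for illustration only; they are not the actual prompts or VLM backbone used in the paper.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical VLM interface: takes a text prompt and an image path and
# returns the model's textual answer. The actual backbone and prompts
# used in the paper may differ.
VLMQuery = Callable[[str, str], str]

@dataclass
class MultiAgentPlanner:
    query_vlm: VLMQuery

    def semantic_knowledge(self, image_path: str) -> str:
        # Semantic Knowledge Miner: relations among objects in the scene.
        return self.query_vlm(
            "List the meaningful spatial and semantic relations among the "
            "objects visible in this scene.", image_path)

    def grounded_knowledge(self, task: str, image_path: str) -> str:
        # Grounded Knowledge Miner: grounds the task description in the image.
        return self.query_vlm(
            f"Given the task '{task}', describe which objects in the scene "
            "are relevant and how they relate to the task.", image_path)

    def plan(self, task: str, image_path: str) -> str:
        # Planner: combines the two knowledge sources into a step-by-step plan.
        relations = self.semantic_knowledge(image_path)
        grounding = self.grounded_knowledge(task, image_path)
        return self.query_vlm(
            "You are a task planner. Using the scene relations:\n"
            f"{relations}\nand the grounded task knowledge:\n{grounding}\n"
            f"produce a numbered step-by-step plan for: {task}", image_path)
```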
Starting from an image of the environment and the task description, the agents cooperate to generate the plan. To evaluate the quality of the plan, we propose a new metric, PG2S, that takes semantically meaningful information into account. Combining a goal-wise similarity and a sentence-wise similarity, PG2S compares the predicted plan against the ground-truth one. The goal-wise similarity is computed by matching the main actions of the two plans sentence by sentence; the actions are obtained by extracting the main verbs and nouns from each sentence using POS tagging. The sentence-wise similarity is computed by comparing the sentences of the two plans using sentence transformers.
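For reference, the following is a minimal sketch of how the two components of PG2S could be computed with off-the-shelf tools (spaCy for POS tagging, sentence-transformers for sentence embeddings). The aggregation shown here (term overlap for the goal-wise part, best-match cosine similarity for the sentence-wise part, and an equal-weight combination) is an illustrative assumption; the exact matching and weighting scheme is the one defined in the paper.

```python
import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")                 # POS tagger
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # sentence embeddings

def extract_actions(step: str) -> set[str]:
    # Keep the main verbs and nouns of a plan step (lemmatized).
    return {t.lemma_.lower() for t in nlp(step) if t.pos_ in ("VERB", "NOUN")}

def goal_wise_similarity(pred_steps: list[str], gt_steps: list[str]) -> float:
    # Illustrative: fraction of ground-truth action terms that are
    # recovered anywhere in the predicted plan.
    gt_terms = set().union(*(extract_actions(s) for s in gt_steps))
    pred_terms = set().union(*(extract_actions(s) for s in pred_steps))
    return len(gt_terms & pred_terms) / max(len(gt_terms), 1)

def sentence_wise_similarity(pred_steps: list[str], gt_steps: list[str]) -> float:
    # Illustrative: average cosine similarity between each ground-truth
    # step and its best-matching predicted step.
    gt_emb = encoder.encode(gt_steps, convert_to_tensor=True)
    pred_emb = encoder.encode(pred_steps, convert_to_tensor=True)
    sims = util.cos_sim(gt_emb, pred_emb)          # shape: (len(gt), len(pred))
    return sims.max(dim=1).values.mean().item()

def pg2s(pred_steps: list[str], gt_steps: list[str], alpha: float = 0.5) -> float:
    # Illustrative combination of the two components (equal weighting assumed).
    return (alpha * goal_wise_similarity(pred_steps, gt_steps)
            + (1 - alpha) * sentence_wise_similarity(pred_steps, gt_steps))
```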

Experiments

The plans are evaluated on the ALFRED dataset. Here we show examples of the plans generated by our method. The environments are scene images of homes, and the tasks are household tasks described in natural language. The plans generated by our method are compared with the ground truth.

BibTeX

@article{brienza2024multi,
  title={Multi-agent Planning using Visual Language Models},
  author={Brienza, Michele and Argenziano, Francesco and Suriani, Vincenzo and Bloisi, Domenico D and Nardi, Daniele},
  journal={arXiv preprint arXiv:2408.05478},
  year={2024}
}