LOST-3DSG: Lightweight Open-Vocabulary 3D Scene Graphs with Semantic Tracking in Dynamic Environments

WACV 2026

¹Sapienza University of Rome   ²International University of Rome UNINT
Architecture
Overview of the LOST-3DSG framework. The Perception Module processes RGB and depth images to build the 3D Scene Graph using open-vocabulary detection and semantic attribute extraction. The Scene Update Module maintains object identity over time using the Lost Similarity Function (LSF), which combines semantic, chromatic, material, and description similarities for robust tracking in dynamic environments.

Abstract

Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient because they rely on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the need to store dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking.

Approach

LOST-3DSG consists of two main components: the Perception Module and the Scene Update Module. The key contributions of our work are:

  • Lightweight 3D Scene Graph: We introduce a memory-efficient representation that uses word2vec and sentence embeddings instead of dense CLIP features, reducing the memory footprint by 99.5% (from ~641 MB to ~3 KB); a minimal encoding sketch is shown below.
  • Lost Similarity Function (LSF): A composite metric that combines semantic similarity (labels), chromatic similarity (colors), material similarity, and description similarity to robustly associate objects across time; a sketch of the LSF and the association step is also shown below.
  • Semantic Tracking Algorithm: Our Scene Update Module operates in two modes (exploration and tracking), maintaining object identity through semantic attributes rather than visual features alone.
  • Real-world Validation: We evaluate our system on a TIAGo robot across three scenarios of increasing complexity, demonstrating accurate tracking in dynamic environments with minimal computational overhead.

The system processes RGB-D observations to extract object labels, colors, materials, and fine-grained descriptions using a Vision-Language Model. These semantic attributes are then encoded using lightweight embeddings and used for temporal association, enabling the robot to recognize when objects move, disappear, or reappear in the scene.
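
As a rough illustration of the lightweight representation, the sketch below encodes each object's attributes as a handful of small dense vectors. The model choices (gensim's word2vec-google-news-300 vectors and the all-MiniLM-L6-v2 sentence encoder), the averaging of word vectors for multi-word attributes, and the encode_object helper are illustrative assumptions, not the exact components used in LOST-3DSG.

    # Hypothetical per-object encoding: small word2vec / sentence vectors
    # instead of dense visual features. Model choices are assumptions.
    import numpy as np
    import gensim.downloader
    from sentence_transformers import SentenceTransformer

    word2vec = gensim.downloader.load("word2vec-google-news-300")  # 300-d word vectors
    sent_encoder = SentenceTransformer("all-MiniLM-L6-v2")         # 384-d sentence vectors

    def encode_object(label, color, material, description):
        """Encode one detected object's semantic attributes as small dense vectors."""
        def word_vec(text):
            # Average word2vec vectors over the tokens of a (possibly multi-word) attribute.
            tokens = [t for t in text.lower().split() if t in word2vec]
            return np.mean([word2vec[t] for t in tokens], axis=0) if tokens else np.zeros(300)

        return {
            "label":       word_vec(label),                   # 300 floats
            "color":       word_vec(color),                   # 300 floats
            "material":    word_vec(material),                # 300 floats
            "description": sent_encoder.encode(description),  # 384 floats
        }

    obj = encode_object("mug", "red", "ceramic",
                        "a small red ceramic mug with a chipped handle")
    print(sum(v.nbytes for v in obj.values()))  # a few KB per object in float32

Storing only these small vectors per node, rather than dense visual embeddings, is what drives the memory reduction reported above.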
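
The Lost Similarity Function and the association step can be sketched in the same spirit: a weighted combination of cosine similarities over the four attribute embeddings, followed by a one-to-one matching of new detections against existing scene-graph nodes. The equal weights, the 0.7 threshold, the greedy matching strategy, and the lsf/associate helpers below are placeholder assumptions rather than the paper's exact formulation.

    # Hypothetical LSF: weighted sum of cosine similarities over the four
    # attribute embeddings, plus a greedy association step. Weights, threshold,
    # and matching strategy are placeholder assumptions.
    import numpy as np

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    def lsf(node, detection, weights=(0.25, 0.25, 0.25, 0.25)):
        """Combine semantic, chromatic, material, and description similarities."""
        keys = ("label", "color", "material", "description")
        return sum(w * cosine(node[k], detection[k]) for w, k in zip(weights, keys))

    def associate(scene_nodes, detections, threshold=0.7):
        """Greedily match new detections to tracked scene-graph nodes.

        Detections with no match above the threshold are treated as new objects;
        nodes left unmatched become candidates for 'moved' or 'disappeared' status.
        """
        matches, used = {}, set()
        for det_id, det in detections.items():
            best_id, best_score = None, threshold
            for node_id, node in scene_nodes.items():
                if node_id in used:
                    continue
                score = lsf(node, det)
                if score > best_score:
                    best_id, best_score = node_id, score
            if best_id is not None:
                matches[det_id] = best_id
                used.add(best_id)
        return matches

In this sketch, exploration mode would simply add unmatched detections as new nodes, while tracking mode would reuse the same association routine to update nodes whose objects have moved.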

BibTeX

    @misc{ferraina2026lost3dsglightweightopenvocabulary3d,
          title={LOST-3DSG: Lightweight Open-Vocabulary 3D Scene Graphs with Semantic Tracking in Dynamic Environments}, 
          author={Sara Micol Ferraina and Michele Brienza and Francesco Argenziano and Emanuele Musumeci and Vincenzo Suriani and Domenico D. Bloisi and Daniele Nardi},
          year={2026},
          eprint={2601.02905},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2601.02905}, 
    }