Open-Vocabulary 3D Scene Understanding towards Embodied Manipulation

Staff - Faculty of Informatics

Date: 7 June 2024 / 16:15 - 17:00

USI East Campus, Room C1.04

Speaker: Francis Engelmann - ETH Zurich

Abstract: 3D scene understanding is a key ability that allows humans (and many other living species) to navigate and interact with the environment around us. Bringing these capabilities to intelligent devices (e.g., household robots, smart glasses) is a key effort in current 3D vision research and embodied AI. In this talk, I will present general deep learning models that address a wide variety of 3D scene understanding tasks across multiple modalities, including 3D instance segmentation, vectorized floorplan estimation, and human body-part segmentation.

In the second part of the talk, I will discuss multi-modal foundation models for 3D scene understanding, in particular large vision-language models (VLMs), which enable possibilities well beyond conventional closed-set 3D vision methods that are constrained to predefined object categories. With this new paradigm, we can relax these strict constraints and obtain open-vocabulary 3D scene representations for querying arbitrary object classes, recognizing scene functionalities and affordances, and more.

Biography: Francis Engelmann is a postdoctoral researcher at ETH Zurich collaborating with Prof. Marc Pollefeys, and a visiting researcher at Google Zurich collaborating with Federico Tombari. His current research interests lie at the intersection of deep learning, computer vision, and large vision-language models. His research focuses on 3D scene understanding and representations for open-vocabulary search and manipulation. Prior to joining ETH Zurich, he obtained his Ph.D. from RWTH Aachen under Prof. Bastian Leibe. Francis is a Fellow of the ETH AI Center, a member of the ELLIS Society, and a recipient of the ETHZ Career Seed Award and an SNSF Postdoc.Mobility fellowship.

Host: Prof. Marc Langheinrich