Date: Thursday, October 24, 2024

Time: 5:00 PM – 7:00 PM ET / 2:00 PM – 4:00 PM PT

Location: Zoom (https://gatech.zoom.us/j/5825218212?pwd=NnBMcmNDTlFoNVcxTC91dndacFRadz09)

 

Committee:

Dr. Sehoon Ha (Advisor) – School of Interactive Computing, Georgia Institute of Technology

Dr. Dhruv Batra (Advisor) – School of Interactive Computing, Georgia Institute of Technology

Dr. Jie Tan – School of Interactive Computing, Georgia Institute of Technology

Dr. Vladlen Koltun – Distinguished Scientist, Apple

Dr. Mrinal Kalakrishnan – Research Lead, Meta

 

Title:

From Web to World: Harnessing Foundation Models for Intelligent Robotic Assistants in Real-World Environments

 

Abstract:

In this thesis, we explore how simulated embodied experience and spatial grounding can be leveraged to 'embody' foundation models for robotics, bridging the gap between their abstract reasoning capabilities and the physical realities of robotic interaction. We present three key contributions. First, Adaptive Skill Coordination (ASC) and Language-guided Skill Coordination (LSC) are approaches for open-vocabulary, long-horizon mobile manipulation tasks. They demonstrate how simulators can be used to develop fundamental sensorimotor skills, creating a robust 'body' of capabilities that foundation models can employ to interact with the real world. Second, Vision-Language Frontier Maps (VLFM) is an approach that combines pre-trained vision-language models with low-level navigation policies trained in simulation. By grounding pre-trained vision-language models with explicit spatial maps of the environment, VLFM enhances their ability to reason about and navigate in the real world. Third, we propose an approach to fine-tune vision-language models using simulated data to enhance their spatiotemporal reasoning for navigation tasks. We hypothesize that exposure to diverse simulated scenarios will give these models a more nuanced understanding of physical interactions, causality, and temporal dynamics. This research aims to create embodied AI systems that can leverage the strengths of foundation models while operating effectively in real-world environments.