Learning to Act with Affordance-Aware Multimodal Neural SLAM
Zhiwei Jia, Kaixiang Lin, Yizhou Zhao, Qiaozi Gao, Govind Thattai, Gaurav Sukhatme
UC San Diego, Amazon Alexa AI, University of Southern California
Introduction
Planning is a key bottleneck for Embodied AI tasks that involve both navigation and object manipulation. To tackle this, we propose a Neural SLAM approach that, for the first time, utilizes several modalities for exploration, predicts an affordance-aware semantic map, and then plans over it. This significantly improves exploration efficiency, leads to robust long-horizon planning, and enables effective vision-and-language grounding. With the proposed Affordance-aware Multimodal Neural SLAM (AMSLAM) approach, we obtain a more than 40% improvement over previously published work on the ALFRED benchmark (see the illustration on the right for a sample task).
Challenges in ALFRED
ALFRED is a popular Embodied AI benchmark hosted in AI2Thor that asks home-assistant agents to follow human instructions. It features multimodal inputs and requires a generalizable model for long-horizon planning as well as for effective exploration of indoor scenes. Our key observation is that affordance-aware navigation is another major bottleneck: navigation must satisfy the preconditions of the potential follow-up object manipulations. For instance, to “grab a beer from the fridge”, a robot should navigate to the right side of the fridge; otherwise, it cannot reach inside.
Affordance-aware Semantic Representation
To handle affordance-aware navigation, we propose waypoint-oriented semantic maps. Specifically, waypoints are agent poses from which each object type in the scene is interactable (e.g., only certain poses allow the robot to reach inside a fridge). These maps are first predicted from a series of egocentric observations by a module trained with privileged information in simulation and then aggregated according to odometry. As a result, planning becomes robust and as simple as finding paths across different waypoints for their associated manipulation tasks.
An illustration of the semantic map of a kitchen. Besides the navigable area (red), we show waypoints (stars and arrows) for certain object interactions.
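To make the representation concrete, here is a minimal sketch (not the authors' implementation) of a waypoint-oriented semantic map: navigable cells plus, for each object type, the poses from which it is interactable, so that planning reduces to finding a path to a suitable waypoint. The names SemanticMap, Waypoint, and plan_to_object are hypothetical.

from dataclasses import dataclass, field
from collections import deque

@dataclass(frozen=True)
class Waypoint:
    # An agent pose (grid cell + discrete heading) from which an object type is interactable.
    x: int
    y: int
    heading: int  # degrees: 0, 90, 180, 270

@dataclass
class SemanticMap:
    navigable: set                                   # (x, y) cells the agent can stand on
    waypoints: dict = field(default_factory=dict)    # object type -> list of Waypoint

    def add_waypoint(self, obj_type, wp):
        self.waypoints.setdefault(obj_type, []).append(wp)

    def plan_to_object(self, start, obj_type):
        # BFS over navigable cells to the nearest waypoint for obj_type.
        goals = {(w.x, w.y): w for w in self.waypoints.get(obj_type, [])}
        frontier, parents = deque([start]), {start: None}
        while frontier:
            cell = frontier.popleft()
            if cell in goals:                        # reached an interactable pose
                goal_wp, path, node = goals[cell], [], cell
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return list(reversed(path)), goal_wp
            x, y = cell
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt in self.navigable and nxt not in parents:
                    parents[nxt] = cell
                    frontier.append(nxt)
        return None, None                            # no reachable waypoint

# Usage: a 3x3 kitchen patch where the fridge is only interactable from one pose.
nav = {(x, y) for x in range(3) for y in range(3)}
smap = SemanticMap(navigable=nav)
smap.add_waypoint("Fridge", Waypoint(x=2, y=1, heading=270))
print(smap.plan_to_object(start=(0, 0), obj_type="Fridge"))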
Task-driven Multimodal Exploration
Effective exploration of the scene is a prerequisite for building the semantic maps required for Embodied AI tasks. Here we propose a task-driven multimodal exploration module that leverages the human instructions, originally intended for task execution, to predict exploration actions. The module is trained with supervised learning (behavior cloning), and we find it more efficient and effective than exploration policies with fewer modalities. More details are in the slides.
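The sketch below, assuming a PyTorch setup, illustrates the idea behind task-driven multimodal exploration: fuse the instruction with the egocentric view and behavior-clone exploration actions from expert trajectories. The module names and sizes are illustrative, not the paper's architecture.

import torch
import torch.nn as nn

class ExplorationPolicy(nn.Module):
    def __init__(self, vocab_size=1000, num_actions=5, dim=128):
        super().__init__()
        # language branch: embed and average-pool the instruction tokens
        self.embed = nn.Embedding(vocab_size, dim)
        # vision branch: a tiny CNN over the egocentric RGB frame
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
        )
        # fusion + action head over discrete exploration actions
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_actions))

    def forward(self, tokens, rgb):
        lang = self.embed(tokens).mean(dim=1)              # (B, dim)
        vis = self.cnn(rgb)                                # (B, dim)
        return self.head(torch.cat([lang, vis], dim=-1))   # action logits

# One behavior-cloning step: supervise against expert exploration actions (dummy data here).
policy = ExplorationPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (8, 20))                   # instruction token ids
rgb = torch.rand(8, 3, 224, 224)                           # egocentric frames
expert_actions = torch.randint(0, 5, (8,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(policy(tokens, rgb), expert_actions)
loss.backward()
optimizer.step()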
Affordance-aware Multimodal Neural SLAM
Here we present the full picture of our proposed neural SLAM-based system, abbreviated as AMSLAM. It consists of two phases.
Exploration phase: The agent explores the environment (task-driven multimodal exploration) and acquires the necessary information about the indoor scene (affordance-aware semantic representation).
Execution phase: The agent plans, navigates, and interacts with the scene according to the map, a path planner, and an object manipulation policy network. We use a learning-based subgoal parser to handle navigation and object manipulation (interaction) actions separately. See more details in the slides or in the paper.
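The following sketch outlines the two-phase loop described above with hypothetical component interfaces (explorer, mapper, subgoal_parser, path_planner, manip_policy); the real AMSLAM modules are learned and considerably more involved.

def run_episode(env, instruction, explorer, mapper, subgoal_parser,
                path_planner, manip_policy, max_explore_steps=200):
    # --- Exploration phase: build the affordance-aware semantic map ---
    semantic_map = mapper.new_map()
    obs = env.reset()
    for _ in range(max_explore_steps):
        action = explorer.act(instruction, obs)      # task-driven, multimodal
        obs = env.step(action)
        semantic_map = mapper.update(semantic_map, obs, env.odometry())

    # --- Execution phase: plan over the map, then navigate and manipulate ---
    for subgoal in subgoal_parser.parse(instruction):
        if subgoal.kind == "navigate":
            # path to a waypoint from which the target object is interactable
            path = path_planner.plan(semantic_map, env.pose(), subgoal.target)
            for nav_action in path:
                obs = env.step(nav_action)
        else:                                        # object manipulation subgoal
            manip_action = manip_policy.act(subgoal, obs)
            obs = env.step(manip_action)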
Key Takeaways
Affordance-aware navigation is a major bottleneck for long-horizon tasks in Embodied AI.
Multimodal signals including language can help to improve the exploration of the environment.
Affordance-aware representation learning of scenes is key to generalizable long-horizon planning.
Paper
Learning to Act with Affordance-Aware Multimodal Neural SLAM
Zhiwei Jia, Kaixiang Lin, Yizhou Zhao, Qiaozi Gao, Govind Thattai, Gaurav Sukhatme
Citation
If you find our approach useful or inspiring, please consider citing the paper as follows:
@InProceedings{jia2022learning,
title={Learning to Act with Affordance-Aware Multimodal Neural SLAM},
author={Jia, Zhiwei and Lin, Kaixiang and Zhao, Yizhou and Gao, Qiaozi and Thattai, Govind and Sukhatme, Gaurav},
booktitle={International Conference on Intelligent Robots and Systems (IROS)},
year={2022}
}