Facebook Achieves New Milestones with Open-Source Embodied AI Navigation Systems

Facebook’s new embodied AI systems are intended to capture and interpret 3D environments more accurately.

SoundSpaces simulation render. (Courtesy of Facebook AI.)

Facebook recently shared the latest milestones from its embodied AI research program. The company reported three this year: SoundSpaces, the first audio-visual platform for embodied AI; Semantic MapNet, a framework that can generate top-down semantic maps; and first place in the Habitat 2020 point-goal navigation challenge at the Conference on Computer Vision and Pattern Recognition (CVPR) 2020, won with a system capable of navigating and exploring spaces even without an immediate view of the area.

Varied as they are, the three milestones share one goal: navigating and interacting with the 3D world more accurately. Facebook has been persistent in its embodied AI research and development. Following the success last year of AI Habitat, a simulation platform capable of generating photorealistic 3D environments, the company has been pushing to produce more datasets for the technology.

What Is Embodied AI?

Embodied AI is a general term that typically refers to machine learning models installed on robots. Using neural networks, these robots can interact directly with their environments and, in many cases, navigate them. With AI Habitat, Facebook has already been experimenting with transferring a range of skills to a physical robot.

Facebook has already announced that these new tools will be open-sourced, in line with its commitment to contributing to the field of embodied AI navigation.

SoundSpaces

SoundSpaces, which was recently open-sourced, is an audio-visual platform for training robots with highly realistic acoustics in 3D environments. Robots can thus be trained to analyze the sounds in their immediate surroundings instead of relying on visual data alone. One real-life scenario shared by Facebook is a robotic home assistant tasked with locating a ringing smartphone. Relying on vision, the robot would have to check every room, whereas tracking the sound could be faster.

The platform will include a collection of audio files containing “geometrical acoustic simulations” that can be integrated into AI training. AI developers can use the data to create realistic-sounding simulations and to explore techniques such as echolocation and multimodal sensing. The data will also include information on how sound waves reflect off surfaces such as walls and how they interact with various materials.
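To give a sense of how simulated acoustics can become training audio, the sketch below convolves a "dry" source signal with a room impulse response, a standard way of rendering how a sound would be heard at a given position after reflecting off surfaces. It is an illustration only: the function name `render_at_listener` and the toy signal values are assumptions, not part of the SoundSpaces release.

```python
def render_at_listener(source, impulse_response):
    """Convolve a dry source signal with a room impulse response.

    Each sample of the impulse response describes how strongly the sound
    arrives after a given delay (direct path first, echoes later).
    """
    out = [0.0] * (len(source) + len(impulse_response) - 1)
    for i, sample in enumerate(source):
        for j, gain in enumerate(impulse_response):
            out[i + j] += sample * gain
    return out


# Hypothetical toy data: a single click, and an impulse response with a
# direct path followed by two progressively weaker echoes.
click = [1.0, 0.0, 0.0]
toy_impulse_response = [0.8, 0.0, 0.3, 0.0, 0.1]
heard = render_at_listener(click, toy_impulse_response)
```

Here `heard` contains the click followed by its two echoes. Swapping in a different impulse response would simulate a different room or listener position without changing the source sound.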

SoundSpaces aims to address gaps in present navigation technology, which, according to Facebook, still lacks a multimodal semantic understanding of the 3D world.

“To our knowledge, this is the first attempt to train deep reinforcement learning agents that both see and hear to map novel environments and localize sound-emitting targets,” said Facebook research scientists Kristen Grauman and Dhruv Batra in a recent blog post. “With this approach, we achieved faster training and higher accuracy in navigation than with single modality counterparts.”

Semantic MapNet

Semantic MapNet will also be open-sourced, allowing developers to integrate spatial memory into their robots for improved navigation. This end-to-end framework creates “mental maps” that let robots easily navigate a space should they encounter it again later. For example, when a robot is introduced to a new environment, Semantic MapNet can enable it to create a map that it can then use as a reference on future trips.

The platform also improves map quality, allowing the robot to detect even small and hard-to-see objects while remembering the positions of large ones. Previous embodied AI systems typically relied on standard computer vision to label the pixels in each image and then used those labels to generate a 2D map. Semantic MapNet improves on this by extracting visual features and projecting them into spatial memory with accurate semantic labels. This allows the AI to draw on multiple observations of a given point in its surroundings.
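The idea of projecting observations into a top-down memory and combining repeated views of the same point can be sketched in a few lines. This is a simplified illustration, not Facebook's implementation: it accumulates per-label scores rather than learned visual features, and the class name `TopDownMemory`, the cell size, and the sample observations are all hypothetical.

```python
from collections import defaultdict


class TopDownMemory:
    """Toy top-down grid that pools evidence from several observations."""

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size
        # Maps grid cell -> label -> accumulated score.
        self.scores = defaultdict(lambda: defaultdict(float))

    def _cell(self, x, z):
        # Convert floor-plane coordinates (in meters) to a grid cell index.
        return (int(x // self.cell_size), int(z // self.cell_size))

    def observe(self, x, z, label_scores):
        # Add one observation's label scores to the cell it projects onto.
        cell = self.scores[self._cell(x, z)]
        for label, score in label_scores.items():
            cell[label] += score

    def label_at(self, x, z):
        # Return the label with the most accumulated evidence, if any.
        cell = self.scores.get(self._cell(x, z))
        if not cell:
            return None
        return max(cell, key=cell.get)


# Hypothetical data: two views of roughly the same spot disagree.
memory = TopDownMemory(cell_size=0.5)
memory.observe(1.1, 2.3, {"chair": 0.4, "table": 0.5})  # distant, ambiguous view
memory.observe(1.2, 2.4, {"chair": 0.7, "table": 0.2})  # closer, clearer view
best_label = memory.label_at(1.2, 2.2)
```

Taken alone, the first observation would label the cell "table". Once the second view is pooled into the same cell, the accumulated evidence favors "chair", which is the kind of benefit that multiple observations of one point can provide.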

The Next Generation of Embodied AI

Beyond these new open-source systems, Facebook is publishing more research in embodied AI to spur its application in more devices. A number of its findings will be presented at this year’s European Conference on Computer Vision (ECCV), including a new algorithm for room navigation tasks using Habitat and work on embodied visual exploration. For more information, visit Facebook AI’s blog.
