When navigating an unfamiliar building, humans use a variety of visual, spatial, and semantic cues to reach their destination quickly. Exploiting semantic cues and statistical regularities in a new building is difficult for robotic agents, however. A common strategy is to use model-free reinforcement learning so that the agent implicitly learns what these cues are and how to apply them to navigation tasks. But navigation cues learned this way are expensive to acquire, hard to inspect, and difficult to reuse in another agent without starting over.
A world model is an attractive alternative for robotic navigation and planning agents. A world model encapsulates rich and relevant information about the agent's surroundings and allows the agent to make explicit predictions about actionable outcomes in its environment. Such models have produced impressive results and attracted wide interest in robotics, simulation, and reinforcement learning. However, real-world environments are far more complex and diverse than game environments.
Google AI recently introduced a new world model called Pathdreamer that generates high-resolution 360° visual observations of areas of a building the agent has not yet seen, using only a few seed observations and a proposed navigation trajectory. It can synthesize an immersive scene from a single viewpoint, predicting what an agent would see if it moved to a different viewpoint or even to an entirely unseen area, such as around a corner. This approach can also help autonomous agents navigate the real world by codifying knowledge about human environments.
Pathdreamer architecture
Pathdreamer takes a sequence of past observations as input and generates predictions for a trajectory of future locations. The trajectory can be provided up front or proposed iteratively by an agent that interacts with the returned observations. Both inputs and predictions consist of RGB images, semantic segmentation, and depth. Internally, Pathdreamer represents the surfaces of the world with a 3D point cloud. Each point in the cloud is labeled with its RGB color value and its semantic segmentation class, such as wall, chair, or table.
When predicting visual observations at a new location, the point cloud is first re-projected into 2D at that location to provide “guidance” images. Pathdreamer then uses these images to generate realistic high-resolution RGB, semantic segmentation, and depth. New observations (real or predicted) are accumulated in the point cloud as the model “moves”. Using a point cloud as memory has the advantage of temporal consistency: revisited regions are rendered in a way that agrees with previous observations.
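The accumulate-and-reproject loop can be illustrated with a minimal sketch. It is not Pathdreamer's implementation: the `Point` and `PointCloudMemory` names, the tiny 8×4 equirectangular grid, and the nearest-point depth test are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Point:
    xyz: tuple    # world-frame position in metres (z points up)
    rgb: tuple    # colour, 0-255 per channel
    label: str    # semantic class, e.g. "wall" or "chair"


class PointCloudMemory:
    """Hypothetical memory: a flat list of labelled 3D points."""

    def __init__(self):
        self.points = []

    def accumulate(self, new_points):
        # Real and predicted observations are stored the same way.
        self.points.extend(new_points)

    def reproject(self, camera_xyz, width=8, height=4):
        """Render sparse 360-degree guidance images as seen from camera_xyz."""
        rgb = [[None] * width for _ in range(height)]
        label = [[None] * width for _ in range(height)]
        depth = [[None] * width for _ in range(height)]
        cx, cy, cz = camera_xyz
        for p in self.points:
            dx, dy, dz = p.xyz[0] - cx, p.xyz[1] - cy, p.xyz[2] - cz
            dist = math.sqrt(dx * dx + dy * dy + dz * dz)
            if dist == 0:
                continue  # a point at the camera centre has no direction
            azimuth = math.atan2(dy, dx)                           # [-pi, pi]
            elevation = math.asin(max(-1.0, min(1.0, dz / dist)))  # [-pi/2, pi/2]
            col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
            row = min(height - 1, int((math.pi / 2 - elevation) / math.pi * height))
            # Keep only the nearest point per pixel so closer surfaces occlude.
            if depth[row][col] is None or dist < depth[row][col]:
                rgb[row][col], label[row][col], depth[row][col] = p.rgb, p.label, dist
        return rgb, label, depth


memory = PointCloudMemory()
memory.accumulate([
    Point((4.0, 0.0, 0.0), (200, 200, 200), "wall"),
    Point((2.0, 0.0, 0.0), (120, 60, 20), "chair"),
])
# From the origin the chair hides the wall behind it.
_, labels_a, depth_a = memory.reproject((0.0, 0.0, 0.0))
# After moving past the chair, the wall is visible in the same direction.
_, labels_b, depth_b = memory.reproject((3.0, 0.0, 0.0))
```

Pixels left as `None` are the parts of the view the memory knows nothing about. Those are the regions a generator would have to fill in.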
Pathdreamer works in two stages to convert guidance images into plausible, realistic outputs:
- In the first stage, the structure generator creates segmentation and depth images.
- In the second stage, the image generator renders them into RGB outputs.
The first stage hypothesizes a plausible high-level semantic representation of the scene, which the second stage then renders into a realistic color image. Both stages are based on convolutional neural networks.
Many different scenes are possible in regions of high uncertainty, such as a location around a corner or an unseen room. Following ideas from stochastic video generation, the structure generator is conditioned on a noise variable that represents stochastic information about the next location that is not captured in the guidance images. By sampling multiple noise variables, Pathdreamer can synthesize diverse scenes, allowing an agent to sample several plausible outcomes for a given trajectory. This diversity appears both in the first-stage outputs and in the generated RGB images.
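As a rough illustration of how a noise variable yields several plausible outcomes from the same guidance input, the toy sketch below replaces both convolutional networks with hand-written stand-ins. The function names, the class list, the color palette, and the choice to copy observed pixels through unchanged are all assumptions, not details of the actual model.

```python
import random

CLASSES = ["wall", "chair", "table"]                      # illustrative classes
PALETTE = {"wall": (200, 200, 200), "chair": (120, 60, 20), "table": (90, 50, 10)}


def structure_generator(guidance, noise):
    """Stage 1 stand-in: fill unseen pixels with a sampled class and depth."""
    rng = random.Random(noise)  # the noise variable seeds every stochastic choice
    structure = []
    for label, depth in guidance:
        if label is None:
            structure.append((rng.choice(CLASSES), round(rng.uniform(1.0, 5.0), 2)))
        else:
            structure.append((label, depth))  # observed pixels pass through
    return structure


def image_generator(structure):
    """Stage 2 stand-in: turn each semantic label into a color."""
    return [PALETTE[label] for label, _ in structure]


def dream(guidance, noise):
    structure = structure_generator(guidance, noise)
    return structure, image_generator(structure)


# One row of guidance pixels as (label, depth); None marks unseen pixels.
guidance = [("wall", 2.0), (None, None), (None, None), ("chair", 1.5)]

# Different noise values give different hypotheses for the unseen pixels.
samples = [dream(guidance, noise) for noise in range(3)]
for structure, rgb in samples:
    print(structure, rgb)
```

Because the stochastic choice is made in the first stage, any variation in the segmentation and depth carries through to the RGB output.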
Trained on images and 3D environment reconstructions from Matterport3D, Pathdreamer can synthesize realistic images as well as continuous video sequences. Because the output imagery is high-resolution and 360°, existing navigation agents can readily adapt it for use with any camera field of view.
Pathdreamer as a visual world model to improve task performance
The researchers applied Pathdreamer to the Vision-and-Language Navigation (VLN) task. In VLN, an embodied agent must navigate to a location in a realistic 3D environment by following natural language instructions. Using the Room-to-Room (R2R) dataset, they ran an experiment in which an instruction-following agent plans ahead by simulating many possible navigable paths through the environment, ranking each against the navigation instructions, and following the highest-ranked path. They analyzed three settings:
- Ground-Truth setting, in which the agent plans by interacting with the actual environment, that is, by moving.
- Baseline setting, in which the agent plans ahead without moving. Instead, it interacts with a navigation graph that encodes the navigable routes within the building but provides no visual observations.
- Pathdreamer setting, in which the agent plans ahead without moving by interacting with the navigation graph and receiving the corresponding visual observations generated by Pathdreamer.
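The plan-ahead procedure (enumerate candidate paths on the navigation graph, score each against the instruction, follow the best one) can be sketched as follows. The graph, the `dreamed` lookup that stands in for the world model, and the word-overlap scorer are all hypothetical. The actual agent uses a learned instruction-following model to rank paths.

```python
def enumerate_paths(graph, start, max_steps):
    """All simple paths from start with up to max_steps moves (dead ends kept)."""
    complete, frontier = [], [[start]]
    for _ in range(max_steps):
        next_frontier = []
        for path in frontier:
            successors = [n for n in graph[path[-1]] if n not in path]
            if not successors:
                complete.append(path)  # cannot be extended any further
            for n in successors:
                next_frontier.append(path + [n])
        frontier = next_frontier
    return complete + frontier


def score(path, instruction, dreamed):
    """Toy scorer: count instruction words matched by labels imagined on the path."""
    words = set(instruction.lower().split())
    seen = set()
    for node in path:
        seen |= dreamed[node]
    return len(words & seen)


# Hypothetical navigation graph: node -> neighbouring nodes.
graph = {
    "start": ["hall", "office"],
    "hall": ["start", "kitchen"],
    "office": ["start"],
    "kitchen": ["hall"],
}
# Stand-in for the world model: semantic labels it would predict at each node.
dreamed = {
    "start": {"door"},
    "hall": {"hall", "wall"},
    "office": {"desk"},
    "kitchen": {"kitchen", "table"},
}
instruction = "walk down the hall and stop at the kitchen table"

candidates = enumerate_paths(graph, "start", max_steps=3)
best = max(candidates, key=lambda p: score(p, instruction, dreamed))
print(best)
```

In this sketch, removing the `dreamed` lookup leaves the scorer with nothing but graph structure to rank paths on, which is the situation the Baseline setting describes.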
The team observed that in the Pathdreamer setting, when the agent plans ahead for three steps (approximately 6 m), it achieves a 50.4% navigation success rate. This is substantially higher than the 40.6% success rate in the Baseline setting, which suggests that Pathdreamer encodes useful and accessible visual, spatial, and semantic knowledge about real-world indoor environments. The agent's success rate is 59% in the Ground-Truth setting. However, the researchers point out that in this setting the agent must spend significant time and resources physically exploring a large number of trajectories, which would be prohibitive in a real-world deployment.
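To put the three reported success rates in proportion, the short calculation below works out how much of the Baseline-to-Ground-Truth gap the Pathdreamer setting recovers. The "gap closed" ratio is a derived quantity computed here from the reported figures, not a number reported by the researchers.

```python
# Reported navigation success rates, in percent.
baseline, pathdreamer, ground_truth = 40.6, 50.4, 59.0

gain = pathdreamer - baseline                    # improvement over Baseline
remaining = ground_truth - pathdreamer           # shortfall versus Ground-Truth
gap_closed = gain / (ground_truth - baseline)    # share of the gap recovered

print(f"gain: {gain:.1f} points")
print(f"remaining: {remaining:.1f} points")
print(f"gap closed: {gap_closed:.0%}")
```

The gain is 9.8 percentage points, leaving 8.6 points to the Ground-Truth setting, so planning with imagined observations recovers roughly 53% of the gap without the agent moving at all.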
Going forward, the team plans to apply Pathdreamer to other embodied navigation tasks, such as Object-Nav, continuous VLN, and street-level navigation.