SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience

Abstract

Parkour poses a significant challenge for legged robots, requiring navigation through complex environments with agility and precision based on limited sensory inputs. In this work, we introduce a novel method for training end-to-end visual policies, from depth pixels to robot control commands, to achieve agile and safe quadruped locomotion.

We formulate robot parkour as a constrained reinforcement learning (RL) problem designed to maximize the emergence of agile skills within the robot's physical limits while ensuring safety. We first train a policy without vision using privileged information about the robot's surroundings. We then generate experience from this privileged policy to warm-start a sample-efficient off-policy RL algorithm from depth images. This allows the robot to adapt behaviors from this privileged experience to visual locomotion while circumventing the high computational costs of RL directly from pixels.

We demonstrate the effectiveness of our method on a real Solo-12 robot, showcasing its capability to perform a variety of parkour skills such as walking, climbing, leaping, and crawling.

Method

We frame agile locomotion over challenging terrains from depth images as a Constrained Reinforcement Learning (RL) problem. Our method employs both on-policy and off-policy versions of Constraints as Terminations (CaT) to ensure constraint satisfaction.
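
As a rough illustration of how constraints can act as terminations, the sketch below assumes a formulation in which each constraint violation is mapped to a termination probability that scales the reward down. The function names, the per-constraint cap `p_max`, and the normalization by `max_violations` are illustrative assumptions and are not specified on this page.

```python
def termination_probability(violations, max_violations, p_max):
    """Map per-constraint violations to one termination probability.

    violations:     constraint values c_i, where c_i > 0 means violated
    max_violations: normalization scale for each constraint (assumed)
    p_max:          cap on the termination probability per constraint (assumed)
    """
    delta = 0.0
    for c, c_max, p in zip(violations, max_violations, p_max):
        positive_part = max(c, 0.0)             # satisfied constraints contribute 0
        scale = c_max if c_max > 0.0 else 1.0   # guard against division by zero
        delta = max(delta, p * min(positive_part / scale, 1.0))
    return delta


def constrained_reward(reward, delta):
    """Scale the reward down by the probability of terminating."""
    return (1.0 - delta) * reward


# Hypothetical example: one mildly violated constraint, one satisfied.
delta = termination_probability([0.5, -1.0], [1.0, 1.0], [0.5, 1.0])
scaled = constrained_reward(2.0, delta)
```

Under these assumptions, a larger violation lowers the reward the policy can still collect, so constraint satisfaction is encouraged without a separate penalty weight per constraint.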

Training a policy end-to-end from depth images to robot actions using RL would ideally enable the robot to fully leverage its sensory inputs and hardware capabilities. However, rendering depth images in simulation is computationally expensive. We propose a two-stage approach that makes visual RL sample-efficient enough for effectively training agile and safe locomotion policies in the Isaac Gym simulator:

Stage 1: We train a privileged policy that uses information with low computational overhead (a heightmap scan of the robot's surroundings and the height of nearby floating objects) instead of depth images. We employ PPO with CaT to maximize terrain traversal while adhering to safety and style constraints.
Stage 2: The Stage 1 policy is used to generate a dataset of rollouts, collecting depth images in the process. We leverage this privileged experience to warm-start a sample-efficient constrained off-policy RL algorithm based on RLPD with CaT. This allows us to directly train the visual policy to aggressively maximize the constrained RL objective while circumventing the computational cost of visual RL from scratch.
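
The warm-start in Stage 2 can be pictured with a minimal sketch of RLPD-style symmetric sampling, where each training batch mixes transitions from the privileged dataset with transitions collected by the visual policy. The helper name `sample_symmetric_batch`, the 50/50 split, and the tuple-based transitions are assumptions made for illustration; the exact sampling scheme used in this work is not stated on this page.

```python
import random


def sample_symmetric_batch(privileged_buffer, online_buffer, batch_size, rng=random):
    """Draw a batch that is half privileged data and half online data.

    privileged_buffer: transitions generated by the Stage 1 policy
    online_buffer:     transitions collected by the visual policy so far
    rng:               any object with a `choice` method (assumed interface)
    """
    if not online_buffer:
        # Before any online data exists, fall back to privileged data only.
        return [rng.choice(privileged_buffer) for _ in range(batch_size)]
    half = batch_size // 2
    offline_part = [rng.choice(privileged_buffer) for _ in range(half)]
    online_part = [rng.choice(online_buffer) for _ in range(batch_size - half)]
    return offline_part + online_part


# Hypothetical transitions, tagged with their origin for clarity.
privileged = [("privileged", i) for i in range(10)]
online = [("online", i) for i in range(10)]
batch = sample_symmetric_batch(privileged, online, 8, rng=random.Random(0))
```

Sampling with replacement keeps the sketch short; a practical replay buffer would typically store arrays of observations, actions, rewards, and termination signals.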

The resulting visual policy is directly transferred to the real Solo-12 robot equipped with an egocentric depth camera.

BibTeX

@inproceedings{chanesane2024solo,
    title = {SoloParkour: Constrained Reinforcement Learning for Visual Locomotion from Privileged Experience},
    author = {Elliot Chane-Sane and Joseph Amigo and Thomas Flayols and Ludovic Righetti and Nicolas Mansard},
    booktitle = {Conference on Robot Learning (CoRL)},
    year = {2024},
}