Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning
Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki
TLDR: Multisensory pretraining enhances RL for contact-rich tasks by learning expressive representations through masked autoencoding.

Contact-rich robot manipulation demands tight integration of vision, force, and proprioception. Our new work, Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning, introduces MSDP — a framework that uses masked autoencoding and cross-modal sensor fusion to learn expressive multisensory representations, paired with a novel asymmetric actor-critic architecture for efficient real-robot RL.
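To make the pretraining idea concrete, here is a minimal toy sketch of masked autoencoding over multisensory tokens. All names, dimensions, and the linear "decoder" are illustrative assumptions for exposition, not MSDP's actual architecture: the real method learns a cross-modal fusion encoder, whereas this sketch only shows the mask-and-reconstruct objective on concatenated vision, force, and proprioception tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality tokens in a shared embedding space
# (dimensions chosen for illustration, not taken from the paper).
D = 16  # shared embedding dimension
tokens = {
    "vision": rng.normal(size=(8, D)),    # e.g. 8 image-patch tokens
    "force": rng.normal(size=(1, D)),     # force-torque reading token
    "proprio": rng.normal(size=(1, D)),   # joint-state token
}

def masked_autoencode_loss(tokens, mask_ratio=0.5, rng=rng):
    """One masked-autoencoding step: hide a random subset of the
    concatenated multisensory tokens, predict them with a toy linear
    decoder from the visible tokens, and return the MSE on the
    masked tokens (the self-supervised training signal)."""
    x = np.concatenate(list(tokens.values()), axis=0)  # (N, D)
    n = x.shape[0]
    n_masked = max(1, int(mask_ratio * n))
    masked_idx = rng.choice(n, size=n_masked, replace=False)

    # Stand-in "decoder": a fixed random linear map applied to the
    # mean of the visible tokens. A real model would learn this.
    visible = np.delete(x, masked_idx, axis=0)
    W = rng.normal(scale=0.1, size=(D, D))
    pred = visible.mean(axis=0) @ W                    # (D,) per masked slot
    recon_err = ((x[masked_idx] - pred) ** 2).mean()
    return recon_err, masked_idx

loss, idx = masked_autoencode_loss(tokens)
print(f"masked tokens: {len(idx)}, reconstruction MSE: {loss:.3f}")
```

In a trained model, minimizing this reconstruction loss forces the encoder to infer missing modalities from the observed ones, which is what yields fused, expressive multisensory representations for the downstream RL policy.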
MSDP achieves ~90% success on challenging manipulation tasks using only 6,000 real-robot interactions, with the full pipeline completing in under 55 minutes. Adding a force-torque sensor alone improves performance by 14%. The method is robust to sensor noise, variable stiffness, external disturbances, and varying lighting conditions.
Check out the website for robot videos: https://msdp-pearl.github.io
