PixelPlayer is a system that learns to localize the sounds that correspond to individual image regions in videos. The audio track of the whole video is separated into a set of components, each representing the sound of one pixel. The system is trained on a large number of videos that show people playing different instruments. No supervision is provided regarding which instruments appear in each video, where they are located, or how they sound. Click on different positions in the video on the right to hear the sounds that correspond to the selected area of the picture. The original video is shown on the left.
Credits: Computer Science and Artificial Intelligence Laboratory, and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology (MIT): Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba