The Visual Microphone: Passive Recovery of Sound from Video

HowlingUlf · August 5, 2014, 10:27am

Strophoid · August 5, 2014, 9:55pm

That is amazing, actually really the kind of stuff I’d like to be researching in my studies the coming year!

Early21 · August 5, 2014, 11:24pm

Fascinating. I will put off that Neumann purchase because I already have an empty bag of chips! It does seem that the bag of chips is hearing something we are not hearing… an alien or something.

mozizo · August 5, 2014, 11:39pm

Amazing…
wonder how it captures this

alexis · August 11, 2014, 5:12am

I call BS, just on a gut feeling.

Then, trying to think about it more quantitatively: the bag of chips/leaves will move as a summation of all the sounds in the room. Trying to deconstruct one “voice” from among all the sounds (radio in the background, phones ringing, air conditioner noise, air conditioner-caused air currents inducing potato chip bag movements) … wouldn’t it be analogous to figuring out the tune of a song by inspection of the .wav file shape?

Further, they say they can detect frequencies several times higher than the frequency/frame rate of the video. Ready to be corrected here … doesn’t that contradict Nyquist … wouldn’t it all be aliasing without the ability to extract signal?
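To put a number on the Nyquist worry, here is a minimal sketch of what happens to a pure tone when it is sampled uniformly at a video frame rate. The function name and the 440 Hz / 60 fps figures are illustrative assumptions, not values from the video or the paper.

```python
def aliased_frequency(f_signal_hz: float, frame_rate_hz: float) -> float:
    """Apparent frequency of a pure tone after uniform sampling.

    A tone folds back to its distance from the nearest integer
    multiple of the sampling rate.
    """
    nearest_multiple = round(f_signal_hz / frame_rate_hz) * frame_rate_hz
    return abs(f_signal_hz - nearest_multiple)


# Hypothetical numbers: a 440 Hz tone filmed at 60 frames per second.
print(aliased_frequency(440.0, 60.0))  # 20.0, indistinguishable from a real 20 Hz tone
# A tone below the 30 Hz Nyquist limit is left where it is.
print(aliased_frequency(25.0, 60.0))   # 25.0
```

With one sample per frame, anything above half the frame rate lands on top of a lower frequency, which is exactly the objection raised here.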

Strophoid · August 11, 2014, 9:41am

Of course you can only extract all the sounds in the room, but this is experimental so they can test it in silent environments.
As for the Nyquist frequency: yes, you are right, but with the rolling shutter in a consumer camera you can take multiple time samples from the same frame, which raises the effective sampling rate and with it the Nyquist frequency.

ggc · August 11, 2014, 12:20pm

I’m not too convinced…

But,

What use would this have? (Apart from spy agency usage.)

alexis · August 11, 2014, 12:56pm

The science bits were just off the cuff on my part, it’s just a gut feeling (non-empirically supportable) that feeds my skepticism at this point!

But I’d be interested to know more about the part you wrote above which I bolded … “multiple samples from the same frame” in particular … wow!

Strophoid · August 11, 2014, 8:52pm

Haven’t got the time to go into it right now, but read this:

You’ll see it takes parts of the image at a different time, so if you know the pattern it makes, you have information from several time-instances within the same frame.
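A rough sketch of that idea, assuming a sensor that reads out one row at a time with a fixed delay between rows. The row count, frame rate and line delay below are made-up illustrative values, and `row_sample_times` is a hypothetical helper, not anything taken from the paper.

```python
def row_sample_times(frame_index, rows, frame_rate_hz, line_delay_s):
    """Capture time of every row in one frame of a rolling-shutter sensor."""
    frame_start_s = frame_index / frame_rate_hz
    return [frame_start_s + r * line_delay_s for r in range(rows)]


# Hypothetical camera: 60 fps, 1080 rows, 10 microseconds between rows.
FRAME_RATE_HZ = 60.0
ROWS = 1080
LINE_DELAY_S = 1.0e-5

times = row_sample_times(0, ROWS, FRAME_RATE_HZ, LINE_DELAY_S)

# Within a frame, consecutive rows are one line delay apart, so the
# effective sampling rate is set by the line delay, not the frame rate.
effective_rate_hz = 1.0 / LINE_DELAY_S

# Readout does not fill the whole frame period, so there is a blind gap
# before the next frame starts and the samples are not evenly spaced.
gap_s = 1.0 / FRAME_RATE_HZ - ROWS * LINE_DELAY_S

print(f"effective rate within a frame: {effective_rate_hz:.0f} Hz")
print(f"blind gap between frames: {gap_s * 1e3:.2f} ms")
```

Under these assumptions each frame carries about a thousand time samples instead of one, at the cost of a gap between frames that any reconstruction would have to deal with.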

alexis · August 12, 2014, 6:51pm

Fascinating, thank you for the link, Strophoid!

So, to my feeble way of thinking, this technology is essentially using potato chip bags, plant leaves, and the like as lo-fi mechanical transducers of the sound pressure waves … does that sound right?

I’m just thinking the sound pressure wave-to-object motion coupling is going to be way too crude to “decode” meaningfully … that there would be too low a S/N ratio. I’m even wondering whether the examples were too good to be true!

But I’m keeping an open mind! (Maybe by mathematically combining the observed motion of multiple objects at once, the S/N ratio could be increased to a meaningful level …?).
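As a sanity check on that last thought, here is a small simulation. It assumes, optimistically, that every object carries the same signal, perfectly time-aligned, with independent Gaussian noise of equal strength; real objects would not behave this neatly, and all the numbers are invented for illustration.

```python
import math
import random


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB, given the clean reference signal."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


random.seed(0)
N_SAMPLES = 2000
N_OBJECTS = 16   # hypothetical number of objects tracked in the frame
NOISE_STD = 2.0  # noise much stronger than the unit-amplitude signal

clean = [math.sin(2 * math.pi * 5 * i / N_SAMPLES) for i in range(N_SAMPLES)]

# One noisy motion trace per object, all driven by the same sound.
observations = [
    [s + random.gauss(0.0, NOISE_STD) for s in clean]
    for _ in range(N_OBJECTS)
]

# Average the traces sample by sample.
averaged = [sum(column) / N_OBJECTS for column in zip(*observations)]

single_db = snr_db(clean, observations[0])
combined_db = snr_db(clean, averaged)
print(f"one object: {single_db:.1f} dB, {N_OBJECTS} objects averaged: {combined_db:.1f} dB")
```

With independent noise, averaging N traces cuts the noise power by roughly a factor of N, so 16 objects should buy about 12 dB. That helps, but it does not turn a hopeless recording into a clean one.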

Thanks again!

Strophoid · August 12, 2014, 11:39pm

I’m not sure how limited this is actually, I ‘think’ objects will resonate at the frequencies of the sound quite nicely.
Note that in the examples used, they had a very simple signal consisting of just a few sines at different frequencies. In silent surroundings I reckon with some smart gating in the frequency domain you can get a decent reproduction this way.
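A minimal sketch of what gating in the frequency domain could look like for such a signal: two sines plus noise, keeping only the frequency bins that stand well above the rest. The threshold rule, signal and noise level are assumptions for illustration, and a plain DFT is used to keep the example free of dependencies; this is not the processing used in the paper.

```python
import cmath
import math
import random


def dft(x):
    """Discrete Fourier transform (direct O(n^2) form, fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstructed signal."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def spectral_gate(x, threshold_ratio):
    """Zero every frequency bin weaker than threshold_ratio times the strongest bin."""
    spectrum = dft(x)
    peak = max(abs(c) for c in spectrum)
    gated = [c if abs(c) >= threshold_ratio * peak else 0j for c in spectrum]
    return idft(gated)


def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


random.seed(1)
N = 256
clean = [math.sin(2 * math.pi * 8 * t / N) + 0.5 * math.sin(2 * math.pi * 20 * t / N)
         for t in range(N)]
noisy = [s + random.gauss(0.0, 0.5) for s in clean]

denoised = spectral_gate(noisy, threshold_ratio=0.25)
print(f"RMS error before: {rms_error(noisy, clean):.3f}, after: {rms_error(denoised, clean):.3f}")
```

This works well here precisely because the test signal is a few steady sines; speech or music spreads its energy over many bins, so a simple threshold like this would be far less effective.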

BriHar · August 13, 2014, 7:03am

This same technique has been used for quite some time by spy agencies.
I remember too a demonstration where they placed a tiny paper sticky dot on a window of a room where the blinds had been drawn. The window acting like a large diaphragm, they were able to record the tiny vibrations by monitoring the dot and listen in on the conversation taking place within.

HowlingUlf · August 13, 2014, 10:51am

HA!

Lots of fun additions in this topic!

peakae · August 13, 2014, 4:26pm

So this means we can reconstruct sounds from old silent movies, oh wait, too slow a shutter speed

alexis · August 13, 2014, 5:00pm

Great thought, peakae!

No problem, it can be easily done!

Integrate the response of multiple objects in the frame. Even with a slower frame rate, this should help improve the S/N, by eliminating other nonsensical solutions to the equation that would arise from analyzing the motion of only one object. That, plus a little inter-peak reconstruction, and we’ll be able to hear every vowel and regional accent!

HowlingUlf · August 13, 2014, 10:09pm

Then use some noise print capable program and remove unwanted stuff and it should sound better than the original? Well, maybe not haha!