View Conference 2018: What do holophonic, ambisonic and spatialized audio mean?
Today at View Conference 2018 I attended a very interesting workshop by Gianni Ricciardi and Matteo Milani about immersive sound. I absolutely wanted to attend it because I know that audio is very important for immersion in virtual reality, yet it is a field where I have a big gap in my knowledge. And today, thanks to the great lesson by the two teachers Ricciardi and Milani, I've finally started grasping something.
The most interesting part of the talk for me was the beginning, when Gianni Ricciardi clarified a lot of things for me regarding the terminology used for immersive sound. In particular, he made me understand that there is a deep parallelism between the technologies developed for visuals and the ones developed for audio. Let me explain.
The first visual reproductions of the world that we humans created were monoscopic (e.g. photos, paintings, 2D screens, etc.). At the same time (notice: "at the same time" doesn't mean in the same temporal moment… I'm just making a comparison), the first audio recordings were mono, with the same signal sent to both the left and the right ear (imagine an old LP listened to through headphones, for instance). This is certainly not realistic at all, but it is fine for many applications (we still use 2D screens a lot!).
Then humans started experimenting with stereoscopy, and we had the first stereoscopes: objects similar to present-day cardboard viewers that let you see a special photograph, shot from two different points of view, as a somewhat three-dimensional image.
In the same way, we created the first stereo headphones for stereo recordings: you had your headphones with the L and R letters on them, and they let you hear different sounds with each ear, for improved realism. Finally, your ears were able to hear different things, as happens in everyday life. And just as the brain is able to fuse the two images of the stereoscope into a 3D object, it is able to fuse the sounds arriving at the two ears into a realistic 3D impression.
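To make this concrete, here is a minimal Python sketch of mine (an illustration, not something shown at the workshop) of the classic constant-power pan law, the simplest way to turn a mono signal into stereo while keeping the perceived loudness roughly constant as the source moves between the ears:

```python
import numpy as np

def constant_power_pan(mono: np.ndarray, pan: float) -> np.ndarray:
    """Pan a mono signal into stereo.

    pan: -1.0 = full left, 0.0 = center, +1.0 = full right.
    The cos/sin pair keeps left^2 + right^2 constant, so the
    perceived loudness stays stable as the source moves.
    """
    angle = (pan + 1.0) * np.pi / 4.0  # map [-1, 1] to [0, pi/2]
    left = np.cos(angle) * mono
    right = np.sin(angle) * mono
    return np.stack([left, right], axis=-1)

# Example: one second of a 440 Hz tone, panned slightly to the left.
sr = 48000
t = np.linspace(0.0, 1.0, sr, endpoint=False)
tone = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)
stereo = constant_power_pan(tone, pan=-0.4)  # shape (48000, 2)
```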
Improving on the previous model, people started developing systems to record professional stereoscopic videos, the ones used in 3D cinema (yes, that thing that gets hyped every ten years and then every one of us forgets again). At the same time, people started developing systems to record professional stereo audio, thanks to special dummy heads with two microphones placed exactly where humans have their eardrums. As you can imagine, these special heads record the audio exactly as a human would perceive it. This approach is called holophony, and it produces high-quality audio recordings… with the problem that they are fixed in space.
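The digital cousin of the dummy head is binaural rendering: you convolve a dry mono signal with a pair of head-related impulse responses (HRIRs) measured on such a head. Below is a minimal sketch of the idea; the toy "HRIRs" are just a level difference plus a small delay, standing in for real measured responses:

```python
import numpy as np
from scipy.signal import fftconvolve

def binauralize(mono, hrir_left, hrir_right):
    """Render a mono signal binaurally by convolving it with a
    left/right head-related impulse response (HRIR) pair."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Toy "HRIRs": the near ear gets the sound loud and early, the far
# ear quieter and ~0.6 ms later (real HRIRs are measured responses,
# hundreds of samples long, different for every source direction).
hrir_left = np.zeros(64); hrir_left[0] = 1.0
hrir_right = np.zeros(64); hrir_right[30] = 0.5  # 30 samples at 48 kHz

mono = np.random.randn(48000)  # one second of noise at 48 kHz
stereo = binauralize(mono, hrir_left, hrir_right)
```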
Think about the parallelism: 3D cinema is not bad, but the problem is that the recording is seen from a fixed position in space, that of the camera. If you try to move or rotate your head while watching a 3D movie, the 3D objects don't follow your movements in a coherent way. With holophonic sound it is the same: the quality is great, it is perfectly stereo, but if you move your head, the sound stays fixed.
This is bad because our brain relies a lot on head movements to localize sounds: even if we don't realize it, we often move our head (even slightly), and by exploiting the differences in perception that these movements produce, we are able to localize the positions of sounds. And according to Mr. Ricciardi, we are great at this: we can locate sounds with an accuracy on the order of centimeters. But if a sound doesn't react properly to our head movements, we are no longer good at localizing it: that's why, while listening to a holophonic recording, we can't tell exactly where a sound comes from. For instance, we can't reliably detect whether a sound comes from behind us or from in front of us… we can detect left or right sources because of the difference in volume between our ears (a sound from the left is perceived more strongly in the left ear), but we can't detect well the position of sounds that have similar volumes in both ears. So, this is a cool technology, but not cool enough. Being a technology from the 60s, this is understandable.
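As a small aside of mine (not from the talk): besides the volume difference, the other main left/right cue is the interaural time difference, and Woodworth's classic spherical-head formula approximates it. Sources on the same "cone of confusion" (e.g. mirrored front/back positions) produce nearly identical interaural differences, which is exactly why we need head movements to tell front from back:

```python
import numpy as np

HEAD_RADIUS_M = 0.0875   # average human head radius (assumed value)
SPEED_OF_SOUND = 343.0   # m/s in air at about 20 degrees C

def interaural_time_difference(azimuth_deg: float) -> float:
    """Woodworth's spherical-head approximation of the interaural
    time difference (ITD) for a far-field source.

    azimuth_deg: 0 = straight ahead, 90 = directly to one side.
    Returns the arrival-time difference between the ears, in seconds.
    """
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (np.sin(theta) + theta)

# A source directly to the side arrives ~0.66 ms earlier at the near
# ear -- together with the level difference, this is the brain's main
# left/right localization cue.
print(f"{interaural_time_difference(90.0) * 1e3:.2f} ms")
```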
Later on, we invented 360 videos: everything that a user can see from their position gets projected onto a sphere around them, so they can rotate their head and still see coherent visuals. This is a great improvement over the fixed pose of 3D cinema: the user still has a fixed position, but at least they can rotate their head. The recording is usually made with special cameras, sometimes composed of arrays of cameras with different orientations. In a similar way, we have ambisonic sound. Ambisonic sound is recorded by a special microphone that is internally composed of multiple capsules (usually four) pointing in different directions. The resulting recording contains all the sounds perceived from that point, together with information about the direction of each sound. It is like having a "sphere of sounds" recorded around the user.
Ambisonic recordings are great because they have very high quality and are flexible and easy to manipulate: for instance, this "sphere" of sound can be rotated in real time to follow the head rotation of the user, obtaining an effect similar to the one visually obtained with a 360 photo, which shows you a different portion of the image when you rotate your head. The disadvantages of ambisonics are that the resulting files are very big and that there is only one single sweet spot: the recording feels real only from a single position… exactly as with 360 videos, if you move your head through space, what you perceive doesn't change and the magic gets broken.
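To give an idea of how this rotation works, here is a small sketch of mine (assuming the common ACN channel order W, Y, Z, X with SN3D normalization; conventions vary between formats): it encodes a mono source into first-order ambisonics and rotates the whole sound field around the vertical axis, which is exactly what a head-tracked 360 player does in real time.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono source into first-order ambisonics
    (ACN channel order W, Y, Z, X; SN3D normalization)."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    w = mono * 1.0                       # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)   # left/right component
    z = mono * np.sin(el)                # up/down component
    x = mono * np.cos(az) * np.cos(el)   # front/back component
    return np.stack([w, y, z, x])

def rotate_yaw(bformat, yaw_deg):
    """Rotate the recorded sound field around the vertical axis.
    To follow a listener's head, rotate by the negative of the
    head yaw: only X and Y mix, W and Z are unaffected."""
    a = np.radians(yaw_deg)
    w, y, z, x = bformat
    x_rot = x * np.cos(a) - y * np.sin(a)
    y_rot = x * np.sin(a) + y * np.cos(a)
    return np.stack([w, y_rot, z, x_rot])

# A source 30 degrees to the left; if the listener turns their head
# 30 degrees left, rotating the field by -30 brings it back to front.
src = encode_foa(np.random.randn(1000), azimuth_deg=30, elevation_deg=0)
compensated = rotate_yaw(src, yaw_deg=-30)
```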
Matteo Milani explained some technical properties of ambisonic sound very well. To make things easier, he said that it can be decomposed at different levels of quality. Usually, ambisonics has 4, 9 or 16 channels (corresponding to first-, second- and third-order ambisonics), and the higher the number of channels, the higher the spatial resolution and quality of the sound (and the bigger the file, of course). This means that with 4-channel ambisonics you have only a rough idea of where a sound comes from, while with a 16-channel recording you can detect the position of every sound source quite precisely.
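The channel counts follow a simple rule that I'm adding here for completeness: a full-sphere ambisonic recording of order N needs (N+1)² channels.

```python
def ambisonic_channels(order: int) -> int:
    """Channels in a full-sphere ambisonic recording of a given
    order: (order + 1) squared."""
    return (order + 1) ** 2

for order in (1, 2, 3):
    print(order, ambisonic_channels(order))  # prints 1 4, 2 9, 3 16
```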
Given this similarity, ambisonics is usually employed together with 360 videos, so that from a single vantage point you perceive credible visuals and sound. Facebook is investing in the tech and has even created a proprietary ambisonic format, called TBE, that has 8 channels: it gives more precision to the horizontal component of sound and less to the vertical one (since, while watching a 360 video on Facebook, you usually rotate your head left or right and very seldom look up or down), and so it is able to save streaming bandwidth.
YouTube also offers the ability to upload ambisonic sound together with 360 videos. Both Facebook and YouTube additionally let you upload a simple stereo track alongside the video. This is useful for sounds, like the soundtrack, that don't need a position in space and that would actually be annoying if they had one: they just need to be heard the same way at all times. Offering both types of track is a very smart move by these tech companies.
Gianni Ricciardi also showed us how he added the ambisonic audio tracks to a video he had just made for the platform Within. Facebook's plugin suite, the Facebook 360 Spatial Workstation, offers very simple tools to add spatialized sound to 360 videos by just placing the audio sources on the equirectangular projection of the video. There are also facilities that help the composer by automatically moving an audio source to follow a visual element, so that, for instance, the source for the voice of an actor can follow the movements of the actor's mouth and remain coherent over time. Gianni told us that while creating 360 audio, he still prefers working with the 2D version of the program on screen and then putting on the Rift just for the final adjustments, because continuously taking the headset on and off is annoying.
The final step for immersion in VR is room scale: you have a 6 DOF environment (a real-time rendered environment, or a volumetric video) and you can freely move inside it. In audio, this is possible thanks to real-time 3D audio engines, which are able to produce binaural spatialized audio. Just as you can go inside a scene in a game engine and place 3D objects there, you can also place 3D audio sources, and then, when you play the game, some real-time magic lets you hear those sounds realistically, taking into account your position and your orientation. This way you can hear the audio exactly as in real life.
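That "real-time magic" boils down to per-frame, per-source math. Here is a deliberately simplified sketch of mine (real engines add HRTF filtering, Doppler, reverb and much more): compute a distance-based gain and the listener-relative direction of each source.

```python
import numpy as np

def spatialize(listener_pos, listener_forward, source_pos, min_distance=1.0):
    """Toy version of what a 3D audio engine computes per source,
    per frame: a distance attenuation gain plus the direction of
    the source relative to where the listener is facing."""
    to_src = np.asarray(source_pos, float) - np.asarray(listener_pos, float)
    distance = np.linalg.norm(to_src)
    # Inverse-distance law, clamped so the gain never exceeds 1.
    gain = min_distance / max(distance, min_distance)
    # Signed azimuth between the listener's forward axis and the
    # source, on the horizontal plane (positive = to the left).
    fwd = np.asarray(listener_forward, float)
    cross_z = fwd[0] * to_src[1] - fwd[1] * to_src[0]
    dot = fwd[0] * to_src[0] + fwd[1] * to_src[1]
    return gain, np.degrees(np.arctan2(cross_z, dot))

# Listener at the origin facing +X, a radio 4 m away on their left:
gain, azimuth = spatialize([0, 0, 0], [1, 0, 0], [0, 4, 0])
print(f"gain={gain:.2f}, azimuth={azimuth:.0f} deg")  # gain=0.25, azimuth=90 deg
```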
This is possible thanks to various tools. If you work with Unity, for instance, the engine already offers some spatialization features out of the box. Plugins like Steam Audio, Valve's 3D audio plugin, can offer really cool features to increase realism, such as simulating the sound waves bouncing off the various objects of the scene. But to reach top realism, it is better to use a dedicated audio engine like FMOD, which can work together with Unity to produce top-notch 6 DOF audio. This kind of engine lets you do a lot of customization: for instance, a wall of a house can block the sound produced by a radio inside the bedroom, so that standing behind the bedroom wall you hear it heavily attenuated.
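A crude version of that wall effect can be faked in a few lines, as in this sketch of mine (engines like FMOD and Steam Audio do this properly by tracing the scene geometry; here occlusion is reduced to a boolean flag): an occluded source is attenuated and low-pass filtered, because walls absorb high frequencies much more than low ones.

```python
import numpy as np

def apply_occlusion(signal, occluded, gain=0.3, alpha=0.15):
    """Fake occlusion: when a wall sits between source and listener,
    drop the level and low-pass the signal with a one-pole filter
    (a muffled radio behind a wall keeps its bass, loses its highs)."""
    if not occluded:
        return signal
    out = np.empty_like(signal, dtype=float)
    y = 0.0
    for i, sample in enumerate(signal):
        y += alpha * (sample - y)   # one-pole low-pass
        out[i] = gain * y
    return out

radio = np.random.randn(48000)                 # one second of "radio" noise
muffled = apply_occlusion(radio, occluded=True)
```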
Ambisonics and 6 DOF audio sources can work together in the same project: just as a rendered world can have a 360 photo as a skybox, a world with spatialized audio can have an ambisonic track providing the ambient sound.
6 DOF audio is the future, but of course it is very complicated to produce in a realistic way. A lot of work is still needed before we reach perfect audio emulation.
And that's it for this journey through the various types of audio: I really want to thank Gianni Ricciardi and Matteo Milani a lot for all the useful info they gave me. They also let me try a lot of little apps to convey the ideas better, and they gave me a lot of technical details that I haven't fully understood (but I nodded my head anyway, so as not to seem stupid :D).
Today has been a very interesting day… I hope that tomorrow at View Conference will be even better!
Disclaimer: this blog contains advertisement and affiliate links to sustain itself. If you click on an affiliate link, I'll be very happy because I'll earn a small commission on your purchase. You can find my boring full disclosure here.