Over the last two years, we have all witnessed the rise of Augmented Reality and Virtual Reality as technological tools, used for training, entertainment, and much more. These visual media have a tremendous impact on how we perceive the world, both “actual” and digital. But even these powerful visual media fall flat without one very powerful sensory input: sound. Ever since “talkies” replaced silent film, producers of media have understood that sound provides as many cues as the visuals, or even more. Whether we are talking about traditional film, television, or the new media of virtual reality, none of them would hold our collective attention for very long without strong audio production. The question we seek to answer is “why is sound so important to a VR scene?” To that end, let’s examine some of the reasons humans rely on this “secondary” sensory input to create convincing new “realities”. (I know the sarcasm in my use of the term “secondary” does not come across in writing, but trust me, it is there!)
Let’s start by looking at some of the basic differences between the way humans perceive visuals and the way we perceive sound. First, and some might argue foremost, there is the vast difference in how the human brain processes sight versus sound. To date, the fastest our brains have been shown to process an image is about 13 milliseconds. But truth be told, while our brains might recognize a specific image in that time, there is still a good bit of processing going on before our minds actually perceive and identify the image in question. Take film, for example. Film uses a “frame rate” of 24 frames per second to create seemingly smooth motion of objects on the screen, about 41.7 milliseconds per frame, while modern TVs use 30 or 60 frames per second, or about 33.3 and 16.7 milliseconds per frame, respectively. Most people do not perceive the individual frames; the image as a whole appears to move smoothly, without noticeable flickering. While the human eye can perceive a single image in a mere 13 ms, the human ear responds to pressure variations that repeat up to roughly 20,000 times per second. We perceive these “events” as specific frequencies of sound, and we perceive these sounds with an astonishing level of directional accuracy. The human eye might perceive timing in the visual cue of a flashing light, but the human ear’s timing is so precise that we are able to perceive the direction, approximate distance, and relative motion of a sound with far greater precision than with the eyes alone. The audio timing cues are so precise that differences between the two ears on the order of ten microseconds (a microsecond is 1/1000th of a millisecond) can be distinguished with relatively little “ear training”.
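For anyone who wants to check the numbers, here is a rough sketch of the arithmetic behind the frame times above, plus an estimate of how large the timing difference between the two ears actually is. The ear-timing part is my own illustration, not a measurement: it assumes a simple spherical-head approximation (Woodworth's formula), a head radius of 8.75 cm, and a speed of sound of 343 m/s, and the function names are made up for this example.

```python
import math


def frame_duration_ms(fps: float) -> float:
    """Time a single frame stays on screen, in milliseconds."""
    return 1000.0 / fps


def woodworth_itd_us(
    azimuth_deg: float,
    head_radius_m: float = 0.0875,     # assumed average head radius
    speed_of_sound_m_s: float = 343.0  # assumed, air at about 20 C
) -> float:
    """Approximate interaural time difference in microseconds.

    Uses Woodworth's spherical-head approximation,
    ITD = (r / c) * (theta + sin(theta)),
    intended for sources between 0 (straight ahead) and 90 degrees (fully to one side).
    """
    theta = math.radians(azimuth_deg)
    itd_seconds = (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))
    return itd_seconds * 1e6


if __name__ == "__main__":
    for fps in (24, 30, 60):
        print(f"{fps} fps -> {frame_duration_ms(fps):.1f} ms per frame")

    for azimuth in (1, 10, 45, 90):
        print(f"source at {azimuth:>2} degrees -> ITD of about {woodworth_itd_us(azimuth):.0f} microseconds")
```

Under these assumptions, a sound just one degree off center reaches one ear roughly 9 microseconds before the other, and a sound directly to one side arrives about 656 microseconds earlier. Even that largest gap is well under a millisecond, far shorter than a single 16.7 ms video frame.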
OK, great, so now we have just a few of the basic differences between these two incredible senses. What does that mean when it comes to developing “virtual reality” projects? The first thing we need to understand is that in creating a virtual reality scene, we are trying to immerse the user in a world that does not exist outside of the digital realm. To create a convincing scene, we must create all the sensory input that people have come to expect from their surroundings. This is a critical concept. While people may not always be able to articulate the reason a particular scene is not convincing, they will almost always be able to tell that “something” is missing or out of place. Here is an example of what I am talking about. I was recently playing a VR game that was not very well implemented. (I am going to withhold the title because I have no desire to damage the reputation of the publisher.) The game had plenty of Foley-type effects, with the clash of steel on steel during sword fights and the “thunk” of the bow when firing arrows, but a lot of cues were missing. The most noticeable was the simple sound of shuffling feet. This might seem minor, but when you are walking through a virtual dungeon, you expect to hear the shuffling of a zombie or skeleton long before you ever see it. Those sound effects were not there, and I was caught unaware by an attack on more than one occasion. Then there was the complete lack of environmental sounds. I could “see” the water dripping and running in small rivulets on stone, but I could not hear any associated sound. Again, this seems minor, but if you really pay attention, I doubt you will find any environment that is silent, outside of an anechoic chamber. These small misses added up to one very unconvincing environment.
An example from the other end of the spectrum was a demonstration of a medical “Mechanism of Action” VR experience developed by a friend of mine. The experience puts the user inside the brain and demonstrates how scientists believe a migraine headache develops. It was nothing short of amazing. Every visual element had an audio cue. But, more importantly, there were cues for objects and systems that I knew were there but could not see. When I test and review these experiences, I will often just listen with my eyes closed and try to determine whether the environmental audio paints an accurate picture for my mind without the visual cues. Very faint heartbeat in the background? CHECK! The subtle sound of blood moving through vessels? CHECK! Subtle electrical discharges in the distance to simulate brain activity? CHECK! The list goes on, but I think you get the point. This experience was very convincing, not because of what I could see, which was impressive in itself, but because of what I could not see yet KNEW was there, thanks to the subtle and well-produced audio cues. In one instance, I could hear a soft scratching sound, almost like chewing, off to my left. I turned my head to the left, and there was a white blood cell destroying and devouring a virus. The audio cue was convincing enough to get me to turn my head to look for the source of the sound.
These simple examples demonstrate the need to consider everything in an environment when producing VR experiences. Visuals are incredibly important, but without audio cues, the production lacks context and the entire experience feels flat and incomplete. You should be able to create an entire scene strictly from the audio in your production, without a single image. Make that happen, then add well-produced visuals, and you will have a winner almost every time!