Expert Speak Digital Frontiers
Published on Aug 25, 2020
The role of human voice in the communication of digital disinformation

Without doubt, extremism and xenophobia have been exacerbated in today’s world by the availability of pervasive communication. Rhetoric that incites these problems rides the wings of digital communication technology, reaching vulnerable minds more quickly and more impactfully. Worse, it allows their reactions to be swiftly synergised and directed towards targets. Much like a freely-spreading global pandemic, uncensored disinformation infects and consumes all who are psychologically vulnerable – anyone who cares to listen.

The Sensory Impact of Sound

People consume digital information through four means: audio, video, images and text. Each modality is perceived and processed by the brain through different sensory pathways. Each can independently create impressions and lasting memories, incite the mind, arouse feelings and thoughts, and cause those to be translated into action. However, each modality, with equal exposure, affects the mind to different degrees.

When it comes to conveying complex ideas and ideologies, human speech (and the human voice) is by far the most powerful influencer. Textual spread is limited by literacy and other issues, and it is hard to convey cogent and complex ideologies through visual imagery.

How sound and the human voice influence the brain

Sound has a profound effect on our perception of the world: of people, their interactions and intentions, situations and our environment<1>.

The messages conveyed by images and written text, when accompanied by sound, are intensified and driven deeper into the human psyche. When images and text are accompanied by sound, the areas of the brain that are activated amount to greater than the sum of the unimodal processing areas—the auditory cortex, visual cortex and areas of the frontal cortex that process linguistic information (both spoken language and text activate the linguistic processing areas). Simultaneous multisensory inputs shift activations from sensory-specific (“unimodal”) cortices of the brain into multisensory (“heteromodal”) areas of the brain, and at the same time, each type of sensory input actually affects the unimodal processing areas of the brain as well<2><3>.

While this applies to all stimuli, other studies show that the effect of sound is profound on our perceptions about co-occurring stimuli. Sound can not only enhance, but alter our perception of the world, even when nothing else changes in it<4><5><6><7>.

For instance, an online article on police brutality may carry an image of people running (only showing their backs), with the caption ‘People dashing to shelter from charging police’. If the event seems unlikely, the reader could question the image, the authenticity of the report, and whether the two are actually related. However, when the same scene is presented with sounds<8><9> of police sirens and of people shouting situation-relevant words, the retrospective reaction is more likely to be visceral anger directed at the police. The presentation may go unquestioned. And yet, objectively, there was never a guarantee that the sounds and the visual scene ever co-occurred. In general, people question acoustic evidence much less than textual or visual evidence, because sound, being invisible and pervasive, is not subconsciously “observed” well enough to be learned about.

Humans routinely form impressions about other people based on their voices. These impressions profoundly influence how they interact with and react to them. Human voice is also a powerful mass stimulus, and can instigate mass responses and mass hysteria more intensely than other kinds of presentation<10>. For instance, the rabid responses that Adolf Hitler’s speeches to the Nazi Party evoked in their audiences are evident in historical recordings. It has been conjectured that in addition to other factors, the conversion of an otherwise decent society to a murderous, xenophobic one was powerfully driven, compounded and exponentiated by Hitler’s persuasive speeches, that is, by his vocal delivery of his ideologies.

Humans are highly susceptible to sounds. While in some cases a picture is worth a thousand words, in others, a spoken sentence may well be worth a thousand pictures.

Voice and Disinformation

It is not just the content of speech, but the sound of the human voice, that carries this potential of driving the human psyche. With high likelihood, Hitler with a comic voice (for instance, that of Mickey Mouse) may not have succeeded in driving the Nazis to extremes.

To comprehend the role of the human voice in digital disinformation, it is important to realise how pervasive human speech is in the digital world. As of October 2019, people watched five billion videos each day on YouTube alone<11>. Similar content is delivered over many other channels on the internet, digital communication lines and radio, propelling the uptake of audio and audio-visual media into additional billions each day. Social media channels add to these numbers. 5.2 million users watched branded Instagram videos during the first quarter of 2017 alone<12>.

An unquantifiable, but increasingly large fraction of this volume of media supports disinformation.

The consequences of the deceptions carried out through speech are comparable to physical crimes. They range from psychological illness, and financial ruin caused by talking people into glib financial schemes, to actual death caused by misuse of commodities that are convincingly touted as safe, or about which crucial information is withheld.

What emerges from such analyses is the (surprising) realisation that the originator of an insurgent ideology is just the spark. The communication channels are the dry down – the forests that convey the fire to those around. Operationally, the problem of the spread of insurgence lies not with the originators and recipients of disinformation (because humans will behave in conditioned ways) but with the messenger.

A deluge of unfettered and unrestricted information is ferried around by the messenger each day, and it is the messenger that must be curbed, technologically. We discuss this next.

Technological Solutions

Psychologically, insurgence and radicalism stem from existential fears driven by complex socioeconomic factors that put individuals and groups at significant disadvantages from early life. Technological advances alone do not help these groups; there is enough evidence to show that they only serve to shift the requirements of global job markets, pushing already disadvantaged groups further down the socioeconomic ladder. These are serious problems and must be addressed at the root to truly dissolve extremism.

However, these socioeconomic issues can be set aside to think purely (robotically) in terms of the technological containment of this scourge, for no reason other than these are the only solutions that are clear and immediately actionable.

With a focus again on speech, the important question is how to curb the free spread of insurgent rhetoric. The pachyderm in the room here is the messenger – the medium that operates in an uncurated manner. The simplest solution might be to render all spoken communication Mickey Mouse-ish unless properly curated. We expect that this would curb the effect of spoken rhetoric<13> on the masses, to the point of making it completely insipid and ineffective. However, this is not an acceptable solution in the real world. More practical (and acceptable) potential gatekeeper technologies are mentioned below.

Speech recognition

Speech recognition technologies deal with the transcription of recorded speech. While the transcription of spontaneous (freely spoken) speech and of speech in high-noise environments remains an unsolved problem, automatic speech recognition systems are rapidly getting better at both. Based on transcribed speech, databases for different levels of fact-checking can be built. Once speech is converted to textual form, other powerful natural language processing and natural language understanding technologies can help curate content.
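
As a rough illustration of how transcribed speech might be checked against such a database, the sketch below compares each sentence of a transcript with previously checked claims using simple word overlap (Jaccard similarity). Everything in it is an assumption made for illustration: the function names, the 0.6 threshold, the sample claim and its verdict are hypothetical, and a real system would rely on far more capable language understanding than word overlap.

```python
import re


def tokens(text):
    """Lower-case a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_transcript(transcript, checked_claims, threshold=0.6):
    """Return (sentence, matched claim, verdict) for sentences that
    closely resemble a claim already present in the database."""
    flags = []
    # Split the transcript into sentences at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", transcript.strip()):
        sentence_tokens = tokens(sentence)
        for claim, verdict in checked_claims.items():
            if jaccard(sentence_tokens, tokens(claim)) >= threshold:
                flags.append((sentence, claim, verdict))
    return flags


# Hypothetical database of checked claims and a hypothetical transcript.
checked_claims = {"the product is completely safe for children": "disputed"}
transcript = "Welcome back everyone. The product is completely safe for children!"
flags = flag_transcript(transcript, checked_claims)
```

Here only the second sentence of the transcript is flagged, because it matches the stored claim word for word; the greeting shares no words with it.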

Speaker verification and identification

Speaker verification and identification technologies are based on matching of voiceprints to those present in carefully curated voiceprint databases. In speaker verification, the identity of the speaker is given at the outset, and voice matching verifies the authenticity of the claim. In speaker identification, the identity of the speaker is unknown at the outset and is found by matching the given voiceprint to those present in a database. Misrepresentation through machine-generated and human voices (propaganda from fake sources) can be monitored through the use of these technologies.
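
The difference between the two tasks can be sketched in a few lines. The example below assumes that each voiceprint has already been reduced to a fixed-length numeric vector and that two voiceprints are compared by cosine similarity; both are illustrative assumptions rather than a description of any particular system. The function names, the three-dimensional toy vectors and the 0.8 threshold are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_speaker(voiceprint, claimed_id, database, threshold=0.8):
    """Verification: the identity is claimed up front, so only one
    comparison is made, against the claimed speaker's enrolled voiceprint."""
    enrolled = database.get(claimed_id)
    if enrolled is None:
        return False
    return cosine_similarity(voiceprint, enrolled) >= threshold


def identify_speaker(voiceprint, database, threshold=0.8):
    """Identification: the identity is unknown, so the voiceprint is
    compared with every enrolled speaker and the best match is returned
    (or None if no match reaches the threshold)."""
    best_id, best_score = None, threshold
    for speaker_id, enrolled in database.items():
        score = cosine_similarity(voiceprint, enrolled)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


# Hypothetical curated database of enrolled voiceprints.
database = {
    "speaker_a": [0.9, 0.1, 0.3],
    "speaker_b": [0.1, 0.8, 0.5],
}
sample = [0.85, 0.15, 0.35]

claim_a_ok = verify_speaker(sample, "speaker_a", database)
claim_b_ok = verify_speaker(sample, "speaker_b", database)
identified = identify_speaker(sample, database)
```

With these toy values, the sample is accepted as speaker_a, rejected as speaker_b, and identified as speaker_a. A voice claiming to come from a known source but failing verification is the kind of misrepresentation the text describes.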

Speech paralinguistics

The word paralinguistics stems from the word paralanguage (or vocalics). It refers to those aspects of speech-based communication that qualify or alter the meaning of the spoken words, and may convey the emotion, feelings or intent of the speaker to the listener without explicit verbalisation. Paralinguistics can help curate speech at a meta level by flagging it for the presence of different (incongruent) emotions, lies and other tactical deceptions carried out through speech. These technologies are limited in accuracy, since they use data labelled by humans to learn from, and humans can only label as accurately as they judge or perceive.

Voice profiling

Voice profiling involves the analysis of human voice to deduce a plethora of information about the speaker<14>. Human voice is a powerful bio-parametric indicator. It carries information that can be linked to current (referring to the time of recording of the voice) physical, physiological, demographic, sociological, medical, psychological and other characteristics of the speaker, and to the speaker’s environment.

Voice profiling technologies can also inform us about those aspects of voice that can be altered to render the content more benign, without affecting the perceived quality of speech. These technologies can perhaps help us implement the Mickey Mouse solution in an insidious (and acceptable) manner – to take away the confidence, control and leadership related qualities embedded in voice, rendering it less effective.


From an acoustic perspective, the only actionable strategy to tackle the spread of insurgent ideologies seems to be the preposterous equivalent of killing the messenger, instead of trying to modify the factors that condition the originators to send the message and the recipients to absorb the message.

Voice content delivery channels must be analysed objectively for their potential impact in spreading inciteful content. Voice technologies must then be judiciously applied to curate the most harmful ones. Channels that deliver the greatest acoustic content volume per unit time, and per unit population size, must be held to some minimal regulatory standards for delivering authenticated content, even if the standards merely require them to have made a reasonable, scientifically supported attempt to curate content.
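
One way to make "acoustic content volume per unit time, and per unit population size" concrete is sketched below. It is only an illustration under stated assumptions: volume is measured in hours of voice content, time in days, and population as the number of people a channel reaches. The Channel fields, the function names and the figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    audio_hours: float  # hours of voice content delivered in the window
    window_days: float  # length of the observation window, in days
    audience: int       # number of people reached (assumed measure)


def volume_per_day(channel):
    """Acoustic content volume per unit time (hours per day)."""
    return channel.audio_hours / channel.window_days


def volume_per_day_per_capita(channel):
    """The same rate, normalised by the size of the audience."""
    return volume_per_day(channel) / channel.audience


def rank_channels(channels, key=volume_per_day):
    """Order channels from the highest to the lowest value of the metric."""
    return sorted(channels, key=key, reverse=True)


# Hypothetical channels observed over one week.
channels = [
    Channel("channel_x", audio_hours=7000.0, window_days=7.0, audience=2_000_000),
    Channel("channel_y", audio_hours=700.0, window_days=7.0, audience=50_000),
]

by_volume = [c.name for c in rank_channels(channels)]
by_capita = [c.name for c in rank_channels(channels, key=volume_per_day_per_capita)]
```

In this toy case the two metrics disagree: channel_x delivers more audio per day overall, while channel_y delivers more per listener. Which metric should trigger a regulatory standard is a policy choice that the sketch does not settle.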


<1> John Neuhoff, “Ecological psychoacoustics”, Brill & The Hague Academy of International Law, July 6, 2004.

<2> Emiliano Macaluso and Jon Driver, “Multisensory spatial interactions: a window onto functional integration in the human brain”, Trends in Neurosciences 28, no. 5 (2005): 264-271.

<3> Jon Driver and Charles Spence, “Multisensory perception: beyond modularity and convergence”, Current Biology 10, no. 20 (2000): R731-R735.

<4> Ladan Shams, Yukiyasu Kamitani, Samuel Thompson, and Shinsuke Shimojo, “Sound alters visual evoked potentials in humans”, Neuroreport 12, no. 17 (2001): 3849-3852.

<5> Tony Ro, Johanan Hsu, Nafi E. Yasar, L. Caitlin Elmore, and Michael S. Beauchamp, “Sound enhances touch perception”, Experimental Brain Research 195, no. 1 (2009): 135-143.

<6> Bernhard E. Riecke, Daniel Feuereissen, John J. Rieser, and Timothy P. McNamara, “Spatialized sound enhances biomechanically-induced self-motion illusion (vection)”, in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011, 2799-2802.

<7> Katsumi Watanabe and Shinsuke Shimojo, “When sound affects vision: effects of auditory grouping on visual motion perception”, Psychological Science 12, no. 2 (2001): 109-116.

<8> Ladan Shams and Robyn Kim, “Crossmodal influences on visual perception”, Physics of Life Reviews 7, no. 3 (2010): 269-284.

<9> Casey O’Callaghan, “Perception and multimodality.” Oxford Handbook of Philosophy of Cognitive Science (2012): 92-117.

<10> Jonathan Leader Maynard and Susan Benesch, “Dangerous speech and dangerous ideology: An integrated model for monitoring and prevention”, Genocide Studies and Prevention 9, no. 3 (2016): 70-95.

<11> Mitja Rutnik, “YouTube in numbers: Monthly views, most popular video, and more fun stats!”, Android Authority, August 11, 2019.

<12> “The Top 10 Instagram Video Statistics Marketers Should Know”, MediaKix, December 17, 2018.

<13> Casey A Klofstad, Rindy C Anderson, Susan Peters, “Sounds like a winner: voice pitch influences perception of leadership capacity in both men and women”, Proceedings of the Royal Society B: Biological Sciences 279, no. 1738 (2012): 2698-2704.

<14> Rita Singh, Profiling Humans from their Voice. Singapore: Springer, 2019.

The views expressed above belong to the author(s).