Advances in Understanding the Phenomena and Processing in Audiovisual Speech Perception

A special issue of Brain Sciences (ISSN 2076-3425). This special issue belongs to the section "Neurolinguistics".

Deadline for manuscript submissions: closed (31 May 2023) | Viewed by 16200

Special Issue Editor

Department of Psychology and Logopedics, University of Helsinki, Helsinki, Finland
Interests: audiovisual speech perception; multisensory perception; vision; audition; touch; haptics

Special Issue Information

Dear Colleagues,

Audiovisual speech perception is a key topic in multisensory research because speech is a naturally multisensory signal that is crucial for human communication. Despite decades of research, the apparently simple problem of how the brain combines information from the talking face with the speaking voice remains unsolved. There have nevertheless been major advances, such as brain imaging findings showing that regions previously considered unisensory are also activated by other sensory modalities; for example, the auditory cortex is activated by visual speech alone. Computational modeling has established that Bayesian inference accounts for the merging of auditory and visual speech: cues are weighted according to their reliability and combined optimally. Recently, individual differences in audiovisual speech perception have attracted growing interest, and modern research methods make it possible to investigate individuals in addition to the traditional approach of averaging across participants. This Special Issue calls for advances in the understanding of the perceptual phenomena and processing of audiovisual speech, using approaches ranging from behavioral studies to brain imaging and modeling.
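
As a concrete reference for the reliability-weighting idea mentioned above, the standard maximum-likelihood formulation of optimal cue combination can be written as follows (a textbook sketch assuming independent Gaussian noise on each cue, not the specific model of any paper in this Issue; the symbols \hat{s}_A, \hat{s}_V, \sigma_A^2 and \sigma_V^2 are used here purely for illustration and denote the auditory and visual estimates and their noise variances):

    % Reliability-weighted (maximum-likelihood) audiovisual estimate
    \hat{s}_{AV} = w_A \hat{s}_A + w_V \hat{s}_V,
    \qquad w_A = \frac{1/\sigma_A^2}{1/\sigma_A^2 + 1/\sigma_V^2},
    \qquad w_V = \frac{1/\sigma_V^2}{1/\sigma_A^2 + 1/\sigma_V^2},
    % The combined variance is never worse than that of the better single cue
    \qquad \sigma_{AV}^2 = \frac{\sigma_A^2 \sigma_V^2}{\sigma_A^2 + \sigma_V^2} \le \min(\sigma_A^2, \sigma_V^2).

In words, the noisier (less reliable) cue receives the smaller weight, and the combined estimate is at least as reliable as the better of the two unisensory estimates.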

Dr. Kaisa Tiippana
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Brain Sciences is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2200 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audiovisual
  • speech
  • perception
  • vision
  • audition
  • multisensory
  • neural processing

Published Papers (13 papers)

Editorial

3 pages, 196 KiB  
Editorial
Advances in Understanding the Phenomena and Processing in Audiovisual Speech Perception
by Kaisa Tiippana
Brain Sci. 2023, 13(9), 1345; https://doi.org/10.3390/brainsci13091345 - 20 Sep 2023
Viewed by 599
Abstract
The Special Issue entitled “Advances in Understanding the Phenomena and Processing in Audiovisual Speech Perception” attracted a variety of articles written by prominent authors in the field [...] Full article

Research

17 pages, 2857 KiB  
Article
The Benefit of Bimodal Training in Voice Learning
by Serena Zadoorian and Lawrence D. Rosenblum
Brain Sci. 2023, 13(9), 1260; https://doi.org/10.3390/brainsci13091260 - 30 Aug 2023
Cited by 1 | Viewed by 607
Abstract
It is known that talkers can be recognized by listening to their specific vocal qualities—breathiness and fundamental frequencies. However, talker identification can also occur by focusing on the talkers’ unique articulatory style, which is known to be available auditorily and visually and can be shared across modalities. Evidence shows that voices heard while seeing talkers’ faces are later recognized better on their own compared to the voices heard alone. The present study investigated whether the facilitation of voice learning through facial cues relies on talker-specific articulatory or nonarticulatory facial information. Participants were initially trained to learn the voices of ten talkers presented either on their own or together with (a) an articulating face, (b) a static face, or (c) an isolated articulating mouth. Participants were then tested on recognizing the voices on their own regardless of their training modality. Consistent with previous research, voices learned with articulating faces were recognized better on their own compared to voices learned alone. However, isolated articulating mouths did not provide an advantage in learning the voices. The results demonstrated that learning voices while seeing faces resulted in better voice learning compared to the voices learned alone. Full article

13 pages, 3280 KiB  
Article
Investigation of Cross-Language and Stimulus-Dependent Effects on the McGurk Effect with Finnish and Japanese Speakers and Listeners
by Kaisa Tiippana, Yuta Ujiie, Tarja Peromaa and Kohske Takahashi
Brain Sci. 2023, 13(8), 1198; https://doi.org/10.3390/brainsci13081198 - 13 Aug 2023
Cited by 1 | Viewed by 1010
Abstract
In the McGurk effect, perception of a spoken consonant is altered when an auditory (A) syllable is presented with an incongruent visual (V) syllable (e.g., A/pa/V/ka/ is often heard as /ka/ or /ta/). The McGurk effect provides a measure for visual influence on speech perception, becoming stronger the lower the proportion of auditory correct responses. Cross-language effects are studied to understand processing differences between one’s own and foreign languages. Regarding the McGurk effect, it has sometimes been found to be stronger with foreign speakers. However, other studies have shown the opposite, or no difference between languages. Most studies have compared English with other languages. We investigated cross-language effects with native Finnish and Japanese speakers and listeners. Both groups of listeners had 49 participants. The stimuli (/ka/, /pa/, /ta/) were uttered by two female and male Finnish and Japanese speakers and presented in A, V and AV modality, including a McGurk stimulus A/pa/V/ka/. The McGurk effect was stronger with Japanese stimuli in both groups. Differences in speech perception were prominent between individual speakers but less so between native languages. Unisensory perception correlated with McGurk perception. These findings suggest that stimulus-dependent features contribute to the McGurk effect. This may have a stronger influence on syllable perception than cross-language factors. Full article

15 pages, 3752 KiB  
Article
The Processing of Audiovisual Speech Is Linked with Vocabulary in Autistic and Nonautistic Children: An ERP Study
by Kacie Dunham-Carr, Jacob I. Feldman, David M. Simon, Sarah R. Edmunds, Alexander Tu, Wayne Kuang, Julie G. Conrad, Pooja Santapuram, Mark T. Wallace and Tiffany G. Woynaroski
Brain Sci. 2023, 13(7), 1043; https://doi.org/10.3390/brainsci13071043 - 08 Jul 2023
Cited by 1 | Viewed by 1423
Abstract
Explaining individual differences in vocabulary in autism is critical, as understanding and using words to communicate are key predictors of long-term outcomes for autistic individuals. Differences in audiovisual speech processing may explain variability in vocabulary in autism. The efficiency of audiovisual speech processing can be indexed via amplitude suppression, wherein the amplitude of the event-related potential (ERP) is reduced at the P2 component in response to audiovisual speech compared to auditory-only speech. This study used electroencephalography (EEG) to measure P2 amplitudes in response to auditory-only and audiovisual speech and norm-referenced, standardized assessments to measure vocabulary in 25 autistic and 25 nonautistic children to determine whether amplitude suppression (a) differs or (b) explains variability in vocabulary in autistic and nonautistic children. A series of regression analyses evaluated associations between amplitude suppression and vocabulary scores. Both groups demonstrated P2 amplitude suppression, on average, in response to audiovisual speech relative to auditory-only speech. Between-group differences in mean amplitude suppression were nonsignificant. Individual differences in amplitude suppression were positively associated with expressive vocabulary through receptive vocabulary, as evidenced by a significant indirect effect observed across groups. The results suggest that efficiency of audiovisual speech processing may explain variance in vocabulary in autism. Full article

18 pages, 3281 KiB  
Article
The Effect of Cued-Speech (CS) Perception on Auditory Processing in Typically Hearing (TH) Individuals Who Are Either Naïve or Experienced CS Producers
by Cora Jirschik Caron, Coriandre Vilain, Jean-Luc Schwartz, Clémence Bayard, Axelle Calcus, Jacqueline Leybaert and Cécile Colin
Brain Sci. 2023, 13(7), 1036; https://doi.org/10.3390/brainsci13071036 - 07 Jul 2023
Cited by 1 | Viewed by 954
Abstract
Cued Speech (CS) is a communication system that uses manual gestures to facilitate lipreading. In this study, we investigated how CS information interacts with natural speech using Event-Related Potential (ERP) analyses in French-speaking, typically hearing adults (TH) who were either naïve or experienced CS producers. The audiovisual (AV) presentation of lipreading information elicited an amplitude attenuation of the entire N1 and P2 complex in both groups, accompanied by N1 latency facilitation in the group of CS producers. Adding CS gestures to lipread information increased the magnitude of effects observed at the N1 time window, but did not enhance P2 amplitude attenuation. Interestingly, presenting CS gestures without lipreading information yielded distinct response patterns depending on participants’ experience with the system. In the group of CS producers, AV perception of CS gestures facilitated the early stage of speech processing, while in the group of naïve participants, it elicited a latency delay at the P2 time window. These results suggest that, for experienced CS users, the perception of gestures facilitates early stages of speech processing, but when people are not familiar with the system, the perception of gestures impacts the efficiency of phonological decoding. Full article

19 pages, 3878 KiB  
Article
Event-Related Potentials in Assessing Visual Speech Cues in the Broader Autism Phenotype: Evidence from a Phonemic Restoration Paradigm
by Vanessa Harwood, Alisa Baron, Daniel Kleinman, Luca Campanelli, Julia Irwin and Nicole Landi
Brain Sci. 2023, 13(7), 1011; https://doi.org/10.3390/brainsci13071011 - 30 Jun 2023
Cited by 1 | Viewed by 1042
Abstract
Audiovisual speech perception includes the simultaneous processing of auditory and visual speech. Deficits in audiovisual speech perception are reported in autistic individuals; however, less is known regarding audiovisual speech perception within the broader autism phenotype (BAP), which includes individuals with elevated, yet subclinical, levels of autistic traits. We investigate the neural indices of audiovisual speech perception in adults exhibiting a range of autism-like traits using event-related potentials (ERPs) in a phonemic restoration paradigm. In this paradigm, we consider conditions where speech articulators (mouth and jaw) are present (AV condition) and obscured by a pixelated mask (PX condition). These two face conditions were included in both passive (simply viewing a speaking face) and active (participants were required to press a button for a specific consonant–vowel stimulus) experiments. The results revealed an N100 ERP component which was present for all listening contexts and conditions; however, it was attenuated in the active AV condition where participants were able to view the speaker’s face, including the mouth and jaw. The P300 ERP component was present within the active experiment only, and significantly greater within the AV condition compared to the PX condition. This suggests increased neural effort for detecting deviant stimuli when visible articulation was present and visual influence on perception. Finally, the P300 response was negatively correlated with autism-like traits, suggesting that higher autistic traits were associated with generally smaller P300 responses in the active AV and PX conditions. The conclusions support the finding that atypical audiovisual processing may be characteristic of the BAP in adults. Full article

35 pages, 4927 KiB  
Article
Modality-Specific Perceptual Learning of Vocoded Auditory versus Lipread Speech: Different Effects of Prior Information
by Lynne E. Bernstein, Edward T. Auer and Silvio P. Eberhardt
Brain Sci. 2023, 13(7), 1008; https://doi.org/10.3390/brainsci13071008 - 29 Jun 2023
Cited by 2 | Viewed by 951
Abstract
Traditionally, speech perception training paradigms have not adequately taken into account the possibility that there may be modality-specific requirements for perceptual learning with auditory-only (AO) versus visual-only (VO) speech stimuli. The study reported here investigated the hypothesis that there are modality-specific differences in how prior information is used by normal-hearing participants during vocoded versus VO speech training. Two different experiments, one with vocoded AO speech (Experiment 1) and one with VO, lipread, speech (Experiment 2), investigated the effects of giving different types of prior information to trainees on each trial during training. The training was for four ~20 min sessions, during which participants learned to label novel visual images using novel spoken words. Participants were assigned to different types of prior information during training: Word Group trainees saw a printed version of each training word (e.g., “tethon”), and Consonant Group trainees saw only its consonants (e.g., “t_th_n”). Additional groups received no prior information (i.e., Experiment 1, AO Group; Experiment 2, VO Group) or a spoken version of the stimulus in a different modality from the training stimuli (Experiment 1, Lipread Group; Experiment 2, Vocoder Group). That is, in each experiment, there was a group that received prior information in the modality of the training stimuli from the other experiment. In both experiments, the Word Groups had difficulty retaining the novel words they attempted to learn during training. However, when the training stimuli were vocoded, the Word Group improved their phoneme identification. When the training stimuli were visual speech, the Consonant Group improved their phoneme identification and their open-set sentence lipreading. The results are considered in light of theoretical accounts of perceptual learning in relationship to perceptual modality. Full article

13 pages, 1223 KiB  
Article
Deficient Audiovisual Speech Perception in Schizophrenia: An ERP Study
by Erfan Ghaneirad, Ellyn Saenger, Gregor R. Szycik, Anja Čuš, Laura Möde, Christopher Sinke, Daniel Wiswede, Stefan Bleich and Anna Borgolte
Brain Sci. 2023, 13(6), 970; https://doi.org/10.3390/brainsci13060970 - 19 Jun 2023
Cited by 1 | Viewed by 1116
Abstract
In everyday verbal communication, auditory speech perception is often disturbed by background noise. Especially in disadvantageous hearing conditions, additional visual articulatory information (e.g., lip movement) can positively contribute to speech comprehension. Patients with schizophrenia (SZs) demonstrate an aberrant ability to integrate visual and auditory sensory input during speech perception. Current findings about underlying neural mechanisms of this deficit are inconsistent. Particularly and despite the importance of early sensory processing in speech perception, very few studies have addressed these processes in SZs. Thus, in the present study, we examined 20 adult subjects with SZ and 21 healthy controls (HCs) while presenting audiovisual spoken words (disyllabic nouns) either superimposed by white noise (−12 dB signal-to-noise ratio) or not. In addition to behavioral data, event-related brain potentials (ERPs) were recorded. Our results demonstrate reduced speech comprehension for SZs compared to HCs under noisy conditions. Moreover, we found altered N1 amplitudes in SZ during speech perception, while P2 amplitudes and the N1-P2 complex were similar to HCs, indicating that there may be disturbances in multimodal speech perception at an early stage of processing, which may be due to deficits in auditory speech perception. Moreover, a positive relationship between fronto-central N1 amplitudes and the positive subscale of the Positive and Negative Syndrome Scale (PANSS) has been observed. Full article

12 pages, 1453 KiB  
Article
A Visual Speech Intelligibility Benefit Based on Speech Rhythm
by Saya Kawase, Chris Davis and Jeesun Kim
Brain Sci. 2023, 13(6), 932; https://doi.org/10.3390/brainsci13060932 - 08 Jun 2023
Cited by 2 | Viewed by 1038
Abstract
This study examined whether visual speech provides speech-rhythm information that perceivers can use in speech perception. This was tested by using speech that naturally varied in the familiarity of its rhythm. Thirty Australian English L1 listeners performed a speech perception in noise task with English sentences produced by three speakers: an English L1 speaker (familiar rhythm); an experienced English L2 speaker who had a weak foreign accent (familiar rhythm), and an inexperienced English L2 speaker who had a strong foreign accent (unfamiliar speech rhythm). The spoken sentences were presented in three conditions: Audio-Only (AO), Audio-Visual with mouth covered (AVm), and Audio-Visual (AV). Speech was best recognized in the AV condition regardless of the degree of foreign accent. However, speech recognition in AVm was better than AO for the speech with no foreign accent and with a weak accent, but not for the speech with a strong accent. A follow-up experiment was conducted that only used the speech with a strong foreign accent, under more audible conditions. The results also showed no difference between the AVm and AO conditions, indicating the null effect was not due to a floor effect. We propose that speech rhythm is conveyed by the motion of the jaw opening and closing, and perceivers use this information to better perceive speech in noise. Full article

16 pages, 1170 KiB  
Article
The McGurk Illusion: A Default Mechanism of the Auditory System
by Zunaira J. Iqbal, Antoine J. Shahin, Heather Bortfeld and Kristina C. Backer
Brain Sci. 2023, 13(3), 510; https://doi.org/10.3390/brainsci13030510 - 19 Mar 2023
Cited by 2 | Viewed by 1747
Abstract
Recent studies have questioned past conclusions regarding the mechanisms of the McGurk illusion, especially how McGurk susceptibility might inform our understanding of audiovisual (AV) integration. We previously proposed that the McGurk illusion is likely attributable to a default mechanism, whereby either the visual system, auditory system, or both default to specific phonemes—those implicated in the McGurk illusion. We hypothesized that the default mechanism occurs because visual stimuli with an indiscernible place of articulation (like those traditionally used in the McGurk illusion) lead to an ambiguous perceptual environment and thus a failure in AV integration. In the current study, we tested the default hypothesis as it pertains to the auditory system. Participants performed two tasks. One task was a typical McGurk illusion task, in which individuals listened to auditory-/ba/ paired with visual-/ga/ and judged what they heard. The second task was an auditory-only task, in which individuals transcribed trisyllabic words with a phoneme replaced by silence. We found that individuals’ transcription of missing phonemes often defaulted to ‘/d/t/th/’, the same phonemes often experienced during the McGurk illusion. Importantly, individuals’ default rate was positively correlated with their McGurk rate. We conclude that the McGurk illusion arises when people fail to integrate visual percepts with auditory percepts, due to visual ambiguity, thus leading the auditory system to default to phonemes often implicated in the McGurk illusion. Full article

Review

14 pages, 341 KiB  
Review
The Role of Talking Faces in Infant Language Learning: Mind the Gap between Screen-Based Settings and Real-Life Communicative Interactions
by Joan Birulés, Louise Goupil, Jérémie Josse and Mathilde Fort
Brain Sci. 2023, 13(8), 1167; https://doi.org/10.3390/brainsci13081167 - 05 Aug 2023
Cited by 1 | Viewed by 1201
Abstract
Over the last few decades, developmental (psycho)linguists have demonstrated that perceiving talking faces audio-visually is important for early language acquisition. Using mostly well-controlled and screen-based laboratory approaches, this line of research has shown that paying attention to talking faces is likely to be one of the powerful strategies infants use to learn their native(s) language(s). In this review, we combine evidence from these screen-based studies with another line of research that has studied how infants learn novel words and deploy their visual attention during naturalistic play. In our view, this is an important step toward developing an integrated account of how infants effectively extract audiovisual information from talkers’ faces during early language learning. We identify three factors that have been understudied so far, despite the fact that they are likely to have an important impact on how infants deploy their attention (or not) toward talking faces during social interactions: social contingency, speaker characteristics, and task-dependencies. Last, we propose ideas to address these issues in future research, with the aim of reducing the existing knowledge gap between current experimental studies and the many ways infants can and do effectively rely upon the audiovisual information extracted from talking faces in their real-life language environment. Full article

19 pages, 393 KiB  
Review
Age-Related Changes to Multisensory Integration and Audiovisual Speech Perception
by Jessica L. Pepper and Helen E. Nuttall
Brain Sci. 2023, 13(8), 1126; https://doi.org/10.3390/brainsci13081126 - 25 Jul 2023
Cited by 2 | Viewed by 1670
Abstract
Multisensory integration is essential for the quick and accurate perception of our environment, particularly in everyday tasks like speech perception. Research has highlighted the importance of investigating bottom-up and top-down contributions to multisensory integration and how these change as a function of ageing. Specifically, perceptual factors like the temporal binding window and cognitive factors like attention and inhibition appear to be fundamental in the integration of visual and auditory information—integration that may become less efficient as we age. These factors have been linked to brain areas like the superior temporal sulcus, with neural oscillations in the alpha-band frequency also being implicated in multisensory processing. Age-related changes in multisensory integration may have significant consequences for the well-being of our increasingly ageing population, affecting their ability to communicate with others and safely move through their environment; it is crucial that the evidence surrounding this subject continues to be carefully investigated. This review will discuss research into age-related changes in the perceptual and cognitive mechanisms of multisensory integration and the impact that these changes have on speech perception and fall risk. The role of oscillatory alpha activity is of particular interest, as it may be key in the modulation of multisensory integration. Full article

Other

6 pages, 221 KiB  
Opinion
Perceptual Doping: A Hypothesis on How Early Audiovisual Speech Stimulation Enhances Subsequent Auditory Speech Processing
by Shahram Moradi and Jerker Rönnberg
Brain Sci. 2023, 13(4), 601; https://doi.org/10.3390/brainsci13040601 - 01 Apr 2023
Cited by 1 | Viewed by 1002
Abstract
Face-to-face communication is one of the most common means of communication in daily life. We benefit from both auditory and visual speech signals that lead to better language understanding. People prefer face-to-face communication when access to auditory speech cues is limited because of background noise in the surrounding environment or in the case of hearing impairment. We demonstrated that an early, short period of exposure to audiovisual speech stimuli facilitates subsequent auditory processing of speech stimuli for correct identification, but early auditory exposure does not. We called this effect “perceptual doping” as an early audiovisual speech stimulation dopes or recalibrates auditory phonological and lexical maps in the mental lexicon in a way that results in better processing of auditory speech signals for correct identification. This short opinion paper provides an overview of perceptual doping and how it differs from similar auditory perceptual aftereffects following exposure to audiovisual speech materials, its underlying cognitive mechanism, and its potential usefulness in the aural rehabilitation of people with hearing difficulties. Full article