Natural asynchronies in audiovisual communication signals regulate neuronal multisensory interactions in voice-sensitive cortex

Catherine Perrodin (a), Christoph Kayser (b), Nikos K. Logothetis (a,c), and Christopher I. Petkov (d,1)

(a) Department of Physiology of Cognitive Processes, Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany; (b) Institute of Neuroscience and Psychology, University of Glasgow, Glasgow G12 8QB, United Kingdom; (c) Division of Imaging Science and Biomedical Engineering, University of Manchester, Manchester M13 9PT, United Kingdom; and (d) Institute of Neuroscience, Newcastle University Medical School, Newcastle upon Tyne NE2 4HH, United Kingdom

Edited by Jon H. Kaas, Vanderbilt University, Nashville, TN, and approved December 2, 2014 (received for review July 7, 2014)

When social animals communicate, the onset of informative content in one modality varies considerably relative to the other, such as when visual orofacial movements precede a vocalization. These naturally occurring asynchronies do not disrupt intelligibility or perceptual coherence. However, they occur on time scales where they likely affect integrative neuronal activity in ways that have remained unclear, especially for hierarchically downstream regions in which neurons exhibit temporally imprecise but highly selective responses to communication signals. To address this, we exploited naturally occurring face- and voice-onset asynchronies in primate vocalizations. Using these as stimuli we recorded cortical oscillations and neuronal spiking responses from functional MRI (fMRI)-localized voice-sensitive cortex in the anterior temporal lobe of macaques. We show that the onset of the visual face stimulus resets the phase of low-frequency oscillations, and that the face–voice asynchrony affects the prominence of two key types of neuronal multisensory responses: enhancement or suppression. Our findings show a three-way association between temporal delays in audiovisual communication signals, phase-resetting of ongoing oscillations, and the sign of multisensory responses. The results reveal how natural onset asynchronies in cross-sensory inputs regulate network oscillations and neuronal excitability in the voice-sensitive cortex of macaques, a suggested animal model for human voice areas. These findings also advance predictions on the impact of multisensory input on neuronal processes in face areas and other brain regions.

oscillations | neurons | communication | voice | multisensory

How the brain parses multisensory input despite the variable and often large differences in the onset of sensory signals across different modalities remains unclear. We can maintain a coherent multisensory percept across a considerable range of spatial and temporal discrepancies (1–4): For example, auditory and visual speech signals can be perceived as belonging to the same multisensory “object” over temporal windows of hundreds of milliseconds (5–7). However, such misalignment can drastically affect neuronal responses in ways that may also differ between brain regions (8–10). We asked how natural asynchronies in the onset of face/voice content in communication signals would affect voice-sensitive cortex, a region in the ventral “object” pathway (11) where neurons (i) are selective for auditory features in communication sounds (12–14), (ii) are influenced by visual “face” content (12), and (iii) display relatively slow and temporally variable responses in comparison with neurons in primary auditory cortical or subcortical structures (14–16).

Neurophysiological studies in human and nonhuman animals have provided considerable insights into the role of cortical oscillations during multisensory conditions and for parsing speech. Cortical oscillations entrain to the slow temporal dynamics of natural sounds (17–20) and are thought to reflect the excitability of local networks to sensory inputs (21–24). Moreover, at least in auditory cortex, the onset of sensory input from the nondominant modality can reset the phase of ongoing auditory cortical oscillations (8, 25, 26), modulating the processing of subsequent acoustic input (8, 18, 22, 26–28). Thus, the question arises as to whether and how the phase of cortical oscillations in voice-sensitive cortex is affected by visual input.

There is limited evidence on how asynchronies in multisensory stimuli affect cortical oscillations or neuronal multisensory interactions. Moreover, as we consider in the following, there are some discrepancies in findings between studies, leaving unclear what predictions can be made for regions beyond the first few stages of auditory cortical processing. In general there are two types of multisensory response modulations: Neuronal firing rates can be either suppressed or enhanced in multisensory compared with unisensory conditions (9, 12, 25, 29, 30). In the context of audiovisual communication Ghazanfar et al. (9) showed that these two types of multisensory influences are not fixed. Rather, they reported that the proportion of suppressed and enhanced multisensory responses in auditory cortical local-field potentials varies depending on the natural temporal asynchrony between the onset of visual (face) and auditory (voice) information. They interpret their results as an approximately linear change from enhanced to suppressed responses with increasing asynchrony between face movements and vocalization onset.

Significance

Social animals often combine vocal and facial signals into a coherent percept, despite variable misalignment in the onset of informative audiovisual content. However, whether and how natural misalignments in communication signals affect integrative neuronal responses is unclear, especially for neurons in recently identified temporal voice-sensitive cortex in nonhuman primates, which has been suggested as an animal model for human voice areas. We show striking effects on the excitability of voice-sensitive neurons by the variable misalignment in the onset of audiovisual communication signals. Our results allow us to predict the state of neuronal excitability from the cross-sensory asynchrony in natural communication signals and suggest that the general pattern that we observed would generalize to face-sensitive cortex and certain other brain areas.

Author contributions: C.P. and C.I.P. designed research; C.P. performed research; C.K. and N.K.L. contributed new reagents/analytic tools; C.P. analyzed data; and C.P. and C.I.P. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Freely available online through the PNAS open access option.

(1) To whom correspondence should be addressed. Email: [email protected].

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1412817112/-/DCSupplemental.

In contrast, Lakatos et al. (8) found a cyclic, rather than linear, pattern of multisensory enhancement and suppression in auditory cortical neuronal responses as a function of increasing auditory–somatosensory stimulus onset asynchrony. This latter result suggests that the proportion of suppressed/enhanced multisensory responses varies nonlinearly (i.e., cyclically) with the relative onset timing of cross-modal stimuli. Although such results highlight the importance of multisensory asynchronies in regulating neural excitability, the differences between the studies prohibit generalizing predictions to other brain areas and thus leave the general principles unclear.

In this study we aimed to address how naturally occurring temporal asynchronies in primate audiovisual communication signals affect both cortical oscillations and neuronal spiking activity in a voice-sensitive region. Using a set of human and monkey dynamic faces and vocalizations exhibiting a broad range of audiovisual onset asynchronies (Fig. 1), we demonstrate a three-way association between face–voice onset asynchrony, cross-modal phase resetting of cortical oscillations, and a cyclic pattern of dynamically changing proportions of suppressed and enhanced neuronal multisensory responses.

Results

We targeted neurons for extracellular recordings in a right-hemisphere voice-sensitive area on the anterior supratemporal plane of the rhesus macaque auditory cortex. This area resides anterior to tonotopically organized auditory fields (13). Recent work has characterized the prominence and specificity of multisensory influences on neurons in this area: Visual influences on these neurons are typically characterized by nonlinear multisensory interactions (12), with audiovisual responses being either superadditive or subadditive in relation to the sum of the responses to the unimodal stimuli (Fig. 2A). The terms “enhanced” and “suppressed” are often used to refer to a multisensory difference relative to the maximally responding unisensory condition. These types of multisensory responses can be comparable to superadditive/subadditive effects (i.e., relative to the summed unimodal responses) if there is a weak or no response to one of the stimulus conditions (e.g., visual stimuli in auditory cortex). In our study the effects are always measured in relation to the summed unimodal response, yet we use the terms enhanced/suppressed simply for readability throughout.

We investigated whether the proportions of suppressed/enhanced neuronal spiking responses covary with the asynchrony between the visual and auditory stimulus onset, or other sound features. The visual–auditory delays (VA delays) ranged from 77 to 219 ms (time between the onset of the dynamic face video and the onset of the vocalization; Fig. 1). When the vocalization stimuli were sorted according to their VA delays, we found that the relative proportion of units showing either multisensory enhancement or suppression of their firing rates strongly depended on VA delay. Multisensory enhancement was most likely for midrange VA delays between 109 and 177 ms, whereas suppression was more likely for very short VA delays (77–89 ms) and long VA delays (183–219 ms; Fig. 2B).

We first ruled out trivial explanations for the association between VA delays and the proportions of the two forms of multisensory influences. The magnitude of the unisensory responses and the prominence of visual modulation were found to be comparable for both midrange and long VA delays (Fig. S1). Moreover, we found little support for any topographic differences in the multisensory response types (Fig. S2), and no other feature of the audiovisual communication signals, such as call type, caller body size, or caller identity, was as consistently associated with the direction of visual influences (12). Together, these observations underscore the association between VA delays and the form of multisensory influences. Interestingly, the relationship between the type of multisensory interaction and the VA delay seemed to follow a cyclic pattern and was well fit by an 8.4-Hz sinusoidal function (red curve in Fig. 2B; adjusted R2 = 0.583). Fitting cyclic functions with other time scales explained much smaller amounts of the data variance (e.g., 4 Hz: adjusted R2 = 0.151; 12 Hz: adjusted R2 = −0.083), suggesting that a specific time scale around 8 Hz underlies the multisensory modulation pattern.

We confirmed that this result was robust to the specific way of quantification. First, the direction of multisensory influences was stable throughout the 400-ms response window used for analysis (Fig. S3), and the cyclic modulation pattern was evident when using a shorter response window (Fig. S4). Second, this cyclic pattern was also evident in well-isolated single units from the dataset (Fig. S5), and a similar pattern was observed using a nonbinary metric of multisensory modulation (additivity index; Fig. S6).

Given the cyclic nature of the multisensory interaction and VA delay association, we next asked whether and how this relates to cortical oscillations in the local-field potential (LFP). To assess the oscillatory context of the spiking activity for midrange vs. long VA delays, we computed the stimulus-evoked broadband LFP response to the auditory, visual, and audiovisual stimulation. The grand average evoked potential across sites and stimuli revealed strong auditory- and audiovisually evoked LFPs, including a visually evoked LFP (Fig. S7A). We observed that purely visual stimulation elicited a significant power increase in the low frequencies (5–20 Hz; bootstrapped significance test, P < 0.05, Bonferroni corrected; Fig. 3A).

Fig. 1. Audiovisual primate vocalizations and visual–auditory delays. (A–C) Examples of audiovisual rhesus macaque coo (A and B) and grunt (C) vocalizations used for stimulation and their respective VA delays (time interval between the onset of mouth movement and the onset of the vocalization; red bars). The video starts at the onset of mouth movement, with the first frame showing a neutral facial expression, followed by mouth movements associated with the vocalization. Gray lines indicate the temporal position of the representative video frames (top row). The amplitude waveforms (middle row) and the spectrograms (bottom row) of the corresponding auditory component of the vocalization are displayed below.

This visual response was accompanied by a significant increase in intertrial phase coherence, restricted to the 5- to 10-Hz frequency band, between 130 and 350 ms after video onset (phase coherence values significantly larger than a bootstrapped null distribution of time-frequency values; P < 0.05, Bonferroni corrected; Fig. 3B). In contrast, auditory and audiovisual stimulation yielded broadband increases in LFP power (Fig. S7B) and an increased phase coherence spanning a wider band (5–45 Hz; Fig. S7C). Thus, the response induced by the purely visual stimuli suggests that dynamic faces may influence the state of slow rhythmic activity in this temporal voice area via phase resetting of ongoing low-frequency oscillations. Notably, the time scale of the relevant brain oscillations (5–10 Hz) and the time at which the phase coherence increases (∼100–350 ms; Fig. 3B) match the time scale (8 Hz) and range of VA delays at which the cyclic pattern of multisensory influences on the spiking activity emerged (Fig. 2B).
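As a point of reference, intertrial phase coherence of this kind is commonly computed as the length of the mean resultant vector of band-limited single-trial phases. The following is a minimal MATLAB sketch of that computation, not the authors' original code; the sampling rate, data, and variable names are illustrative assumptions:

% Minimal sketch (not the authors' code): intertrial phase coherence (ITC)
% from band-limited single-trial phases. `lfp`, `fs`, and the band are assumed.
fs  = 1000;                             % assumed sampling rate (Hz)
lfp = randn(40, 1500);                  % placeholder: 40 trials x 1.5-s epochs
[b, a] = butter(4, [5 10] / (fs/2));    % 5- to 10-Hz band-pass (4th order)
banded = filtfilt(b, a, lfp')';         % zero-phase filter each trial
phs    = angle(hilbert(banded')');      % instantaneous phase per trial
% ITC at each time point: length of the mean resultant vector across trials
% (0 = random phases across trials, 1 = perfect phase alignment).
itc = abs(mean(exp(1i * phs), 1));
plot((0:size(lfp,2)-1)/fs, itc); xlabel('Time (s)'); ylabel('Phase coherence');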

We found little evidence that the phase resetting was species-specific, because both human and monkey dynamic face stimuli elicited a comparable increase in intertrial phase coherence (Fig. S8A). Similarly, the relative proportion of enhanced/suppressed units was not much affected when a coo or grunt call was replaced with a phase-scrambled version (that preserves the overall frequency content but removes the temporal envelope; Fig. S8B and ref. 12). Both observations suggest that the underlying processes are not stimulus-specific but reflect more generic visual modulation of voice area excitability.

Fig. 2. VA delay and the direction (sign) of multisensory interactions. (A) Example spiking responses with nonlinear visual modulation of auditory activity: enhancement (superadditive multisensory effect, Top) and suppression (subadditive multisensory effect, Bottom). The horizontal gray line indicates the duration of the auditory stimulus, and the light gray box represents the 400-ms peak-centered response window. Bar plots indicate the response amplitudes in the 400-ms response window (shown is mean ± SEM). The P values refer to significantly nonlinear audiovisual interactions, defined by comparing the audiovisual response with all possible summations of the unimodal responses [AV vs. (A + V), z test, *P < 0.01]. (B) Proportions of enhanced and suppressed multisensory units by stimulus, arranged as a function of increasing VA delays (n = 81 units). Note that the bars are spaced at equidistant intervals for display purposes. Black dots indicate the proportion of enhanced units for each VA delay value, while respecting the real relative positions of VA delay values. The red line represents the sinusoid with the best-fitting frequency (8.4 Hz, adjusted R2 = 0.58).

Fig. 3. Visually evoked oscillatory context surrounding the spiking activity at different audiovisual asynchronies in voice-sensitive cortex. (A) Time–frequency plot of averaged single-trial spectrograms in response to visual face stimulation. The population-averaged spectrogram has been baseline-normalized for display purposes. (B) Time–frequency plot of average phase coherence values across trials. The color code reflects the strength of phase alignment evoked by the visual stimuli. The ranges of values in A and B are the same as in Fig. S7, to allow for closer comparisons. Black contours indicate the pixels with significant power or phase coherence increase, identified using a bootstrapping procedure (right-tailed z test, P < 0.05, Bonferroni corrected). (C) Distribution of the theta/low-alpha (5- to 10-Hz band) phase values at the time of auditory response, for vocalizations with midrange VA delays (n = 52 sites). (D) Distribution of theta/low-alpha band phase values at sound arrival, for the stimuli with long VA delays. The vertical black bar indicates the value of the circular median phase angle.

The frequency band exhibiting phase reset included the ∼8-Hz frequency at which we found the cyclic variation of multisensory enhancement and suppression in the firing rates (Fig. 2B). Thus, we next asked whether the oscillatory context in this band could predict the differential multisensory influence on neuronal firing responses at different VA delays. We computed the value of the visually evoked phase in the relevant 5- to 10-Hz theta band for each recording site at the time at which the vocalization sound first affects these auditory neurons' responses. This time point was computed as the sum of the VA delay for each vocalization stimulus and the sensory processing time, which we estimated using the mean auditory latency of neurons in the voice area (110 ms; see ref. 12). The phase distributions of the theta band oscillations at midrange and long VA delays are shown in Fig. 3 C and D. Both distributions deviated significantly from uniformity (Rayleigh test, P = 1.2 × 10^−25, Z = 41.1 for midrange VA delays; P = 0.035, Z = 3.4 for long VA delays). For midrange VA delays the phase distributions were centered around a median phase angle of 1.6 rad (92°), and for long VA delays at −1.9 rad (−109°). The phase distributions significantly differed across midrange and long VA delays (Kuiper two-sample test: P = 0.001, K = 5,928). These results show that the auditory stream of the multisensory signal reaches voice-sensitive neurons in a different oscillatory context for the two VA-delay durations. In particular, at midrange VA delays the preferred phase angle (ca. π/2) corresponds to the descending slope of ongoing oscillations and is typically considered the “ideal” excitatory phase: The spiking response to a stimulus arriving at that phase is enhanced (8, 25, 28). In contrast, at long VA delays the preferred phase value corresponds to a phase of less optimal neuronal excitability.

Finally, we directly probed the association between the cross-modal phase at the time of auditory response onset and the direction of subsequent multisensory spiking responses. We labeled each unit that displayed significant audiovisual interactions with the corresponding visually evoked theta phase angle immediately before the onset of the vocalization response. This revealed that the proportions of enhanced and suppressed spiking responses significantly differed between negative and positive phase values [χ2 test, P = 0.0081, χ2(1, n = 41) = 7.02]. Multisensory enhancement was more frequently associated with positive phase angles (10/27 = 37% of units) compared with negative phase angles (3/14 = 21% of units; Fig. S9). In summary, the findings show three-way relationships between the visual–auditory delays in communication signals, the phase of theta oscillations, and a cyclically varying proportion of suppressed vs. enhanced multisensory neuronal responses in voice-sensitive cortex.
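To make the phase read-out above concrete, the following is a minimal MATLAB sketch (using the CircStat toolbox, ref. 57) of extracting the visually evoked theta phase at the estimated time the voice first drives the neurons and testing the distribution for nonuniformity. The sampling rate, data, and variable names are illustrative assumptions, not the authors' code:

fs       = 1000;                          % assumed sampling rate (Hz)
vaDelay  = 0.129;                         % VA delay of one stimulus (s)
latency  = 0.110;                         % mean auditory latency (s; ref. 12)
idx      = round((vaDelay + latency) * fs) + 1;  % sample of auditory impact
thetaPhi = 2*pi*rand(52, 1500) - pi;      % placeholder: 5- to 10-Hz phase, 52 sites
phiAtSound = thetaPhi(:, idx);            % phase at sound arrival, per site
[p, z] = circ_rtest(phiAtSound);          % Rayleigh test for nonuniformity
medPhi = circ_median(phiAtSound);         % circular median phase angle (rad)
fprintf('Rayleigh P = %.3g, Z = %.2f, median = %.2f rad\n', p, z, medPhi);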

Discussion

This study probed neurons in a primate voice-sensitive region, which forms a part of the ventral object processing stream (11) and has links to human functional MRI (fMRI)-identified temporal voice areas (31, 32). The results show considerable impact on the state of global and local neuronal excitability in this region by naturally occurring audiovisual asynchronies in the onset of informative content. The main findings show a three-way association between (i) temporal asynchronies in the onset of visual dynamic face content and the onset of vocalizations, (ii) the phase of low-frequency neuronal oscillations, and (iii) cyclically varying proportions of enhanced vs. suppressed multisensory neuronal responses.

Prior studies do not provide a consistent picture of how cross-sensory stimulus asynchronies affect neuronal multisensory responses. One study evaluated the impact on audiovisual modulations in LFPs around core and belt auditory cortex using natural face–voice asynchronies in videos of vocalizing monkeys (9). The authors reported a gradual increase in the prominence of multisensory suppression with increasing visual–auditory onset delays. However, another study recording spiking activity from auditory cortex found that shifting somatosensory nerve stimulation relative to sound stimulation with tones resulted in a cyclic, rather than linear, pattern of alternating proportions of enhanced and suppressed spiking responses (8). A third study mapped the neural window of multisensory interaction in A1 using transient audiovisual stimuli with a range of onsets (25), identifying a fixed time window (20–80 ms) in which sounds interact in a mostly suppressive manner. Finally, a fourth study recorded LFPs in the superior temporal sulcus (STS) and found that different frequency bands process audiovisual input streams differently (10). The study also showed enhancement for short visual–auditory asynchronies in the alpha band and weak to no dependency on visual–auditory asynchrony in the other LFP frequency bands, including theta (10). Given the variability in results, the most parsimonious interpretation was that multisensory asynchrony effects on neuronal excitability are stimulus-, neuronal response measure-, and/or brain area-specific.

Comparing our findings to these previous results suggests that the multisensory effects are not necessarily stimulus-specific and the differences across brain areas might be more quantitative than qualitative. Specifically, our data from voice-sensitive cortex show that the direction of audiovisual influences on spiking activity varies cyclically as a function of VA delay. This finding is most consistent with the data from auditory cortical neurons showing cyclic patterns of suppressed/enhanced responses to somatosensory–auditory stimulus asynchronies (8). Together these results suggest a comparable impact on cortical oscillations and neuronal multisensory modulations by asynchronies in different types of multisensory stimuli (8, 25). Interestingly, when looked at in detail some of the other noted studies (9, 25) show at least partial evidence for a cyclic pattern of multisensory interactions.

Some level of regional specificity is expected, given that, for example, relatively simple sounds do not drive neurons in voice-sensitive cortex very effectively (13, 14). However, we did not find strong evidence for any visual or auditory stimulus specificity in the degree of phase resetting or the proportions of multisensory responses. Hence, it may well be that some oscillatory processes underlying multisensory interactions reflect general mechanisms of cross-modal visual influences, which are shared between voice-sensitive and earlier auditory cortex. It is an open question whether regions further downstream, such as the frontal cortex or STS (33, 34), might integrate cross-sensory input differently. In any case, our results emphasize the rhythmicity underlying multisensory interactions and hence generate specific predictions for other sensory areas such as face-sensitive cortex (35).

The present results predict qualitatively similar effects for face-sensitive areas in the ventral temporal lobe, with some key quantitative differences in the timing of neuronal interactions, as follows. The dominant input from the visual modality into face-sensitive neurons would drive face-selective spiking responses with a latency of ∼100 ms after the onset of mouth movement (35–37). Nearly coincident cross-modal input into face-sensitive areas from the nondominant auditory modality would affect the phase of the ongoing low-frequency oscillations and likely affect face areas at about the same time as the face-selective spiking responses (38) or later for any VA delay. Based on our results, we predict a comparable cyclic pattern of auditory modulation of the visual spiking activity, as a function of VA delay. However, because in this case the dominant modality for face-area neurons is visual, and in natural conditions visual onset often precedes vocalization onset, the pattern of excitability is predicted to be somewhat phase-shifted in relation to those from the voice area.

For example, shortly after the onset of face-selective neuronal responses, perfectly synchronous stimulation (0 ms), or those with relatively short VA delays (∼75 ms), would be associated predominantly with multisensory suppression. Interestingly, some evidence for this prediction can already be seen in the STS results of a previous study (39) using synchronous face–voice stimulation.

The general mechanism of cross-modal phase resetting of cortical oscillations and its impact on neuronal response modulation has been described in the primary auditory cortex of nonhuman primates (8, 25) and in auditory and visual cortices in humans (27, 40). Prior work has also highlighted low-frequency (e.g., theta) oscillations and has hypothesized that one key role of phase-resetting mechanisms is to align cortical excitability to important events in the stimulus stream (8, 22, 24, 26). Our study extends these observations to voice-sensitive cortex: We observed that visual stimulation resets the phase of theta/low-alpha oscillations and that the resulting multisensory modulation of firing rates depends on the audiovisual onset asynchrony. We also show how cross-sensory asynchronies in communication signals affect the phase of low-frequency cortical oscillations and regulate periods of neuronal excitability.

Cross-modal perception can accommodate considerable temporal asynchronies between the individual modalities before the coherence of the multimodal percept breaks down (5–7), in contrast to the high perceptual sensitivity to unisensory input alignment (41). For example, observers easily accommodate the asynchrony between the onset of mouth movement and the onset of a vocalization sound, which can be up to 300 ms in monkey vocalizations (9) or human speech (42). More generally, a large body of behavioral literature shows that multisensory perceptual fusion can be robust over extended periods of cross-sensory asynchrony, without any apparent evidence for “cyclic” fluctuations in the coherence of the multisensory percept (5–7). Given the variety of multisensory response types elicited during periods in which stable perceptual fusion should occur, our results underscore the functional role of both enhanced and suppressed spiking responses (43, 44). However, this perceptual robustness is in apparent contrast to the observed rhythmicity of neuronal integrative processes (8, 9, 25, 30).

It could be that audiovisual asynchronies and their cyclic effects on neuronal excitability are associated with subtle fluctuations in perceptual sensitivity that are not detected with suprathreshold stimuli. Evidence supporting this possibility in human studies shows that the phase of entrained or cross-modally reset cortical oscillations can have subtle effects on auditory perception (24, 45, 46), behavioral response times (26), and visual detection thresholds (23, 40, 47, 48). Previous work has also shown both that the degree of multisensory perceptual binding is modulated by stimulus type (5), task (49), and prior experience (50), and that oscillatory entrainment adjusts as a function of selective attention to visual or auditory stimuli (51, 52). Given this, it seemed important to first investigate multisensory interactions in voice-sensitive cortex during a visual fixation task irrelevant to the specific face/voice stimuli, so as to minimize task-dependent influences. This task-neutral approach is also relevant given that the contribution of individual cortical areas to multisensory voice perception remains unclear. Future work needs to compare the patterns of multisensory interactions across temporal lobe regions and to identify their specific impact on perception.

By design, the start of the videos in our experiment is indicative of the onset of a number of different sources of visually informative content. Although articulatory mouth movements seem to dominantly attract the gaze of primates (53, 54), a continuous visual stream might offer a number of time points at which visual input can influence the phase of the ongoing auditory cortical oscillations by capturing the animal's attention and gaze direction (55). Starting from our results, future work can specify whether and how subsequent audiovisual fluctuations in the onset of informative content alter or further affect the described multisensory processes.

In summary, our findings show that temporal asynchronies in audiovisual face/voice communication signals seem to reset the phase of theta-range cortical oscillations and regulate the two key types of multisensory neuronal interactions in primate voice-sensitive cortex. This makes it possible to predict the form of local and global neuronal multisensory responses from the naturally occurring asynchrony in the audiovisual input signal. This study can serve as a link between neuron-level work in nonhuman animal models and work using noninvasive approaches in humans to study the neurobiology of multisensory processes.

Materials and Methods

Full methodological details are provided in SI Materials and Methods and are summarized here. Two adult male rhesus macaques (Macaca mulatta) participated in these experiments. All procedures were approved by the local authorities (Regierungspräsidium Tübingen, Germany) and were in full compliance with the guidelines of the European Community (EUVD 86/609/EEC) for the care and use of laboratory animals.

Audiovisual Stimuli. Naturalistic audiovisual stimuli consisted of digital video clips (recorded with a Panasonic NV-GS17 digital camera) of a set of “coo” and “grunt” vocalizations by rhesus monkeys and recordings of humans imitating monkey coo vocalizations. The stimulus set included n = 10 vocalizations. For details see ref. 12 and SI Materials and Methods.

Electrophysiological Recordings. Electrophysiological recordings were obtained while the animals performed a visual fixation task. Only data from successfully completed trials were analyzed further (SI Materials and Methods). The two macaques had previously participated in fMRI experiments to localize their voice-preferring regions, including the anterior voice-identity sensitive clusters (see refs. 13 and 31). A custom-made multielectrode system was used to independently advance up to five epoxy-coated tungsten microelectrodes (0.8–2 MOhm impedance; FHC Inc.). The electrodes were advanced to the MRI-calculated depth of the anterior auditory cortex on the supratemporal plane (STP) through an angled grid placed on the recording chamber. Electrophysiological signals were amplified using an amplifier system (Alpha Omega GmbH), filtered between 4 Hz and 10 kHz (four-point Butterworth filter) and digitized at a 20.83-kHz sampling rate. For further details see SI Materials and Methods and ref. 13.

The data were analyzed in MATLAB (MathWorks). The spiking activity was obtained by first high-pass filtering the recorded broadband signal at 500 Hz (third-order Butterworth filter), then extracted offline using commercial spike-sorting software (Plexon Offline Sorter; Plexon Inc.). Spike times were saved at a resolution of 1 ms. Peristimulus time histograms were obtained using 5-ms bins and 10-ms Gaussian smoothing (FWHM). LFPs were obtained by low-pass filtering the recorded broadband signal at 150 Hz (third-order Butterworth filter). The broadband evoked potentials were full-wave rectified. For time-frequency analysis, trial-based activity between 5 and 150 Hz was filtered into 5-Hz-wide bands using a fourth-order Butterworth filter. Instantaneous power and phase were extracted using the Hilbert transform on each frequency band.
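As a reference for the time-frequency step just described, here is a minimal MATLAB sketch of filtering a broadband trace into 5-Hz-wide bands and extracting Hilbert power and phase. The sampling rate and data are placeholder assumptions, not the original analysis code:

fs = 1000;                            % assumed sampling rate (Hz)
x  = randn(1, 2*fs);                  % placeholder: one 2-s broadband LFP trace
lo = 5:5:145;                         % lower edges of the 5-Hz-wide bands
pow = zeros(numel(lo), numel(x));
phs = zeros(numel(lo), numel(x));
for k = 1:numel(lo)
    [b, a] = butter(4, [lo(k), lo(k)+5] / (fs/2));  % 4th-order band-pass
    h = hilbert(filtfilt(b, a, x));   % zero-phase filtering, analytic signal
    pow(k, :) = abs(h).^2;            % instantaneous power
    phs(k, :) = angle(h);             % instantaneous phase (rad)
end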

Data Analysis. A unit was considered auditory-responsive if its average response amplitude exceeded 2 SD units from its baseline activity during a continuous period of at least 50 ms, for any of the experimental sounds in the set of auditory or audiovisual stimuli. A recording site was included in the LFP analysis if at least one unit recorded at this site showed a significant auditory response. For each unit and each stimulus, the mean of the baseline response was subtracted to compensate for fluctuations in spontaneous activity. Response amplitudes were defined as the average response in a 400-ms window centered on the peak of the trial-averaged stimulus response. The same window was used to compute the auditory, visual, and audiovisual response amplitudes for each stimulus.
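A minimal MATLAB sketch of these inclusion and amplitude measures, assuming 1-ms bins and toy data (not the original analysis code):

t    = 0:1699;                                 % trial time base (ms)
psth = 6 * exp(-(t - 700).^2 / (2 * 50^2));    % toy trial-averaged PSTH (spikes/s)
base = 0.5 * randn(1, 500);                    % toy baseline-period samples
above = psth > mean(base) + 2 * std(base);     % exceeds baseline + 2 SD?
d = diff([0, above, 0]);                       % run boundaries of supra-threshold bins
isResponsive = any(find(d == -1) - find(d == 1) >= 50);   % >= 50 consecutive ms
[~, pk] = max(psth);                           % peak of the averaged response
win = max(1, pk - 200) : min(numel(psth), pk + 199);      % 400-ms peak-centered window
respAmp = mean(psth(win)) - mean(base);        % baseline-corrected response amplitude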

Multisensory interactions were assessed individually for each unit with a significant response to sensory stimulation (A, V, or AV). A sensory-responsive unit was termed “nonlinear multisensory” if its response to the audiovisual stimulus was significantly different from a linear (additive) sum of the two unimodal responses [AV ∼ (A + V)]. This was computed for each unit and for each stimulus that elicited a significant sensory response, by implementing a randomization procedure (25, 56) described in more detail in SI Materials and Methods, and sketched below.
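A minimal MATLAB sketch of that randomization logic, with placeholder trial counts and response values (the real procedure, including the false discovery rate correction, is detailed in SI Materials and Methods; poissrnd/normcdf are from the Statistics Toolbox):

rng(1);                                  % reproducible toy example
a  = poissrnd(8, 20, 1);                 % trial-based auditory responses (toy)
v  = poissrnd(2, 20, 1);                 % trial-based visual responses (toy)
av = poissrnd(13, 20, 1);                % trial-based audiovisual responses (toy)
pool = a + v'; pool = pool(:);           % all nTrials^2 possible A+V summations
boot = zeros(1000, 1);                   % bootstrapped means of summed responses
for it = 1:1000
    boot(it) = mean(pool(randi(numel(pool), numel(a), 1)));
end
z = (mean(av) - mean(boot)) / std(boot); % z score of the observed AV mean
p = 2 * (1 - normcdf(abs(z)));           % "nonlinear multisensory" if significant
isEnhanced = mean(av) > mean(boot);      % direction: enhancement vs. suppression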

The parameters and goodness of fit of a sinusoid of the form F(x) = a0 + a1·cos(ωx) + b1·sin(ωx) were estimated using the MATLAB curve-fitting toolbox. To compare the differential impact of midrange and long VA delays on neuronal activity, the analysis focused on vocalizations representative of midrange (109 and 129 ms) and long (205 and 219 ms) VA delays. The significance of stimulus-evoked increases in phase coherence was assessed using a randomization procedure. For each frequency band, a bootstrapped distribution of mean phase coherence was created by randomly sampling n = 1,000 phase coherence values across time bins. Time-frequency bins were deemed significant if their phase coherence value was sufficiently larger than the bootstrapped distribution (right-tailed z test, P < 0.05, Bonferroni corrected). Statistical testing of single-trial phase data was performed using the CircStat MATLAB toolbox (57).
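The sinusoid above corresponds to MATLAB's first-order Fourier model; a minimal sketch of such a fit with the Curve Fitting Toolbox, using made-up proportions rather than the paper's data:

% Illustrative VA delays (s) and toy proportions of enhanced units, not the
% paper's measurements; 'fourier1' is F(x) = a0 + a1*cos(wx) + b1*sin(wx).
vaDelay = [0.077 0.089 0.109 0.129 0.150 0.177 0.183 0.205 0.219]';
propEnh = [0.20  0.30  0.70  0.80  0.75  0.60  0.35  0.25  0.20 ]';
[fobj, gof] = fit(vaDelay, propEnh, 'fourier1');
fprintf('frequency = %.1f Hz, adjusted R2 = %.2f\n', fobj.w/(2*pi), gof.adjrsquare);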

ACKNOWLEDGMENTS. We thank J. Obleser and A. Ghazanfar for comments on previous versions of the manuscript. This work was supported by the Max Planck Society (C.P., C.K., and N.K.L.), Swiss National Science Foundation Grant PBSKP3_140120 (to C.P.), Wellcome Trust Grants WT092606/Z/10/Z and WT102961MA (to C.I.P.), and Biotechnology and Biological Sciences Research Council Grant BB/L027534/1 (to C.K.).

1. Shams L, Kamitani Y, Shimojo S (2000) Illusions. What you see is what you hear. Nature 408(6814):788.
2. Howard IP, Templeton WB (1966) Human Spatial Orientation (Wiley, London), p 533.
3. Slutsky DA, Recanzone GH (2001) Temporal and spatial dependency of the ventriloquism effect. Neuroreport 12(1):7–10.
4. McGrath M, Summerfield Q (1985) Intermodal timing relations and audio-visual speech recognition by normal-hearing adults. J Acoust Soc Am 77(2):678–685.
5. Stevenson RA, Wallace MT (2013) Multisensory temporal integration: Task and stimulus dependencies. Exp Brain Res 227(2):249–261.
6. van Wassenhove V, Grant KW, Poeppel D (2007) Temporal window of integration in auditory-visual speech perception. Neuropsychologia 45(3):598–607.
7. Vatakis A, Spence C (2006) Audiovisual synchrony perception for music, speech, and object actions. Brain Res 1111(1):134–142.
8. Lakatos P, Chen CM, O'Connell MN, Mills A, Schroeder CE (2007) Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron 53(2):279–292.
9. Ghazanfar AA, Maier JX, Hoffman KL, Logothetis NK (2005) Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. J Neurosci 25(20):5004–5012.
10. Chandrasekaran C, Ghazanfar AA (2009) Different neural frequency bands integrate faces and voices differently in the superior temporal sulcus. J Neurophysiol 101(2):773–788.
11. Rauschecker JP, Scott SK (2009) Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nat Neurosci 12(6):718–724.
12. Perrodin C, Kayser C, Logothetis NK, Petkov CI (2014) Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. J Neurosci 34(7):2524–2537.
13. Perrodin C, Kayser C, Logothetis NK, Petkov CI (2011) Voice cells in the primate temporal lobe. Curr Biol 21(16):1408–1415.
14. Kikuchi Y, Horwitz B, Mishkin M (2010) Hierarchical auditory processing directed rostrally along the monkey's supratemporal plane. J Neurosci 30(39):13021–13030.
15. Bendor D, Wang X (2007) Differential neural coding of acoustic flutter within primate auditory cortex. Nat Neurosci 10(6):763–771.
16. Creutzfeldt O, Hellweg FC, Schreiner C (1980) Thalamocortical transformation of responses to complex auditory stimuli. Exp Brain Res 39(1):87–104.
17. Ghitza O (2011) Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Front Psychol 2:130.
18. Giraud AL, Poeppel D (2012) Cortical oscillations and speech processing: Emerging computational principles and operations. Nat Neurosci 15(4):511–517.
19. Ding N, Chatterjee M, Simon JZ (2013) Robust cortical entrainment to the speech envelope relies on the spectro-temporal fine structure. Neuroimage 88C:41–46.
20. Ng BS, Logothetis NK, Kayser C (2013) EEG phase patterns reflect the selectivity of neural firing. Cereb Cortex 23(2):389–398.
21. Engel AK, Senkowski D, Schneider TR (2012) Multisensory integration through neural coherence. The Neural Bases of Multisensory Processes, Frontiers in Neuroscience, eds Murray MM, Wallace MT (CRC, Boca Raton, FL).
22. Schroeder CE, Lakatos P, Kajikawa Y, Partan S, Puce A (2008) Neuronal oscillations and visual amplification of speech. Trends Cogn Sci 12(3):106–113.
23. Thut G, Miniussi C, Gross J (2012) The functional importance of rhythmic activity in the brain. Curr Biol 22(16):R658–R663.
24. Ng BS, Schroeder T, Kayser C (2012) A precluding but not ensuring role of entrained low-frequency oscillations for auditory perception. J Neurosci 32(35):12268–12276.
25. Kayser C, Petkov CI, Logothetis NK (2008) Visual modulation of neurons in auditory cortex. Cereb Cortex 18(7):1560–1574.
26. Thorne JD, De Vos M, Viola FC, Debener S (2011) Cross-modal phase reset predicts auditory task performance in humans. J Neurosci 31(10):3853–3861.
27. van Atteveldt N, Murray MM, Thut G, Schroeder CE (2014) Multisensory integration: Flexible use of general operations. Neuron 81(6):1240–1253.
28. Lakatos P, et al. (2005) An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. J Neurophysiol 94(3):1904–1911.
29. Sugihara T, Diltz MD, Averbeck BB, Romanski LM (2006) Integration of auditory and visual communication information in the primate ventrolateral prefrontal cortex. J Neurosci 26(43):11138–11147.
30. Bizley JK, Nodal FR, Bajo VM, Nelken I, King AJ (2007) Physiological and anatomical evidence for multisensory interactions in auditory cortex. Cereb Cortex 17(9):2172–2189.
31. Petkov CI, et al. (2008) A voice region in the monkey brain. Nat Neurosci 11(3):367–374.
32. Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B (2000) Voice-selective areas in human auditory cortex. Nature 403(6767):309–312.
33. Werner S, Noppeney U (2010) Distinct functional contributions of primary sensory and association areas to audiovisual integration in object categorization. J Neurosci 30(7):2662–2675.
34. Ghazanfar AA, Schroeder CE (2006) Is neocortex essentially multisensory? Trends Cogn Sci 10(6):278–285.
35. Tsao DY, Freiwald WA, Tootell RB, Livingstone MS (2006) A cortical region consisting entirely of face-selective cells. Science 311(5761):670–674.
36. Leopold DA, Bondar IV, Giese MA (2006) Norm-based face encoding by single neurons in the monkey inferotemporal cortex. Nature 442(7102):572–575.
37. Perrett DI, Rolls ET, Caan W (1982) Visual neurones responsive to faces in the monkey temporal cortex. Exp Brain Res 47(3):329–342.
38. Schall S, Kiebel SJ, Maess B, von Kriegstein K (2013) Early auditory sensory processing of voices is facilitated by visual mechanisms. Neuroimage 77:237–245.
39. Dahl CD, Logothetis NK, Kayser C (2010) Modulation of visual responses in the superior temporal sulcus by audio-visual congruency. Front Integr Neurosci 4:10.
40. Romei V, Gross J, Thut G (2012) Sounds reset rhythms of visual cortex and corresponding human visual perception. Curr Biol 22(9):807–813.
41. Zampini M, Guest S, Shore DI, Spence C (2005) Audio-visual simultaneity judgments. Percept Psychophys 67(3):531–544.
42. Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA (2009) The natural statistics of audiovisual speech. PLOS Comput Biol 5(7):e1000436.
43. Kayser C, Logothetis NK, Panzeri S (2010) Visual enhancement of the information representation in auditory cortex. Curr Biol 20(1):19–24.
44. Ohshiro T, Angelaki DE, DeAngelis GC (2011) A normalization model of multisensory integration. Nat Neurosci 14(6):775–782.
45. Henry MJ, Obleser J (2012) Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proc Natl Acad Sci USA 109(49):20095–20100.
46. Henry MJ, Herrmann B, Obleser J (2014) Entrained neural oscillations in multiple frequency bands comodulate behavior. Proc Natl Acad Sci USA 111(41):14935–14940.
47. Busch NA, Dubois J, VanRullen R (2009) The phase of ongoing EEG oscillations predicts visual perception. J Neurosci 29(24):7869–7876.
48. Fiebelkorn IC, et al. (2011) Ready, set, reset: Stimulus-locked periodicity in behavioral performance demonstrates the consequences of cross-sensory phase reset. J Neurosci 31(27):9971–9981.
49. Gleiss S, Kayser C (2013) Eccentricity dependent auditory enhancement of visual stimulus detection but not discrimination. Front Integr Neurosci 7:52.
50. Powers AR 3rd, Hillock AR, Wallace MT (2009) Perceptual training narrows the temporal window of multisensory binding. J Neurosci 29(39):12265–12274.
51. Landau AN, Fries P (2012) Attention samples stimuli rhythmically. Curr Biol 22(11):1000–1004.
52. Lakatos P, et al. (2013) The spectrotemporal filter mechanism of auditory selective attention. Neuron 77(4):750–761.
53. Ghazanfar AA, Nielsen K, Logothetis NK (2006) Eye movements of monkey observers viewing vocalizing conspecifics. Cognition 101(3):515–529.
54. Lansing CR, McConkie GW (2003) Word identification and eye fixation locations in visual and visual-plus-auditory presentations of spoken sentences. Percept Psychophys 65(4):536–552.
55. Lakatos P, et al. (2009) The leading sense: Supramodal control of neurophysiological context by attention. Neuron 64(3):419–430.
56. Stanford TR, Quessy S, Stein BE (2005) Evaluating the operations underlying multisensory integration in the cat superior colliculus. J Neurosci 25(28):6499–6508.
57. Berens P (2009) CircStat: A MATLAB toolbox for circular statistics. J Stat Softw 31(10):1–21.

Supporting Information

Perrodin et al. 10.1073/pnas.1412817112

SI Materials and Methods

Visual Stimuli. The videos were acquired at 25 frames per second (640 × 480 pixels), 24-bit resolution, and compressed using Indeo video 5. The stimuli were filmed while monkeys spontaneously vocalized, seated in a primate chair. All videos were recorded in the same sound-attenuated booth with the same lighting configuration, ensuring each video had similar auditory and visual background. We selected the stimuli to ensure the callers' head position and eye gaze direction were similar across all videos played within one experimental run. Finally, the faces were centered in the images, and the head size was matched for all callers in a given experimental run to occupy similar portions of the visual field. Movie clips were cropped at the beginning of the first mouth movement, with the first frame of each video showing the neutral facial expression. A dynamic mask and uniform black background were placed around the callers' faces to crop all but the moving facial features, so that the entire face was visible while the back of the head and neck were masked. Image contrast and luminance for each channel (RGB) was normalized in all videos using Adobe Photoshop CS2. The vocalization stimuli were split into two experimental runs for presentation. The video clips were 960 and 760 ms in duration, respectively, for the two experimental sets (see ref. 1 for more details).

Auditory Stimuli. The audio tracks were acquired at 48 kHz and 16-bit resolution in stereo (PCM format). The vocalization sounds were matched in average RMS energy using MATLAB (MathWorks) scripts. All sounds were stored as WAV files, amplified using a Yamaha amplifier (AX-496), and delivered from two free-field speakers (JBL Professional), which were positioned at ear level 70 cm from the head and 50° to the left and right. Sound presentation was calibrated using a condenser microphone (4188; Brüel & Kjær) and sound level meter (2238 Mediator; Brüel & Kjær) to ensure a linear (±4 dB) transfer function of sound delivery (between 88 Hz and 20 kHz). The intensity of all of the sounds was calibrated at the position of the head to be presented at an average intensity of 65 dB sound pressure level. The duration of the auditory vocalizations was, on average, 402 ± 111 ms (mean ± SD; range: 271–590 ms).

Visual Fixation Task. Recordings were performed in a darkened and sound-insulated booth (Illtec; Illbruck Acoustic GmbH) while the animals sat in a primate restraint chair in front of a 21-inch color monitor. The stimuli and stimulus conditions (such as modality) were randomly selected for presentation. The animals were required to restrict their eye movements to a certain visual fixation window within the video frame around the central spot for the entire duration of the trial. The eye position was measured using an infrared eye-tracking system (iView X RED P/T; SensoMotoric Instruments GmbH). During the stimulation period, a visual stimulus (video sequence only), an auditory stimulus (audio track only, black screen), or an audiovisual stimulus was presented. Successful completion of a trial resulted in a juice reward. A trial began with the appearance of a central fixation spot. Once the animal engaged in the central fixation task, data acquisition started. A trial consisted of an initial 500-ms baseline period, followed by a 1,200-ms stimulation period and a 300-ms poststimulus recording time. Intertrial intervals were at least 1,800 ms. The duration of the stimulation period was chosen to encompass the longest stimuli (960 ms), to ensure that the timing was consistent across different behavioral trials. The visual stimuli (dynamic, vocalizing primate faces) covered a visual field with a 15° diameter.

Monkey 1 performed visual fixation during single trials at a time (2 s), within a 4°-diameter fixation window. This subject was scanned anesthetized in the prior fMRI experiment used to localize his voice-sensitive cluster (2). Monkey 2 previously had his anterior voice area localized with fMRI while conducting a visual fixation task. Because this macaque was accustomed to working on longer fixation trials with a more lenient fixation criterion, for this project this subject was allowed to browse the area within which the visual stimuli were presented on the monitor (four to six consecutive trials, 8–12 s, 8°- to 20°-diameter fixation window), aborting the trial if eye movements breached this area.

Recording Procedures. A combination of neurological targeting software, fMRI voice vs. nonvoice localizers, stereotactic coordinates of the voice cluster centers, and postmortem histology at the end of the experiments was used to guide the electrophysiological recording electrodes to the voice-sensitive clusters in each animal and to ascertain the position of the recording sites. The coordinates of each electrode along the anteroposterior (AP) and mediolateral (ML) axes were noted, as were the angle of the grid and the depth of the recording sites. Experimental recordings were initiated if at least one electrode had LFPs or neurons that could be driven by any of a large set of search sounds, including tones, frequency-modulated sweeps, band-passed noise, clicks, musical samples, and other natural sounds from a large library. No attempt was made to select neurons with a particular response preference, and any neuron or LFP site that seemed responsive to sound was recorded. Once a responsive site was isolated, the experiment began. After data collection was completed, each electrode was advanced at least 250 μm to a new recording site, until the neuronal activity pattern had changed considerably.

Sites in the auditory cortex were distinguished from deeper recording sites in the upper bank of the superior temporal sulcus (STS) using the depth of the electrodes, the crossing of the lateral sulcus (which is devoid of neuronal activity), the occurrence of over 2 mm of white matter between auditory cortex and STS, and the emergence of visual evoked potentials at deeper recording sites.

Electrophysiological Data Analysis. To increase statistical power, for the spiking activity results we combined single- and multiunit clusters for analysis. We confirmed that the main cyclic pattern of results reported in Fig. 2B was also evident, although underpowered, in well-isolated single units from the dataset (Fig. S5).

When computing spiking response amplitudes, we selected a 400-ms peak-response-centered window to capture the variability of individual neurons' response profiles. The length of the response window was chosen to match the average duration of the sounds in this experiment. The results were largely comparable to those obtained using a shorter 200-ms response window (Fig. S4). In addition, the multisensory effects in the spiking response profiles were found to be highly stable over time (Fig. S3). Thus, measuring multisensory enhancement/suppression in the broader 400-ms window satisfactorily captures the effects.
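
In outline, the peak-centered response amplitude can be computed as in the following sketch, assuming spiking activity has been binned into a peristimulus rate vector (the function and parameter names are ours, not the original analysis code):

    import numpy as np

    def response_amplitude(psth, bin_ms, n_baseline_bins, window_ms=400):
        # Mean baseline-corrected rate in a window centered on the
        # largest deviation from baseline after stimulus onset.
        rate = psth - psth[:n_baseline_bins].mean()
        half = int(window_ms / bin_ms) // 2
        peak = n_baseline_bins + np.argmax(np.abs(rate[n_baseline_bins:]))
        lo = max(n_baseline_bins, peak - half)
        hi = min(len(rate), peak + half)
        return rate[lo:hi].mean()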

Multisensory Interactions. Nonlinear multisensory units, whose response to the audiovisual stimulus significantly differed from the sum of both unimodal responses, were identified using a randomization procedure: A pool of all possible summations (n = #trials × #trials) of trial-based auditory and visual responses for a given stimulus was created. A bootstrapped distribution of trial-averaged, summed unimodal responses was built by averaging n = #trials randomly sampled trial-based values of A + V responses from the pool and repeating this for n = 1,000 iterations. Units for which the trial-averaged audiovisual (AV) response was sufficiently far from the bootstrapped distribution of summed unimodal (A + V) responses (z test, P < 0.05) were termed nonadditive (nonlinear) multisensory. False discovery rate correction for multiple comparisons was applied to all P values (3).

The direction and amplitude of the deviation from additivity were quantified using the following index: Additivity = 100 × [AV − (A + V)]/(A + V), where A, V, and AV reflect the baseline-corrected response amplitudes, averaged in the response window. Positive (negative) values of the additivity index indicate superadditive/enhanced (subadditive/suppressed) multisensory interactions.
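
The procedure can be summarized in code as follows. This is a sketch of the randomization test and additivity index as described above, with our own variable names, not the original analysis scripts; equal trial counts across conditions are assumed for simplicity.

    import numpy as np
    from scipy.stats import norm

    def multisensory_test(av, a, v, n_iter=1000, seed=0):
        # av, a, v: trial-based response amplitudes for the audiovisual,
        # auditory-alone, and visual-alone conditions.
        rng = np.random.default_rng(seed)
        n = len(av)
        # Pool of all possible A + V summations across trial pairings
        pool = (np.asarray(a)[:, None] + np.asarray(v)[None, :]).ravel()
        # Bootstrapped distribution of trial-averaged summed responses
        sums = np.array([rng.choice(pool, size=n).mean()
                         for _ in range(n_iter)])
        z = (np.mean(av) - sums.mean()) / sums.std(ddof=1)
        p = 2 * norm.sf(abs(z))  # two-tailed z test
        # Additivity = 100 * [AV - (A + V)] / (A + V)
        unimodal_sum = np.mean(a) + np.mean(v)
        additivity = 100.0 * (np.mean(av) - unimodal_sum) / unimodal_sum
        return additivity, p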

Information Associated with Fig. S1: Neuronal Responsiveness and Visual Modulation in Response to Stimuli with Mid- vs. Long-Range VA Delays

In this figure we show that the magnitude of the unisensory response and the prominence of visual modulation were comparable for stimuli with midrange and long VA delays (Fig. S1). At the level of individual units, we found that the subsets of stimuli with midrange and long VA delays were similarly effective in eliciting auditory responses (59 units responding to the two stimuli with midrange VA delays and 64 to those with long VA delays). The proportion of visually modulated neurons (25% and 26% for midrange and long VA delays, respectively) was similar for both stimulus types (χ2 test on the number of nonlinearly modulated units, P > 0.05). However, the two subsets of stimuli triggered different types of visual influences: Compared with the calls with midrange VA delays, those with long VA delays elicited more audiovisual suppression (χ2 test on the number of enhanced and suppressed multisensory units, P = 0.0070, χ2 = 7.275; Fig. S1A).

Similarly, when comparing the response amplitudes across the population of units for midrange vs. long VA delays (n = 84 units), we found no differences in the auditory, visual, or audiovisual responses (paired-sample t test on response amplitudes, all P > 0.05) or in the magnitude of the visual modulation (paired-sample t test on abs[AV − (A + V)], P = 0.95; Fig. S1B). However, stimuli with longer VA delays were more likely to elicit audiovisual suppression than those with midrange VA delays (paired-sample t test on the amplitude of the nonrectified visual modulation AV − (A + V), P = 0.04; Fig. S1C). These data suggest qualitatively similar auditory and audiovisual processing of calls with different VA delays, with the quantitative difference residing in the type of audiovisual interactions.
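
The proportion comparisons above are standard 2 × 2 χ2 tests on unit counts. As a sketch (the counts below are placeholders, not the actual values from this study):

    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = stimulus subset,
    # columns = [enhanced, suppressed] multisensory units.
    table = [[10, 5],    # midrange VA delays
             [4, 14]]    # long VA delays
    chi2, p, dof, expected = chi2_contingency(table, correction=False)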

Information Associated with Fig. S2: Topographic Organization of Multisensory Responses

To assess whether units showing more enhancement/suppression cluster in different anatomical locations within voice-sensitive cortex, we plotted the spatial distribution of multisensory response types for each of the two animals studied (Fig. S2). Sensory-responsive units were distributed across voice-sensitive cortex, and multisensory units were scattered throughout. Moreover, both enhanced and suppressed units were found in relatively uniform distributions at the various electrode penetration sites in both animals, and there was no obvious topographic pattern.

Information Associated with Fig. S3: Time Stability of Multisensory Spiking Responses

To quantify the extent to which multisensory neurons show consistent enhancement, suppression, or a dynamically varying mixture of both types of effects over time, we computed the proportion of time bins during sound presentation in which the direction of the multisensory effect was consistent with the global direction captured by our measure in a 400-ms response window. Across the population of visually modulated units (n = 81 units), the multisensory effect direction was consistent with our global measure in at least 68% of time points (average 94%) across different time windows ranging from 5 to 200 ms (Fig. S3). This time-resolved analysis suggests that the direction of the multisensory modulation in the spiking response profiles is stable over time for each unit and is reliably captured by the measure of multisensory effect direction in a 400-ms window centered on each neuron's peak sensory response.
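
The consistency measure reduces to counting the time bins whose multisensory modulation shares the sign of the global 400-ms measure. A minimal sketch, assuming a per-bin modulation trace AV − (A + V) has already been computed (names are ours):

    import numpy as np

    def direction_consistency(binned_modulation, global_modulation):
        # Fraction of nonzero time bins in which the sign of the
        # per-bin multisensory modulation matches the global effect
        # direction measured in the 400-ms window.
        signs = np.sign(binned_modulation)
        valid = signs != 0
        return float(np.mean(signs[valid] == np.sign(global_modulation)))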

Information Associated with Fig. S8: No Consistent Stimulus Specificity of Phase-Resetting and Multisensory Effects

We first asked whether the phase-resetting effect is a general (stimulus-nonspecific) process or shows evidence of being stimulus-specific (for instance, whether conspecific monkey faces would elicit a stronger increase in phase coherence than heterospecific human faces). We found very similar response patterns elicited by human and monkey faces (Fig. S8A) that did not differ significantly (n = 52 sites, 100-ms successive time bins, t test, P > 0.05 uncorrected). This suggests that the phase reset of ongoing oscillations is comparable for faces from different primate species.

Next we looked more generally into the voice specificity of the multisensory effect by comparing the direction of multisensory interactions in response to faces paired with intact vocalizations and to the same faces paired with phase-scrambled versions of the original vocalizations (i.e., acoustically degraded versions of the vocalizations that preserve the overall frequency spectrum but eliminate all of the temporal envelope information). Because the face is identical in both cases and the VA delay remains constant, specificity to an intact vocalization would be indicated by deviations from the proportions of enhanced vs. suppressed units predicted by the original VA-delay pattern in the main text (Fig. 2B). We found that for one voice–face pair the direction of multisensory interactions was similar across the intact and phase-scrambled pairs (coo; Fig. S8B, first two columns, χ2 test, P > 0.05), whereas for the other stimulus the proportion of multisensory enhancement vs. suppression significantly differed across pairs with the intact and the manipulated vocalization (grunt; Fig. S8B, last two columns, χ2 test, P = 1.99 × 10−5, χ2 = 18.20). This suggests little, or at least inconsistent, specificity of the multisensory effect for intact vocalizations vs. other types of sounds. This is in agreement with previous data (1) indicating that multisensory responses in the voice area occur in response to different stimuli, including mismatched voice–face pairs.

Together, both findings indicate that the visually evoked phase reset and the subsequent multisensory modulation of spiking responses are general rather than voice- or face-specific mechanisms.
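
Phase scrambling of this kind is commonly implemented by randomizing the Fourier phases of the waveform while keeping its magnitude spectrum. The following sketch illustrates the idea; the authors' exact implementation is not specified here, and the function name is ours.

    import numpy as np

    def phase_scramble(waveform, seed=0):
        # Randomize Fourier phases while preserving the magnitude
        # spectrum: same overall frequency content, but a scrambled
        # temporal envelope.
        rng = np.random.default_rng(seed)
        spectrum = np.fft.rfft(waveform)
        phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
        scrambled = np.abs(spectrum) * np.exp(1j * phases)
        scrambled[0] = spectrum[0]  # keep the DC component unchanged
        return np.fft.irfft(scrambled, n=len(waveform))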

1. Perrodin C, Kayser C, Logothetis NK, Petkov CI (2014) Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. J Neurosci 34(7):2524–2537.

2. Perrodin C, Kayser C, Logothetis NK, Petkov CI (2011) Voice cells in the primate temporal lobe. Curr Biol 21(16):1408–1415.

3. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300.


Fig. S1. Responsiveness and visual modulation in response to stimuli with midrange vs. long VA delays. (A) Summary of the type of sensory spiking responses in the anterior voice-sensitive supratemporal plane, for each subset of stimuli (n = 59 sensory-responsive units for midrange VA delays and n = 64 sensory-responsive units for long VA delays, respectively). (B) Auditory, visual, and audiovisual response amplitudes, as well as the magnitude of the visual modulation (abs[AV − (A + V)]), were comparable in response to stimuli with midrange vs. long VA delays. (C) The nonrectified visual modulation values differed in response to vocalizations with midrange vs. long VA delays, with more negative values for long VA delays. The box plots represent the median, upper, and lower quartiles of the spiking response amplitudes across the population of auditory-responsive units (n = 84 units). *P < 0.05; n.s., not significant.


Fig. S2. Topographic analysis of multisensory responses. (A and B) Spatial organization of multisensory spiking responses, displayed using the AP and ML coordinates of the electrophysiological recording sites spanning the anterior voice-sensitive area in both animals. The stereotaxic coordinates used the Frankfurt-zero standard, where the origin is defined as the midpoint of the interaural line and the infraorbital plane. Black circles indicate the total number of responsive units encountered along electrode penetrations in a given location. The colored areas represent the percentage of units with significant audiovisual interactions (red, multisensory enhanced units; green, multisensory suppressed units).

Fig. S3. Time stability of multisensory effect direction. Proportion of bins during which the multisensory (enhanced/suppressed) effect direction was consistent with the direction calculated in a 400-ms window. Shown is the mean ± SEM (n = 81 visually modulated units). The red dotted line marks 75% of bins reflecting the global effect direction.


Fig. S4. Pattern of multisensory responses as a function of VA delay, calculated using a 200-ms neuronal response window. Proportions of enhanced and suppressed multisensory units by stimulus, arranged with increasing VA delays (n = 66 visually modulated units). Note that the bars are spaced at equidistant intervals for display purposes, thus forming a discrete subsampling of the actual VA delay values (dots). Black dots indicate the proportion of enhanced responses for each VA delay value, while respecting the real relative positions of VA delay values. The red line represents the sinusoid with the best-fitting frequency (6.3 Hz, adjusted R2 = −0.094).

Fig. S5. Pattern of multisensory interactions as a function of VA delay in well-isolated single units. Proportions of enhanced and suppressed multisensory units by stimulus, arranged with increasing VA delays (n = 29 single units).


Fig. S6. Pattern of multisensory interactions as a function of VA delay, calculated using a nonbinary multisensory metric. Additivity index values averaged across nonlinear multisensory units by stimulus, arranged with increasing VA delays (n = 81 units). Shown is the mean ± SEM.


Fig. S7. Grand average broadband evoked potential and oscillatory context evoked by auditory and audiovisual stimulation. (A) Time course of the broadband-evoked LFP in response to stimulation in the primary (auditory, A), nondominant (visual, V), and combined (audiovisual, AV) sensory conditions. The traces are sensory responses averaged across the four stimuli studied (two with long VA delays, two with midrange VA delays) and all recording sites that contained responsive units. Time t = 0 indicates the onset of the stimulus in the relevant modality (sound onset for the auditory and audiovisual modalities, and video onset for the visual modality). Note: The different VA delays of individual vocalizations were compensated for when averaging auditory and audiovisual responses, so that vocalization onsets are aligned across stimuli. In reality, the sound onset is shifted by t = 0 + VA delay for each vocalization. (B) Time-frequency plot of averaged single-trial spectrograms in response to auditory and audiovisual stimulation. The population-averaged spectrogram has been baseline-normalized for display purposes. (C) Time-frequency plot of average phase coherence values across trials. The color code reflects the strength of phase alignment evoked by the auditory and audiovisual stimuli. Black contours indicate the pixels with significant power or phase coherence increases, identified using a bootstrapping procedure (right-tailed z test, P < 0.05, Bonferroni corrected).


Fig. S8. Voice and face specificity of the cross-modal phase-resetting and multisensory effects. (A) Time course of the phase-coherence increase in the 5- to 10-Hz frequency band elicited by human and monkey faces (n = 52 sites). Shown is mean ± SEM. (B) Proportion of enhanced/suppressed units in response to audiovisual pairs combining a face with an intact vocalization (coo and grunt) and to incongruent audiovisual pairs combining the face with a phase-scrambled version of the original vocalizations. *P < 0.01, χ2 test.

Fig. S9. Positive values of cross-modal theta phase are linked to larger proportions of enhanced multisensory spiking responses. Proportion of enhanced and suppressed multisensory units, ordered according to the value of the visually evoked theta (5–10 Hz) phase angle immediately before sound onset (n = 14 and n = 27 multisensory units for negative and positive phase angles, respectively). *P < 0.05, χ2 test.
