Neural Encoding of Speech in Auditory Cortex


Jonathan Z. Simon

Department of Biology, Department of Electrical & Computer Engineering, Institute for Systems Research

University of Maryland

University College London, 22 June 2015

http://www.isr.umd.edu/Labs/CSSL/simonlab

Acknowledgements

Current (Simon Lab & Affiliates): Francisco Cervantes, Natalia Lapinskaya, Mahshid Najafi, Alex Presacco, Krishna Puvvada, Lisa Uible, Peng Zan

Past (Simon Lab & Affiliate Labs): Nayef Ahmar, Sahar Akram, Murat Aytekin, Claudia Bonin, Maria Chait, Marisel Villafane Delgado, Kim Drnec, Nai Ding, Victor Grau-Serrat, Julian Jenkins, David Klein, Ling Ma, Kai Sum Li, Huan Luo, Raul Rodriguez, Ben Walsh, Juanjuan Xiang, Jiachen Zhuo

Collaborators: Pamela Abshire, Samira Anderson, Behtash Babadi, Catherine Carr, Monita Chatterjee, Alain de Cheveigné, Didier Depireux, Mounya Elhilali, Bernhard Englitz, Jonathan Fritz, Cindy Moss, David Poeppel, Shihab Shamma

Funding: NIH (NIDCD, NIA, NIBIB); USDA

Past Postdocs & Visitors: Aline Gesualdi Manhães, Dan Hertz, Yadong Wang

Undergraduate Students: Abdulaziz Al-Turki, Nicholas Asendorf, Sonja Bohr, Elizabeth Camenga, Corinne Cameron, Julien Dagenais, Katya Dombrowski, Kevin Hogan, Kevin Kahn, Alexandria Miller, Isidora Ranovadovic, Andrea Shome, Madeleine Varmer, Ben Walsh

Outline

• Magnetoencephalography (MEG)

• Cortical Representations of Speech

- Encoding vs. Decoding

- Attended vs. Unattended Speech

• Work in Progress

- Attentional Dynamics

- Aging and the Cocktail Party Problem

- Foreground vs. Background

Magnetoencephalography

• Non-invasive, Passive, Silent Neural Recordings

• Simultaneous Whole-Head Recording (~200 sensors)

• Sensitivity

- high: ~100 fT (10^−13 Tesla)

- low: ~10^4 – ~10^6 neurons

• Temporal Resolution: ~1 ms

• Spatial Resolution

- coarse: ~1 cm

- ambiguous

Neural Signals & MEG

[Figure: head cross-section (tissue, CSF, skull, scalp) with current flow, the orientation of the magnetic field B (MEG) and the potential V (EEG) at the recording surface; Magnetic Dipolar Field Projection. Photo by Fritz Goro.]

• Direct electrophysiological measurement

- not hemodynamic

- real-time

• No unique solution for distributed source

• Measures spatially synchronized cortical activity

• Fine temporal resolution (~1 ms)

• Moderate spatial resolution (~1 cm)

Time Course of MEG Responses

Auditory Evoked Responses

• MEG Response Patterns Time-Locked to Stimulus Events

• Robust

• Strongly Lateralized

[Figure panels: Pure Tone; Broadband Noise]

Component Analysis

• Each component has both spatial and temporal profile

• Data driven, e.g., PCA, ICA, DSS

• DSS: ordered by trial-to-trial reproducibility

• ➔ Spatial Filter, e.g. for single trials

• Can analyze temporal processing separately from anatomical origin

Särelä & Valpola (2005)

de Cheveigné & Simon, J. Neurosci. Methods (2008)
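As a rough illustration of the DSS idea on this slide (components ordered by trial-to-trial reproducibility), the sketch below whitens the sensor data and then runs PCA on the trial average. It is a minimal NumPy sketch under assumed conventions, not the authors' implementation: the function name `dss`, the array layout (trials × time × sensors), and the eigenvalue tolerance `tol` are all illustrative choices.

```python
import numpy as np

def dss(trials, tol=1e-9):
    """Spatial filters ordered by trial-to-trial reproducibility.

    trials: array of shape (n_trials, n_times, n_sensors) (assumed layout).
    Returns (filters, scores); filters[:, 0] is the most reproducible component.
    """
    n_trials, n_times, n_sensors = trials.shape
    # Remove each trial's per-sensor mean over time.
    x = trials - trials.mean(axis=1, keepdims=True)
    # Covariance of all single-trial data.
    flat = x.reshape(-1, n_sensors)
    c_total = flat.T @ flat / flat.shape[0]
    # Covariance of the trial average (the reproducible part).
    evoked = x.mean(axis=0)
    c_evoked = evoked.T @ evoked / n_times
    # Whiten with respect to the total covariance.
    evals, evecs = np.linalg.eigh(c_total)
    keep = evals > tol * evals.max()
    whitener = evecs[:, keep] / np.sqrt(evals[keep])
    # PCA of the evoked covariance in the whitened space.
    scores, rotation = np.linalg.eigh(whitener.T @ c_evoked @ whitener)
    order = np.argsort(scores)[::-1]
    return whitener @ rotation[:, order], scores[order]
```

Under these assumptions, `trials @ filters[:, 0]` gives the single-trial time courses of the most reproducible component, which is one way to read "spatial filter for single trials".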

MEG Responses to Speech Modulations

[Figure: Auditory Model of the speech stimulus and MEG responses to speech modulations (up to ~10 Hz), related by the “Spectro-Temporal Response Function”.]

Ding & Simon, J Neurophysiol (2012)

MEG Responses Predicted by STRF Model

Linear Kernel = STRF

Ding & Simon, J Neurophysiol (2012); Zion-Golumbic et al., Neuron (2013)
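To make "Linear Kernel = STRF" concrete, here is a minimal sketch of a forward model reduced to a single stimulus band (a temporal response function). It assumes ridge regression on a lagged copy of the stimulus envelope; the function names, the `ridge` parameter, and the choice of estimator are illustrative assumptions, not the method used in the cited papers.

```python
import numpy as np

def lagged(stimulus, n_lags):
    """Design matrix whose column k is the stimulus delayed by k samples."""
    n = len(stimulus)
    x = np.zeros((n, n_lags))
    for k in range(n_lags):
        x[k:, k] = stimulus[:n - k]
    return x

def fit_trf(stimulus, response, n_lags, ridge=1.0):
    """Estimate the linear kernel mapping stimulus to response (ridge regression)."""
    x = lagged(stimulus, n_lags)
    return np.linalg.solve(x.T @ x + ridge * np.eye(n_lags), x.T @ response)

def predict_response(stimulus, trf):
    """Predicted response: the stimulus convolved with the kernel."""
    return lagged(stimulus, len(trf)) @ trf
```

A full STRF would repeat the lagged columns for every frequency band of the auditory spectrogram; the single-band case keeps the sketch short.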

Neural Reconstruction of Speech Envelope

[Figure: MEG Responses → Decoder → Speech Envelope (up to ~10 Hz); stimulus speech envelope shown with reconstructed stimulus speech envelope; 2 s scale bar.]

Reconstruction accuracy comparable to single unit & ECoG recordings
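The decoding direction (MEG responses → decoder → speech envelope) can be sketched as a linear backward model. The example below is a hedged illustration only: it assumes a ridge-regularized linear decoder over a short window of response samples following each stimulus sample, and the names `lead_matrix`, `fit_decoder`, `reconstruct`, and `accuracy` are hypothetical.

```python
import numpy as np

def lead_matrix(responses, n_lags):
    """Stack every channel advanced by 0..n_lags-1 samples.

    responses: (n_times, n_channels). The response at time t+k is used to
    reconstruct the stimulus at time t, since neural responses lag the stimulus.
    """
    n_times, n_channels = responses.shape
    x = np.zeros((n_times, n_channels * n_lags))
    for k in range(n_lags):
        x[:n_times - k, k * n_channels:(k + 1) * n_channels] = responses[k:]
    return x

def fit_decoder(responses, envelope, n_lags, ridge=1.0):
    """Ridge-regression decoder from multichannel responses to the envelope."""
    x = lead_matrix(responses, n_lags)
    return np.linalg.solve(x.T @ x + ridge * np.eye(x.shape[1]), x.T @ envelope)

def reconstruct(responses, decoder, n_lags):
    """Apply a fitted decoder to (possibly held-out) responses."""
    return lead_matrix(responses, n_lags) @ decoder

def accuracy(reconstruction, envelope):
    """Reconstruction accuracy as a correlation coefficient."""
    return np.corrcoef(reconstruction, envelope)[0, 1]
```

Fitting on one portion of the data and scoring on another avoids overstating accuracy; the correlation returned by `accuracy` corresponds to the kind of "reconstruction accuracy" reported on the following slides.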


Neural Representation of Speech: Temporal

Speech in Noise

Ding & Simon, J Neuroscience (2013)

Speech in Noise: Results

[Figure, three panels. A: Neural Reconstruction of Underlying Speech Envelope (examples at +6 dB and -6 dB; 1 s scale bar). B: Reconstruction Accuracy (correlation, 0 to .2, vs. SNR in dB: Q, +6, +2, −3, −6, −9). C: Correlation with Intelligibility across Subjects (reconstruction accuracy, 0 to .3, vs. intelligibility, 0 to 100%).]


Noise-Vocoded Speech

Ding, Chatterjee & Simon, NeuroImage (2014)

“in noise” = +3 dB SNR

Noise-Vocoded Speech: Results

• Cortical entrainment to natural speech robust to noise

• Cortical entrainment to vocoded speech is not

• Not explainable by passive envelope tracking mechanisms

- noise vocoding does not directly affect the stimulus envelope
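The point that noise vocoding does not directly affect the stimulus envelope can be seen in a toy vocoder: each band keeps its envelope and only the fine structure is replaced by band-limited noise. This is an illustrative sketch, not the stimulus-generation code from the cited study; the FFT-based filters, the band edges, and the function names are assumptions.

```python
import numpy as np

def bandpass(x, fs, lo, hi):
    """Ideal band-pass filter implemented by zeroing FFT bins outside [lo, hi)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    spec[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def analytic_envelope(x):
    """Envelope as the magnitude of the analytic signal (FFT-based Hilbert)."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

def noise_vocode(x, fs, edges, seed=0):
    """Replace the fine structure in each band with noise, keeping the band envelope."""
    rng = np.random.default_rng(seed)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        envelope = analytic_envelope(bandpass(x, fs, lo, hi))
        carrier = bandpass(rng.standard_normal(len(x)), fs, lo, hi)
        carrier /= np.sqrt(np.mean(carrier ** 2))  # unit-RMS noise carrier
        out += envelope * carrier
    return out
```

With few bands the spectro-temporal fine structure is degraded while the slow envelope within each band is preserved, which is why a purely passive envelope tracker would not be expected to distinguish the two stimulus types.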

Cortical Speech Representations

• Neural Representations: Encoding & Decoding

• Linear models: Useful & Robust

• Speech Envelope only (as seen by MEG)

• Envelope Rates: ~ 1 - 10 Hz

The Cocktail Party

[Image: Alex Katz, The Cocktail Party]

Experiments

[Figure: speech with competing speech]

Experiments in Progress

[Figure: cocktail-party scenarios: speech with competing speech; reverberation; older listener; multiple competing speech streams]

Two Competing Speakers

Selective Neural Encoding



Unselective vs. Selective Neural Encoding



Stream-Specific Representation

[Figure: attended speech envelopes reconstructed from MEG (attending to speaker 1); representative subject and grand average over subjects.]

Identical Stimuli!

Ding & Simon, PNAS (2012)

Single Trial Speech Reconstruction

Overall Speech Reconstruction

[Figure: correlation (0 to 0.2) of the attended speech reconstruction and the background reconstruction with the attended speech and with the background.]

Distinct neural representations for different speech streams
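One simple way to read out which stream a reconstruction represents is to correlate it with each speaker's envelope and pick the larger value. The function below is a hypothetical sketch of that comparison, assuming the reconstruction and both envelopes are sampled on the same time base; it is not taken from the cited paper.

```python
import numpy as np

def decode_attention(reconstruction, envelope_1, envelope_2):
    """Return (speaker, r1, r2): the speaker whose envelope correlates best
    with the neural reconstruction, plus both correlation values."""
    r1 = np.corrcoef(reconstruction, envelope_1)[0, 1]
    r2 = np.corrcoef(reconstruction, envelope_2)[0, 1]
    return (1 if r1 >= r2 else 2), r1, r2
```

Applied trial by trial, the fraction of trials assigned to the instructed speaker gives a single-trial measure of how stream-specific the representation is.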

Invariance Under Relative Loudness Change?

Invariance under Relative Loudness Change

[Figure: Neural Results. Reconstruction correlation (.1 to .2) for attended and background speech vs. Speaker Relative Intensity (dB): -8, -5, 0, 5, 8.]

• Neural representation invariant to relative loudness change

• Stream-based Gain Control, not stimulus-based

Forward STRF Model

Spectro-Temporal Response Function (STRF)

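A forward STRF model of this kind is usually written as a linear convolution of the stimulus spectrogram with the STRF. The symbols below are assumed for illustration and do not appear on the slides: r(t) is the neural response, S(f, t) the auditory spectrogram of the stimulus, and ε(t) the residual not explained by the model.

```latex
r(t) = \sum_{f} \sum_{\tau} \mathrm{STRF}(f, \tau)\, S(f, t - \tau) + \varepsilon(t)
```

When the STRF is separable, it factors into a spectral profile times a temporal response function, STRF(f, τ) = g(f) · TRF(τ).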

STRF Results

• STRF separable (time, frequency)

• 300 Hz - 2 kHz dominant carriers

• M50_STRF positive peak

• M100_STRF negative peak

• M100_STRF strongly modulated by attention, but not M50_STRF

[Figure: STRFs for Attended and Background speech (frequency, .2 to 3 kHz, vs. time, 0 to 200 ms), with the TRFs for attended and background.]

Neural Sources

[Figure: source locations of M50_STRF, M100_STRF, and M100 in the Left and Right hemispheres (axes: anterior, posterior, medial; 5 mm scale bar).]

• M100_STRF source near (same as?) M100 source: Planum Temporale

• M50_STRF source is anterior and medial to M100 (same as M50?): Heschl’s Gyrus

• PT strongly modulated by attention, but not HG

Cortical Object-Processing Hierarchy

[Figure: two panels of response vs. time (0 to 400 ms). Attentional Modulation: attended vs. background. Influence of Relative Intensity: clean, -8 dB, -5 dB, 0 dB, 5 dB, 8 dB.]

• M100_STRF strongly modulated by attention, but not M50_STRF.

• M100_STRF invariant against acoustic changes.

• Objects neurally well represented at 100 ms, but not at 50 ms.

Studies In Progress

• Attentional Dynamics

• Aging & Neural Representations of Speech

• Neural Representations of the Background

Attentional Dynamics

[Figure: probability of attending Speaker 1, for the conditions Attend to Speaker 1, Attend to Speaker 2, and Switch Attention.]

Younger vs. Older Listeners

[Figure: Speech Reconstruction vs. Integration window (ms) for Younger Adults and Older Adults, In Quiet and with Competing Speaker.]

Three Competing Speakers

[Figure: speech and two competing speech streams]

Foreground vs. Background

Individual Speech Streams

[Figure: reconstruction accuracy (−0.2 to 0.4) for Noise Floor, Individual Backgrounds, Fused Background, and Foreground.]

Backgrounds vs. Background

Integration Window over Late Times Only

[Figure: reconstruction accuracy (−0.2 to 0.4) for Noise Floor, Individual Backgrounds, Fused Background, and Foreground.]



Backgrounds vs. Background

Why not?

[Figure: Two Speakers (Speaker 1, Speaker 2), Stimulus Background, and MEG Response; reconstruction accuracy (−0.2 to 0.4) for Noise Floor, Fused Background, and Foreground.]




Individual Speech Streams II

[Figure: reconstruction accuracy (−0.2 to 0.4) for Noise Floor, Summed Individual Backgrounds, Fused Background, and Foreground.]



Background Representations

[Figure: fused background reconstruction plotted against individual backgrounds (summed) reconstruction; both axes 0 to 0.3.]

High latency areas (PT) represent fused background with better fidelity than individual backgrounds (p = 0.012)



Foreground vs. Background: Early vs. Late

[Figure: reconstruction accuracy (−0.2 to 0.4) for Foreground and Individual Backgrounds.]

[Figure, two panels: foreground reconstruction (−0.05 to 0.35) vs. backgrounds reconstruction (0 to 0.3). Core Representations, Early (HG): p > 0.10. Higher Order Representations, Late (PT): p < 1e-11.]

HG represents attended and unattended speech with almost equal fidelity

Summary

• Cortical representations of speech

- representation of envelope (up to ~10 Hz)

• Cortical Processing Hierarchy: Consistent with being neural representation of auditory perceptual object

• Object representation at 100 ms latency (PT), but not by 50 ms (HG)

• Preliminary evidence for

- PT: additional fused background representation

- HG: almost equal representations

Thank You
