Parametrization of Datasets with Low-distortion Embeddings
François Meyer
Center for the Study of Brain, Mind and Behavior, Program in Applied and Computational Mathematics, Princeton University
[email protected] http://www.princeton.edu/~fmeyer
Random Shapes, Tutorials, IPAM 2007
March 14, 2007
Outline
1 Introduction
    Assumptions about the data
    Our definition of the problem
2 Constructing a new parametrization
    A random walk on the dataset
    A new way to measure distances
3 The spectral connection
    Spectral graph theory
    From commute time to spectral geometry
4 Classification of EEG recordings
Introduction
Explosion of high-dimensional datasets: web, biology, medicine, etc.
New tools for data exploration and analysis address issues:
  - signal of interest: complicated geometry
  - data corrupted by noise
  - algorithm complexity
Example 1: Neuroimaging
dogma: each brain region is responsible for a specific function
goal: delineation of functional anatomy in terms of spatial and temporal organization
method:
  - very simple cognitive or sensory input stimulus
  - measure the output signal $x_i$ at each voxel $i$ inside the brain
  - detect significant changes in the signal
Example 1: fMRI of Natural Stimuli
challenge: study the response to complex stimuli (“real life”)
example: subject watches a movie in the MRI scanner
discover neuronal networks involved in complex tasks
how is the analysis performed?
size of the problem: 200,000 time series in $\mathbb{R}^{1000}$
Example 2: Prediction of seizure from EEG
Electroencephalogram: electrical recordings on the scalp
seizure → time-frequency changes in the signal
goal: predict the seizure before the onset
best existing method: brain = nonlinear dynamical system
[Figure: map of the Paris Métro, RER, and tramway network (RATP, www.ratp.fr)]
How does $\kappa(i,j)$ compare to $\delta(i,j)$?
$\kappa(i,j)$ can be compared to the standard distance $\delta$ on the graph
Theorem. If $i$ and $j$ are at a distance $\delta(i,j)$ on the graph, then
$$2\,\delta(i,j) \le \kappa(i,j) \le C\,\delta(i,j),$$
where
$$C = \max_{i,j} \frac{1}{\pi_i P_{i,j}} = \frac{\sum_{i,j} w_{i,j}}{\min_{i,j} w_{i,j}}.$$
Markov chain is reversible, $\pi_i P_{i,j} = \pi_j P_{j,i}$
C can be large
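A minimal numerical check of the two bounds, as a sketch only. It assumes the random walk $P = D^{-1}W$ on a symmetric weight matrix, takes $\delta$ to be the number of edges on a shortest path, and uses the standard identity that the commute time equals the total weight times the effective resistance. The six-vertex graph and the helper names `commute_times` and `hop_distances` are hypothetical and chosen for illustration.

```python
import numpy as np

def commute_times(W):
    """Commute times kappa(i, j) = vol * R_eff(i, j) for symmetric weights W."""
    d = W.sum(axis=1)
    G = np.linalg.pinv(np.diag(d) - W)   # pseudoinverse of the graph Laplacian D - W
    g = np.diag(G)
    return d.sum() * (g[:, None] + g[None, :] - 2.0 * G)

def hop_distances(W):
    """Graph distance delta(i, j), counted in edges (Floyd-Warshall)."""
    delta = np.where(W > 0, 1.0, np.inf)
    np.fill_diagonal(delta, 0.0)
    for k in range(W.shape[0]):
        delta = np.minimum(delta, delta[:, [k]] + delta[[k], :])
    return delta

# Hypothetical example: a triangle {0, 1, 2} with a tail 2 - 3 - 4 - 5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
W = np.zeros((6, 6))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

kappa = commute_times(W)
delta = hop_distances(W)
C = W.sum() / W[W > 0].min()   # sum of all w_ij divided by the smallest edge weight

off = ~np.eye(6, dtype=bool)   # all pairs with i != j
lower_ok = bool(np.all(kappa[off] >= 2 * delta[off] - 1e-9))
upper_ok = bool(np.all(kappa[off] <= C * delta[off] + 1e-9))
print(lower_ok, upper_ok, C)
```

On this graph the upper bound is attained on the tail edges, which are bridges, while pairs inside the triangle sit well below it.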
Your worst commute time in L.A.: Sepulveda Blvd or 405 ?
Maximum commute time: lost in the city...
among all graphs with $N$ vertices, what is the graph with the largest $\kappa(i,j)$?
lollipop graph: path with $(N-1)/3$ vertices, complete subgraph with $(2N+1)/3$ vertices
[Figure: lollipop graph, with labels $(N-1)/3$, $i$, $j$, $(2N+1)/3$]
$$\kappa(i,j) = \frac{4}{27} N^3 + O(N)$$
$$\delta(i,j) = \frac{2}{3} N, \qquad C = 2N(2N+1)/18$$
[Jonasson, 2000]
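The cubic growth can be checked numerically on a small lollipop graph. The sketch below makes several assumptions that are not on the slide: edges have unit weight, the path is attached to one vertex of the complete subgraph, $i$ is the free end of the path, and $j$ is a vertex of the complete subgraph other than the attachment point. The helper names are hypothetical.

```python
import numpy as np

def lollipop_weights(m):
    """Complete subgraph on 2m+1 vertices with a path of m vertices attached (N = 3m+1)."""
    c = 2 * m + 1
    N = c + m
    W = np.zeros((N, N))
    W[:c, :c] = 1.0 - np.eye(c)          # complete subgraph on vertices 0..c-1
    for v in range(c - 1, N - 1):        # path c-1, c, ..., N-1 hanging off vertex c-1
        W[v, v + 1] = W[v + 1, v] = 1.0
    return W

def commute_time(W, i, j):
    """kappa(i, j) = (total weight) * (effective resistance between i and j)."""
    d = W.sum(axis=1)
    G = np.linalg.pinv(np.diag(d) - W)
    return d.sum() * (G[i, i] + G[j, j] - 2.0 * G[i, j])

m = 10
N = 3 * m + 1
W = lollipop_weights(m)
kappa = commute_time(W, N - 1, 0)        # free end of the path <-> a vertex of the complete subgraph
leading = 4.0 / 27.0 * N**3
print(N, kappa, leading, kappa / leading)
```

For this construction the commute time also has the closed form $(c(c-1) + 2m)(m + 2/c)$ with $c = 2m+1$, obtained from the effective resistance of a path in series with a complete graph, which gives a second check on the numerical value.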
The spectral connection
Fundamental matrix
$$Z = (I - (P - \Pi))^{-1} = I + \sum_{k \ge 1} (P^k - \Pi),$$
with $\Pi^T = [\pi_1 | \cdots | \pi_N]$
Z is the Green function of the Laplacian, I − L
Theorem [Bremaud, 1999]. Hitting time $E_i[T_j] = (Z_{j,j} - Z_{i,j})/\pi_j$.
$$E_i[T_j] = 1 + \sum_{k \ne j} P_{i,k}\, E_k[T_j]$$
eigenfunctions $\phi_1, \ldots, \phi_N$ of
$$D^{1/2} P D^{-1/2}, \qquad (2)$$
with eigenvalues $-1 \le \lambda_N \le \cdots \le \lambda_2 < \lambda_1 = 1$.
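As a sketch, the hitting-time formula can be compared with a direct solution of the first-step equations on a small graph. The code assumes $P = D^{-1}W$, $\pi_i = d_i / \sum_j d_j$, and that every row of $\Pi$ equals $\pi$. The example graph is hypothetical.

```python
import numpy as np

# Hypothetical example: a triangle {0, 1, 2} with a tail 2 - 3 - 4 - 5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
n = 6
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

d = W.sum(axis=1)
P = W / d[:, None]                 # transition matrix of the random walk
pi = d / d.sum()                   # stationary distribution

Pi = np.tile(pi, (n, 1))           # assumed convention: every row of Pi equals pi
Z = np.linalg.inv(np.eye(n) - (P - Pi))
H = (np.diag(Z)[None, :] - Z) / pi[None, :]   # H[i, j] = (Z[j, j] - Z[i, j]) / pi[j]

# Cross-check with first-step analysis:
# h(i) = 1 + sum over k != j of P[i, k] h(k) for i != j, and h(j) = 0.
H_direct = np.zeros((n, n))
for j in range(n):
    keep = np.arange(n) != j
    A = np.eye(n - 1) - P[np.ix_(keep, keep)]
    H_direct[keep, j] = np.linalg.solve(A, np.ones(n - 1))

print(np.allclose(H, H_direct))
```

The commute time used elsewhere in the talk is then `H + H.T`.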
The spectral connection
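commute time:
$$\kappa(i,j) = \sum_{k=2}^{N} \frac{1}{1-\lambda_k} \left( \frac{\phi_k(i)}{\sqrt{\pi_i}} - \frac{\phi_k(j)}{\sqrt{\pi_j}} \right)^2. \qquad (3)$$
define an embedding
$$i \mapsto I_k(i) = \frac{1}{\sqrt{1-\lambda_k}}\, \frac{\phi_k(i)}{\sqrt{\pi_i}}, \qquad k = 2, \ldots, N \qquad (4)$$
squared Euclidean distance on the image of the embedding = commute time
$$\kappa(i,j) = \|I(i) - I(j)\|^2 = \sum_{k=2}^{N} |I_k(i) - I_k(j)|^2$$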
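The identity between commute time and squared embedding distance can be verified numerically. This sketch assumes $P = D^{-1}W$ with symmetric $W$, so that $D^{1/2} P D^{-1/2} = D^{-1/2} W D^{-1/2}$ is symmetric, and it takes the reference commute times from the pseudoinverse of $D - W$. The example graph is hypothetical.

```python
import numpy as np

# Hypothetical example: a triangle {0, 1, 2} with a tail 2 - 3 - 4 - 5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
n = 6
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

d = W.sum(axis=1)
pi = d / d.sum()
S = W / np.sqrt(np.outer(d, d))          # D^{1/2} P D^{-1/2} = D^{-1/2} W D^{-1/2}
lam, phi = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]            # largest eigenvalue first, so lam[0] = 1
lam, phi = lam[order], phi[:, order]

# Coordinates I_k(i) for k = 2..N; the trivial eigenvalue lambda_1 = 1 is dropped.
coords = phi[:, 1:] / np.sqrt(pi)[:, None] / np.sqrt(1.0 - lam[1:])[None, :]
sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)

# Reference commute times: total weight times effective resistance.
G = np.linalg.pinv(np.diag(d) - W)
g = np.diag(G)
kappa = d.sum() * (g[:, None] + g[None, :] - 2.0 * G)

print(np.allclose(sq_dist, kappa))
```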
The Laplacian connection
$\phi_k$ is also an eigenfunction of the Laplacian
$$L = I - D^{1/2} P D^{-1/2},$$
with the eigenvalues $\beta_k = 1 - \lambda_k$.
$\phi_k$ minimizes the “distortion”
$$\min_{\phi,\ \|\phi\| = 1} \frac{\sum_{[i,j]} w_{i,j}\, (\phi(i) - \phi(j))^2}{\sum_i d_i\, \phi^2(i)}$$
with $\phi_k$ orthogonal to $\{\phi_1, \ldots, \phi_{k-1}\}$.
Laplacian eigenmaps [Belkin and Niyogi, 2003]
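A small numerical sketch of the link between the distortion quotient and the eigenvalues $\beta_k$. One assumption here goes beyond the slide: the quotient is evaluated at the rescaled vector $f = D^{-1/2}\phi_k$, for which the quotient equals $\beta_k$ exactly. The example graph and the helper `distortion` are hypothetical.

```python
import numpy as np

# Hypothetical example: a triangle {0, 1, 2} with a tail 2 - 3 - 4 - 5.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
n = 6
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))          # D^{1/2} P D^{-1/2}
lam, phi = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, phi = lam[order], phi[:, order]
beta = 1.0 - lam                         # eigenvalues of L = I - D^{1/2} P D^{-1/2}

def distortion(f):
    """Sum over edges [i, j] of w_ij (f(i) - f(j))^2, divided by sum_i d_i f(i)^2."""
    diff2 = (f[:, None] - f[None, :]) ** 2
    return 0.5 * (W * diff2).sum() / (d * f**2).sum()   # 0.5: W lists each edge twice

# Assumption: evaluate the quotient at f = D^{-1/2} phi_k.
quotients = np.array([distortion(phi[:, k] / np.sqrt(d)) for k in range(n)])
print(np.allclose(quotients, beta))
```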
The diffusion distance connection
[Lafon, 2004, Coifman and Lafon, 2006]
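diffusion distance,
$$D_t^2(i,j) = \sum_{k=2}^{N} \lambda_k^{2t} \left( \frac{\phi_k(i)}{\sqrt{\pi_i}} - \frac{\phi_k(j)}{\sqrt{\pi_j}} \right)^2. \qquad (5)$$
commute time = sum of the diffusion distances at all scales $t$
$$\sum_{t=0}^{\infty} D_{t/2}^2(i,j) = \sum_{k=2}^{N} \frac{1}{1-\lambda_k} \left( \frac{\phi_k(i)}{\sqrt{\pi_i}} - \frac{\phi_k(j)}{\sqrt{\pi_j}} \right)^2 = \kappa(i,j)$$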
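The sum over scales is a geometric series in each $\lambda_k$, which a truncated sum makes concrete. The sketch assumes a connected, non-bipartite graph so that $|\lambda_k| < 1$ for $k \ge 2$ and the series converges. The function takes the integer $2t$ as its argument to avoid fractional powers of negative eigenvalues. Graph, truncation length, and names are hypothetical.

```python
import numpy as np

# Hypothetical example: a triangle {0, 1, 2} with a tail 2 - 3 - 4 - 5 (non-bipartite).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]
n = 6
W = np.zeros((n, n))
for a, b in edges:
    W[a, b] = W[b, a] = 1.0

d = W.sum(axis=1)
pi = d / d.sum()
lam, phi = np.linalg.eigh(W / np.sqrt(np.outer(d, d)))
order = np.argsort(lam)[::-1]
lam, phi = lam[order], phi[:, order]

U = phi[:, 1:] / np.sqrt(pi)[:, None]             # phi_k(i) / sqrt(pi_i), k = 2..N
diff2 = (U[:, None, :] - U[None, :, :]) ** 2      # shape (N, N, N-1)

def diffusion_dist2(two_t):
    """D_t^2(i, j) for all pairs, equation (5), with the integer 2t as argument."""
    return diff2 @ (lam[1:] ** two_t)

kappa = diff2 @ (1.0 / (1.0 - lam[1:]))           # equation (3)
partial = sum(diffusion_dist2(t) for t in range(5000))   # sum over t of D_{t/2}^2

print(np.allclose(partial, kappa))
```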
The spectral geometry connection
data sampled on an n-dimensional manifold $M$
$M$ embedded by its heat kernel $K_M(t, x, y)$
Theorem [Bérard et al., 1994].
$$\psi_t : M \to l^2(\mathbb{R}) \qquad (6)$$
$$x \mapsto \left\{ \sqrt{2}\, (4\pi)^{n/4}\, t^{(n+2)/4}\, e^{-\lambda_k t/2}\, \phi_k(x) \right\}_{k \ge 1} \qquad (7)$$
$\forall t > 0$, the map $\psi_t$ is an embedding of $M$ into $l^2(\mathbb{R})$.
scale parameter $t$: similar to diffusion distance
The spectral geometry connection
$\psi_t$: composition of
1 embedding of $M$ by the heat kernel: each point on $M$ is mapped to a bump function.
$$M \to L^2(M) \qquad (8)$$
$$x \mapsto K_M(t/2, x, \cdot) \qquad (9)$$
2 isometry given by the choice of basis, $\{\phi_1, \phi_2, \ldots\}$, of $L^2(M)$: each function of $L^2(M)$ is expanded into the basis of eigenfunctions of the Laplace-Beltrami operator
$$L^2(M) \to l^2(\mathbb{R}) \qquad (10)$$
$$f \mapsto \{\langle f, \phi_k \rangle\}_{k \ge 1} \qquad (11)$$
Algorithm 1: Construction of the embedding
Input:
  - $x_i(t)$, $t = 0, \ldots, T-1$, $i = 1, \ldots, N$
  - $\sigma$; nn: number of nearest neighbors
  - $K$: number of eigenfunctions
Algorithm:
  1 construct the graph defined by the nn nearest neighbors (according to $\|x_i - x_j\|$) of each $x_i$
  2 compute $P$
  3 find the first $K$ eigenfunctions, $\phi_k$, of $D^{1/2} P D^{-1/2}$
Output: for all $x_i$
  - new co-ordinates of $x_i$: $\left\{ \frac{1}{\sqrt{1-\lambda_k}}\, \frac{\phi_k(i)}{\sqrt{\pi_i}} \right\}$, $k = 2, \ldots, N$
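The three steps can be sketched in NumPy for small datasets. Several choices below are assumptions rather than details from the talk: the Gaussian edge weight $\exp(-\|x_i - x_j\|^2/\sigma^2)$, the symmetrization by taking the larger of the two directed weights, the dense eigensolver, the scaling $1/\sqrt{1-\lambda_k}$ of equation (4), and keeping only the coordinates $k = 2, \ldots, K$ that the $K$ computed eigenfunctions provide. The function name `commute_time_embedding` and the test data are hypothetical.

```python
import numpy as np

def commute_time_embedding(X, sigma, nn, K):
    """Sketch of Algorithm 1. X has shape (N, T): one row per time series x_i."""
    # 1. nn-nearest-neighbor graph; Gaussian edge weights are an assumption.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros_like(D2)
    for i in range(X.shape[0]):
        nbrs = np.argsort(D2[i])[1:nn + 1]            # skip the point itself
        W[i, nbrs] = np.exp(-D2[i, nbrs] / sigma**2)
    W = np.maximum(W, W.T)                            # symmetrize the graph

    # 2. P = D^{-1} W enters through its symmetric conjugate D^{1/2} P D^{-1/2}.
    d = W.sum(axis=1)
    pi = d / d.sum()
    S = W / np.sqrt(np.outer(d, d))

    # 3. First K eigenfunctions, largest eigenvalues first.
    lam, phi = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1][:K]
    lam, phi = lam[order], phi[:, order]

    # Output: coordinates for k = 2..K, scaled as in equation (4).
    # The graph must be connected; otherwise lambda_2 = 1 and the scaling blows up.
    coords = phi[:, 1:] / np.sqrt(pi)[:, None] / np.sqrt(1.0 - lam[1:])[None, :]
    return coords, lam

# Hypothetical test data: 60 points evenly spaced on a circle.
theta = 2 * np.pi * np.arange(60) / 60
X = np.column_stack([np.cos(theta), np.sin(theta)])
coords, lam = commute_time_embedding(X, sigma=0.2, nn=4, K=5)
radius = np.hypot(coords[:, 0], coords[:, 1])
print(coords.shape, lam[0], radius.min(), radius.max())
```

For evenly spaced points on a circle the first two new coordinates again trace a circle, so `radius` is constant up to rounding. A dense eigendecomposition only scales to a few thousand points; larger $N$ would need a sparse eigensolver.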
A toy example
[Figure: 3-D plots for the toy example; one plot has axes labeled "The 2nd EigenVector", "The 3rd EigenVector", and "The 4th EigenVector"]
Classification of EEG recordings
classification of EEG recordings into baseline and ictal states
Hypothesis: we can find a lower dimensional representation for classification
More details can be found in [Ramírez-Vélez et al., 2006, Meyer and Shen, 2007]
Human dataset
scalp electroencephalograms
55 electrode channels, lowpass filtered at 256 Hz.
baseline, pre-ictal, ictal,and post-ictal time segments
Eigensolvers for large ($N = 10^5$ to $10^6$) sparse matrices
Open questions
much faster eigensolvers are needed... Matlab blows up for $N > 10{,}000$
real-time update of $\phi_k$ with new incoming data?
$\phi_k$: sensitive to $\sigma$ and nn
how many new co-ordinates (local dimension)?
Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396.
Bérard, P., Besson, G., and Gallot, S. (1994). Embedding Riemannian manifolds by their heat kernel. Geometric and Functional Analysis, 4(4):373–398.
Bremaud, P. (1999). Markov Chains. Springer Verlag.
Coifman, R. and Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21:5–30.
Jonasson, J. (2000). Lollipop graphs are extremal for commute times. Random Structures and Algorithms, 16(2):131–142.
Lafon, S. (2004). Diffusion maps and geometric harmonics. PhD thesis, Yale University, New Haven.
Meyer, F. and Shen, X. (2007). Exploration of high dimensional biomedical datasets with low-distortion embedding. In Proc. Data Mining for Biomedical Informatics Workshop, 7th SIAM International Conference on Data Mining. To appear.
Ramírez-Vélez, M., Staba, R., Barth, D., and Meyer, F. (2006). Nonlinear classification of EEG data for seizure detection. In Proc. IEEE International Symposium on Biomedical Imaging: Macro to Nano, pages 956–959.