How to Visualize High-Dimensional Data? (Also: How to make 2000 bucks in an hour?) Laurens van der Maaten
Page 1: Talk data sciencemeetup

How to Visualize High-Dimensional Data? (Also: How to make 2000 bucks in an hour?)

Laurens van der Maaten

Page 2: Talk data sciencemeetup

Visualization

• Visualization is a key tool in the analysis of data

Works for low-dimensional data only!

Page 3: Talk data sciencemeetup

Data visualization

• What can we do to visualize Big Data that has lots of variables?

• Make a scatter plot in which each point corresponds to a measurement

• Arrange the points such that nearby points model similar measurements

• How do we determine the locations of the points in the map?

• Techniques for dimension reduction, multidimensional scaling, or embedding

Page 4: Talk data sciencemeetup

Embedding

Page 5: Talk data sciencemeetup

Embedding

• The input of an embedding algorithm is:

• Collection of high-dimensional data points or...

• Collection of pairwise (dis)similarities (a distance table)

• The output of an embedding algorithm is:

• Collection of low-dimensional data points (a map)

Page 6: Talk data sciencemeetup

Principal components analysis

• Principal Components Analysis maps the data in a linear subspace, such that the variance of the projected data is maximized:

$$w^\top x$$
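A minimal sketch of the projection above, assuming the first principal direction is found by power iteration on the covariance matrix. The function names, the starting vector, and the toy data are illustrative assumptions, not taken from the talk.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (w, mean): the unit vector w maximizing the variance of w^T x."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    centered = [[p[k] - mean[k] for k in range(d)] for p in points]
    # Covariance matrix of the centered data.
    cov = [[sum(x[a] * x[b] for x in centered) / n for b in range(d)]
           for a in range(d)]
    w = [1.0] * d  # assumed starting vector for the power iteration
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return w, mean

def project(points, w, mean):
    """One-dimensional coordinates w^T (x - mean) for every point."""
    return [sum(w[k] * (p[k] - mean[k]) for k in range(len(w))) for p in points]

# Toy data lying close to the line y = 2x (made up for illustration).
points = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
w, mean = first_principal_component(points)
coords = project(points, w, mean)
```

On this toy data `w` points roughly along the direction (1, 2), and `coords` preserves the ordering of the points along that line.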

Page 7: Talk data sciencemeetup

Principal components analysis

Page 8: Talk data sciencemeetup

Principal components analysis

Page 9: Talk data sciencemeetup

t-Distributed Stochastic Neighbor Embedding

• Measure pairwise similarities between high-dimensional objects:

$$p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_k \sum_{l \neq k} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)}$$

High-D
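The high-dimensional similarities can be sketched as follows, assuming a single bandwidth `sigma` shared by all points, as in the formula on the slide. The published method instead chooses a bandwidth per point from a perplexity setting, which is omitted here. The function name and example points are illustrative assumptions.

```python
import math

def high_dim_similarities(points, sigma=1.0):
    """Matrix P with p_ij proportional to exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Diagonal entries are zero and all entries together sum to one.
    """
    n = len(points)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    weights = [
        [0.0 if i == j
         else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

# Three made-up 3-D points: the first two are close, the third is far away.
X = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0)]
P = high_dim_similarities(X)
```

Here `P[0][1]` is much larger than `P[0][2]`, reflecting that the first two points are neighbors.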

Page 10: Talk data sciencemeetup

t-Distributed Stochastic Neighbor Embedding

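• Move points around to minimize:

$$\mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_k \sum_{l \neq k} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$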

Low-D
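A small sketch of the objective, assuming a hand-written similarity matrix `P` and two candidate maps. It only evaluates the cost and does not perform the optimization. All names and numbers are illustrative assumptions.

```python
import math

def low_dim_similarities(coords):
    """Matrix Q with q_ij proportional to (1 + ||y_i - y_j||^2)^-1."""
    n = len(coords)

    def kernel(a, b):
        return 1.0 / (1.0 + sum((u - v) ** 2 for u, v in zip(a, b)))

    weights = [[0.0 if i == j else kernel(coords[i], coords[j])
                for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def kl_divergence(P, Q):
    """KL(P || Q) summed over all pairs with p_ij > 0."""
    return sum(p * math.log(p / q)
               for row_p, row_q in zip(P, Q)
               for p, q in zip(row_p, row_q) if p > 0)

# Made-up similarities: points 0 and 1 are similar, point 2 is not.
P = [[0.0, 0.3, 0.1],
     [0.3, 0.0, 0.1],
     [0.1, 0.1, 0.0]]

good_map = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]  # 0 and 1 placed close together
bad_map = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0)]   # 0 and 2 placed close together

cost_good = kl_divergence(P, low_dim_similarities(good_map))
cost_bad = kl_divergence(P, low_dim_similarities(bad_map))
```

The map that keeps the similar pair together gets the lower cost, which is what moving the points around is meant to achieve.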

Page 11: Talk data sciencemeetup

t-Distributed Stochastic Neighbor Embedding

[Figure: t-SNE map with a legend for classes 0 to 9]

van der Maaten & Hinton, 2008

Page 12: Talk data sciencemeetup
Page 13: Talk data sciencemeetup
Page 14: Talk data sciencemeetup

Scaling up t-SNE

• Interpret evaluating t-SNE gradient as simulating an N-body system

• Use a Barnes-Hut algorithm to approximate t-SNE gradient in O(N log N)
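The core idea behind the approximation can be sketched on a single cell: when a group of points is far away relative to its size, its contribution is summarized by its center of mass. This is not a full Barnes-Hut implementation (there is no tree), and the threshold `theta`, the acceptance rule, and the function names are assumptions made for illustration.

```python
import math

def kernel(a, b):
    """Unnormalized Student-t similarity (1 + ||a - b||^2)^-1."""
    return 1.0 / (1.0 + sum((u - v) ** 2 for u, v in zip(a, b)))

def center_of_mass(cell):
    dim = len(cell[0])
    return [sum(p[k] for p in cell) / len(cell) for k in range(dim)]

def exact_sum(point, cell):
    """Sum of kernel values between `point` and every point in the cell."""
    return sum(kernel(point, other) for other in cell)

def summarized_sum(point, cell):
    """Approximation: treat the whole cell as len(cell) copies of its center."""
    return len(cell) * kernel(point, center_of_mass(cell))

def is_far_enough(point, cell, theta=0.5):
    """Assumed acceptance rule: cell extent divided by distance is below theta."""
    dim = len(point)
    extent = max(max(p[k] for p in cell) - min(p[k] for p in cell)
                 for k in range(dim))
    center = center_of_mass(cell)
    distance = math.sqrt(sum((u - v) ** 2 for u, v in zip(point, center)))
    return distance > 0 and extent / distance < theta

point = (0.0, 0.0)
far_cell = [(9.8, 10.1), (10.2, 9.9), (10.0, 10.3), (10.1, 9.7)]
near_cell = [(0.5, 0.0), (-0.5, 0.2)]

exact = exact_sum(point, far_cell)
approx = summarized_sum(point, far_cell)
```

For the distant cell the summary differs from the exact sum by well under one percent, while the nearby cell fails the acceptance rule and would have to be split or summed exactly.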

Page 15: Talk data sciencemeetup

[Figure: t-SNE map of MNIST with a legend for digit classes 0 to 9]

Scaling up t-SNE

• Scale up t-SNE to large data sets (MNIST, N = 70K; T = 10m):

van der Maaten, 2013

Page 16: Talk data sciencemeetup

Scaling up t-SNE

• Even to data sets with millions of data points (TIMIT, N = 1.1M; T = 3h 40m):

Page 17: Talk data sciencemeetup

So how did you win 2000 bucks in an hour?

• Kaggle and Merck hosted a molecular activity visualization challenge:

• Features derived from molecules’ chemical structure

• Each molecule also has an activity value

• The data distribution somehow changes over time

• Visualize features using t-SNE, and color according to activity and time

Page 18: Talk data sciencemeetup

Merck visualization (1)

[Figure, two panels: data set #8 colored by activity (color scale 5.5 to 10); data set #8 colored by time (color scale 0.1 to 1)]

Page 19: Talk data sciencemeetup

Merck visualization (2)

[Figure, two panels: data set #8 colored by time (color scale 0 to 0.9); data set #8 colored by activity (color scale 5.5 to 10)]

Page 20: Talk data sciencemeetup
Page 21: Talk data sciencemeetup

Limitations of using a single map

• Suppose we are visualizing words based on association data, or authors based on co-authorships, or Enron emails, or scale-free networks, etc.

• How can we model the words “river”, “bank”, and “bailout” in a single map?

[Figure: the words RIVER, BANK, and BAILOUT]

Page 22: Talk data sciencemeetup

Multiple maps t-SNE

• Construct multiple maps, and give each object a point in each map

• Assign an importance weight to each point

• Define the similarity between two points under the multiple maps model as a weighted sum over the similarities in the individual maps

van der Maaten & Hinton, MLJ 2012

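[Figure: two maps, Map 1 and Map 2, containing the words RIVER, BANK, and BAILOUT; BANK appears in both maps; the importance weights shown are 1 and ½]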

Page 23: Talk data sciencemeetup

Multiple maps t-SNE

• Definition of similarity under multiple maps model:

$$q_{j|i} = \frac{\sum_m \pi_i^{(m)} \pi_j^{(m)} \left(1 + \|y_i^{(m)} - y_j^{(m)}\|^2\right)^{-1}}{\sum_{m'} \sum_{k \neq i} \pi_i^{(m')} \pi_k^{(m')} \left(1 + \|y_i^{(m')} - y_k^{(m')}\|^2\right)^{-1}}$$

van der Maaten & Hinton, 2012

• Herein, we define the importance weights as:

$$\pi_i^{(m)} = \frac{\exp\left(w_i^{(m)}\right)}{\sum_{m'} \exp\left(w_i^{(m')}\right)}$$

• All map coordinates and importance weights are learned jointly
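The multiple maps similarity can be sketched for the river / bank / bailout example, assuming hand-picked coordinates and raw weights rather than learned ones. The function names and all numbers are illustrative assumptions.

```python
import math

def importance_weights(raw):
    """Softmax over maps of one point's raw weights w_i^(m)."""
    exps = [math.exp(v) for v in raw]
    total = sum(exps)
    return [e / total for e in exps]

def multiple_maps_similarities(maps, raw_weights, i):
    """Return q_{j|i} for every j.

    maps[m][j] is the location of point j in map m;
    raw_weights[j][m] is the unnormalized weight of point j in map m.
    """
    n = len(raw_weights)
    pi = [importance_weights(raw) for raw in raw_weights]

    def kernel(a, b):
        return 1.0 / (1.0 + sum((u - v) ** 2 for u, v in zip(a, b)))

    scores = [
        0.0 if j == i else sum(
            pi[i][m] * pi[j][m] * kernel(maps[m][i], maps[m][j])
            for m in range(len(maps)))
        for j in range(n)
    ]
    total = sum(scores)
    return [s / total for s in scores]

# Points: 0 = RIVER, 1 = BANK, 2 = BAILOUT (coordinates are made up).
maps = [
    [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0)],  # map 0: RIVER next to BANK
    [(5.0, 5.0), (0.0, 0.0), (0.5, 0.0)],  # map 1: BANK next to BAILOUT
]
# RIVER lives mostly in map 0, BAILOUT mostly in map 1, BANK equally in both.
raw_weights = [[3.0, -3.0], [0.0, 0.0], [-3.0, 3.0]]

q_bank = multiple_maps_similarities(maps, raw_weights, i=1)
q_river = multiple_maps_similarities(maps, raw_weights, i=0)
```

With these values BANK is about equally similar to RIVER and BAILOUT, while RIVER and BAILOUT remain dissimilar to each other, which a single map cannot express.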

Page 24: Talk data sciencemeetup

[Figure: word map labels, apparently from a multiple maps t-SNE visualization of word association data. One group of labels covers government, age, and appearance (e.g. CONGRESS, SENATOR, GRANDPARENTS, ADULT, BEAUTIFUL, UGLY); a second group covers sports, games, and clothing (e.g. BASEBALL, UMPIRE, POKER, CHESS, JACKET, TROUSERS). CHEERLEADER, FREEDOM, OFFICIAL, and MODERN appear in both groups.]

Page 25: Talk data sciencemeetup

[Figure: word map labels, apparently from a multiple maps t-SNE visualization of word association data. One group of labels covers doors and entrances, royalty, distance, and farming (e.g. DOOR, HALLWAY, HINGE, KING, QUEEN, THRONE, FAR, DISTANT, CORN, HARVEST); the second group repeats the government, age, and appearance labels from the previous page (e.g. CONGRESS, SENATOR, GRANDPARENTS, BEAUTIFUL).]

Page 26: Talk data sciencemeetup

I want to give this stuff a try!

• Type “t-SNE” into Google, and click the first link

• You’ll find papers, examples, and implementations (in Matlab, Python, R, and C++)

• You can also drop me a line: [email protected]

Page 27: Talk data sciencemeetup