SUPERFICIAL DATA ANALYSIS: Exploring Millions of Social Stereotypes. Presenters: Nguyen Dao Tan Bao, Cao Dinh Qui, Pham Huy Thanh. Instructor: Prof. Lothar Piepmeyer.

Superficial data analysis

Nov 01, 2014

Bao Nguyen

Transcript
Page 1: Superficial data analysis

SUPERFICIAL DATA ANALYSIS

Exploring Millions of Social Stereotypes

Presenters: Nguyen Dao Tan Bao, Cao Dinh Qui, Pham Huy Thanh

Instructor: Prof. Lothar Piepmeyer

Page 2: Superficial data analysis

Superficial Data Analysis

How do stereotypes and our appearance influence the way we are perceived?

The answers were found by analyzing a large pool of data collected from a diverse group of people.

Page 3: Superficial data analysis

Let us tell you the story of FaceStat.com, from Brendan O’Connor & Lukas Biewald.

We love the story.

Page 4: Superficial data analysis

How do we perceive AGE, GENDER, INTELLIGENCE, and ATTRACTIVENESS?

What insight can we extract from millions of anonymous opinions?

Page 5: Superficial data analysis

Collect the data

FaceStat runs on an SQL database.

Each user judgment is saved as a (face ID, attribute, judgment) triple.

The goal is to explore the relationships between different types of perceived attributes.

Page 6: Superficial data analysis

Collect the data

Example question: “How old do I look?”

We look at the values of the age judgments, count how many times each value occurs, and order the values by this count.

10 million rows of data can be extracted from the database.

In the output, the first number is the frequency count and the second string is the response value.
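The counting step can be sketched in a few lines of Python. This is only an illustration, not the FaceStat query: the `judgments` list is made-up sample data in the (face ID, attribute, judgment) shape described earlier.

```python
from collections import Counter

# Hypothetical (face ID, attribute, judgment) triples; the real rows live in the SQL database.
judgments = [
    (1, "age", "25"),
    (2, "age", "30"),
    (3, "age", "25"),
    (4, "age", "twenty"),
    (5, "age", "25"),
    (6, "age", "30"),
    (1, "gender", "female"),
]

# Count how often each age response occurs.
age_counts = Counter(value for _, attribute, value in judgments if attribute == "age")

# Print frequency count first, response value second, most frequent first.
for value, count in age_counts.most_common():
    print(count, value)
```

Ordering by count puts the common, well-formed answers at the top and pushes the odd ones to the tail, which makes them easy to spot.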

Page 7: Superficial data analysis

Clean up the data

Problems in the responses to “How old do I look?”:

Errors: stray \r\n line endings

Outliers: rare values

Format of user responses: text instead of a number
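A minimal Python sketch of how these three problems might be handled. The sample responses and the function name `clean_age` are hypothetical, and the cutoff of 100 is borrowed from the outlier rule stated later in the slides.

```python
def clean_age(raw):
    """Return the age as an int, or None if the response is unusable."""
    text = raw.strip()            # removes stray "\r\n" and surrounding spaces
    if not text.isdecimal():      # free text such as "twenty" is rejected
        return None
    age = int(text)
    return age if age < 100 else None   # treat ages of 100 or more as outliers

# Hypothetical raw responses.
raw_responses = ["25\r\n", " 30", "twenty", "1000", "7"]
cleaned = [clean_age(r) for r in raw_responses]
print(cleaned)
```

Returning `None` instead of dropping the row keeps the bad responses countable, so we can still report how much data was lost.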

Page 8: Superficial data analysis

Clean up the data

Challenges in preprocessing the data:

Mapping multiple-choice responses to numerical values, for example “very trustworthy” vs “not to be trusted”

Aggregating results from multiple people into a single description of a face

Handling missing values
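The three challenges can be sketched together in Python. Everything here is assumed for illustration: the slides name only the two extreme labels, so the middle labels, the 1 to 4 scale, and the sample responses are hypothetical.

```python
from statistics import mean

# Assumed numeric scale; only the first and last labels appear in the slides.
TRUST_SCALE = {
    "not to be trusted": 1,
    "somewhat untrustworthy": 2,
    "somewhat trustworthy": 3,
    "very trustworthy": 4,
}

# Hypothetical responses keyed by face ID.
responses = {
    101: ["very trustworthy", "somewhat trustworthy", "very trustworthy"],
    102: ["not to be trusted", "no idea"],   # one unrecognized answer
    103: [],                                  # nobody judged this face
}

def aggregate(labels):
    """Average the recognized labels; None marks a missing value."""
    scores = [TRUST_SCALE[label] for label in labels if label in TRUST_SCALE]
    return mean(scores) if scores else None

per_face = {face: aggregate(labels) for face, labels in responses.items()}
print(per_face)
```

Using the mean is one possible aggregation; a median or a majority vote would be equally valid choices for ordinal answers.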

Page 9: Superficial data analysis

Expected data

Page 10: Superficial data analysis

Toolkits for Data Analysis

Page 11: Superficial data analysis

Age correlations

Page 12: Superficial data analysis

Further Investigation

Distribution of age values: outliers

Remove the outliers: select rows with age less than 100

Figure 17-3: Initial histogram of face age data

Page 13: Superficial data analysis

Further Investigation

Figure 17-4: Histogram of cleaned face age data

Page 14: Superficial data analysis

Age, Attractiveness and Gender

Figure 17-5: Scatterplot of attractiveness versus age, colored by gender.

Pink: female

Blue: male

Page 15: Superficial data analysis

Age, Attractiveness and Gender

Figure 17-6: Smoothed scatterplots for attractiveness versus age, one plot per gender.

Page 16: Superficial data analysis

Age, Attractiveness and Gender

“How does age affect attractiveness?”

• We compute 95% confidence intervals.

• We fit a loess curve to help visualize aggregate patterns in this noisy sequential data.
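The confidence-interval step can be sketched in Python as below. This is an assumption-laden illustration: the scores are made up, and the interval uses the normal approximation (mean ± 1.96 standard errors), which may differ from the method behind the slides. The loess fit is not shown because it needs a statistics library.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical attractiveness scores grouped by perceived age.
scores_by_age = {
    18: [2.0, 3.0, 2.5, 3.5],
    27: [3.0, 4.0, 3.5, 3.5],
}

def confidence_interval_95(values):
    """Normal-approximation interval: mean +/- 1.96 standard errors."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return (m - half_width, m + half_width)

intervals = {age: confidence_interval_95(v) for age, v in scores_by_age.items()}
for age, (low, high) in intervals.items():
    print(age, round(low, 2), round(high, 2))
```

With groups as small as these a t-based interval would be the safer choice; the normal approximation is reasonable once each age has many judgments.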

Page 17: Superficial data analysis

Age, Attractiveness and Gender

“How does age affect attractiveness?”

Figure 17-7: Smoothed scatterplots for attractiveness versus age, one plot per gender.

Page 18: Superficial data analysis

Age, Attractiveness and Gender

“How does age affect attractiveness?”

• Women are generally judged as more attractive than men across all ages except babies.

• Babies are found to be most attractive, but attractiveness drops until around age 18, after which it rises and peaks around age 27. After that, attractiveness drops until around age 50.

Page 24: Superficial data analysis

Attribute Correlations

What about the other attributes?

Intelligence

Weight

Trustworthy

Outfit

Wealth

Page 26: Superficial data analysis

Attribute Correlations

We can repeat the same steps for each attribute ...

Or ...

We can put everything into one big picture.

Page 27: Superficial data analysis

Attribute Correlations

We use the R language to build a Pearson correlation matrix.
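For readers who want to see what such a matrix contains, here is a small Python sketch. It is not the R code used for the slides: the per-face values are hypothetical, and `pearson` is a helper written for this example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical per-face averages, one column per attribute.
faces = {
    "age": [20, 30, 40, 50],
    "attractiveness": [4.0, 3.5, 3.0, 2.5],
    "intelligence": [3.0, 3.2, 3.1, 3.4],
}

# Correlate every attribute with every other attribute.
names = list(faces)
matrix = {a: {b: pearson(faces[a], faces[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle carries information.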

Page 31: Superficial data analysis

Attribute Correlations

Women are judged more intelligent than men.

Women are judged more likely to win a dog fight.

Dress size is weakly correlated with weight.

Page 32: Superficial data analysis

LOOKING AT THE TAGS

Page 33: Superficial data analysis

Describe me in one word : ………………………………………..

Page 35: Superficial data analysis

Describe me in one word: ……… FREE-FORM TAGS ………

Free-form tags are complicated to process.

Page 36: Superficial data analysis

Describe me in one word : Cute !

Page 37: Superficial data analysis

Describe me in one word : Pls, call me - 091231512

Page 38: Superficial data analysis

Describe me in one word : abc xyz aK&*$(#k,,fh..

Page 39: Superficial data analysis

THE TAGS

Let’s use the R language to examine the tags!

Page 40: Superficial data analysis

Most common tags?

THE TAGS

Page 41: Superficial data analysis

Least common tags?

THE TAGS

Page 44: Superficial data analysis

THE TAGS

Can “cute” and “Cute” be merged?

“hot” and “HOT!!!” have different semantic content!

Unknown language!?
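One conservative way to act on these observations is sketched below in Python. The rule is an assumption made for this example: fold case and trim whitespace so that “cute” and “Cute” merge, but keep punctuation so that “hot” and “HOT!!!” stay apart. The sample tags are hypothetical.

```python
from collections import Counter

# Hypothetical raw tags as typed by users.
raw_tags = ["cute", "Cute", " cute ", "hot", "HOT!!!"]

def normalize(tag):
    """Fold case and trim whitespace; punctuation is deliberately kept."""
    return tag.strip().lower()

tag_counts = Counter(normalize(tag) for tag in raw_tags)
print(tag_counts.most_common())
```

Lowercasing does discard the shouting in “HOT!!!”, so if capitalization is considered part of the meaning, trimming whitespace alone would be the safer rule.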

Page 46: Superficial data analysis

THE TAGS

290,000 unique tags out of 2.4 million total.

The top 1,000 unique tags have 1.4 million occurrences.
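A quick Python calculation makes the skew in these numbers explicit. It uses only the three figures quoted on the slide; the variable names are ours.

```python
total_tags = 2_400_000            # all tag occurrences
unique_tags = 290_000             # distinct tag strings
top_1000_occurrences = 1_400_000  # occurrences of the 1,000 most common tags

# Share of all tag uses covered by the top 1,000 tags.
coverage = top_1000_occurrences / total_tags

# Share of the distinct tags that those 1,000 represent.
share_of_vocabulary = 1000 / unique_tags

print(f"coverage: {coverage:.1%}, share of unique tags: {share_of_vocabulary:.2%}")
```

In other words, well under 1% of the distinct tags account for more than half of all tag uses, so working with only the top tags loses little.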

Page 49: Superficial data analysis

THE TAGS

How do the tags fit in with the rest of our data?

Page 51: Superficial data analysis

WHICH WORDS ARE GENDERED?

Page 60: Superficial data analysis

GENDERED WORDS

Which description tags are most characteristic of male or female?

Examples: handsome, makeup, shopping, gamer

Page 61: Superficial data analysis

How do we do it?

Count the words that occur most often for men or for women?

Page 63: Superficial data analysis

GENDERED WORDS

Score tags by the ratio of occurrences between genders:

how characteristic a tag T is for gender G
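The exact scoring formula is not part of this transcript, so the Python sketch below shows one plausible ratio score under stated assumptions: each tag's rate within a gender is divided by its rate within the other gender, with add-one smoothing so that a tag unseen for one gender does not cause a division by zero. The tagged faces are hypothetical.

```python
from collections import Counter

# Hypothetical (gender of face, tag) pairs.
tagged_faces = [
    ("male", "handsome"), ("male", "handsome"), ("male", "gamer"),
    ("female", "makeup"), ("female", "shopping"), ("female", "makeup"),
    ("female", "handsome"),
]

counts = {"male": Counter(), "female": Counter()}
for gender, tag in tagged_faces:
    counts[gender][tag] += 1

def score(tag, gender, smoothing=1.0):
    """Assumed score: smoothed rate of the tag for `gender` over the other gender's rate."""
    other = "female" if gender == "male" else "male"
    rate = (counts[gender][tag] + smoothing) / sum(counts[gender].values())
    other_rate = (counts[other][tag] + smoothing) / sum(counts[other].values())
    return rate / other_rate

for tag in ["handsome", "makeup", "gamer"]:
    print(tag, round(score(tag, "male"), 2))
```

A score above 1 means the tag leans towards gender G, below 1 towards the other gender. Unlike raw counts, the ratio is not dominated by tags that are simply common for everyone.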

Page 64: Superficial data analysis

GENDERED WORDS

For male

Page 65: Superficial data analysis

For female

GENDERED WORDS

Page 66: Superficial data analysis

What are the typical types of people in our data?

Cute, Loser, flirty, fratboy, Player, idiot

Page 67: Superficial data analysis

Data Mining

Supervised learning (labelled data, has a target attribute): Classification, Regression, Decision Tree, …

Unsupervised learning (unlabelled data, does not have a target attribute): Association Rules, Clustering

CLUSTERING

Page 68: Superficial data analysis

CLUSTERING

Definition: grouping together objects that are similar to each other

Applications:

• Market segmentation

• Business

• Healthcare

• Document retrieval

• Etc.

Page 69: Superficial data analysis

K-MEANS CLUSTERING

The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.

Page 70: Superficial data analysis

How does the k-means clustering algorithm work?

Page 71: Superficial data analysis

A simple example showing how the k-means algorithm works (using k = 2)

Page 72: Superficial data analysis

Step 1: Initialization. We randomly choose two centroids (k = 2), one for each cluster. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Page 73: Superficial data analysis

Step 2: Each object is assigned to its nearest centroid.

Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.

Their new centroids are then recomputed as the mean of each cluster.

Page 74: Superficial data analysis

Step 3: Using these centroids, we compute the Euclidean distance from each object to each centroid, as shown in the table.

Therefore, the new clusters are {1, 2} and {3, 4, 5, 6, 7}.

The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

Page 75: Superficial data analysis

Step 4: The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.

Therefore, there is no change in the clusters.

Thus, the algorithm halts here, and the final result consists of two clusters: {1, 2} and {3, 4, 5, 6, 7}.
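The four steps can be reproduced with a short Python sketch. One important assumption: the table of data points did not survive in this transcript, so the seven points below are chosen by us to be consistent with the clusters and centroids quoted in the steps. The function names are also ours.

```python
from math import dist

# Assumed data points, numbered 1..7 (the original table is not in the transcript).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the first centroid)."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]

def update(points, labels, k):
    """New centroid of each cluster = mean of its members (assumes no cluster is empty)."""
    centroids = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return centroids

def k_means(points, centroids):
    """Alternate assignment and update until the assignment stops changing."""
    labels = None
    while True:
        new_labels = assign(points, centroids)
        if new_labels == labels:
            return labels, centroids
        labels = new_labels
        centroids = update(points, labels, len(centroids))

# Step 1: the initial centroids from the slides.
labels, centroids = k_means(points, [(1.0, 1.0), (5.0, 7.0)])
clusters = [[i + 1 for i, label in enumerate(labels) if label == c] for c in range(2)]
print(clusters)
print(centroids)
```

With these assumed points, the run passes through {1, 2, 3} and {4, 5, 6, 7} and stops at {1, 2} and {3, 4, 5, 6, 7} with centroids (1.25, 1.5) and (3.9, 5.1), matching the steps above. Point 3 is exactly equidistant from both initial centroids, so the tie-breaking rule decides where it starts.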

Page 76: Superficial data analysis

PLOT

Page 77: Superficial data analysis

Per-face Data

Page 78: Superficial data analysis

K = 6 clusters and 8 attributes

Clusters: blue, green, red, turquoise, orange, purple

Page 82: Superficial data analysis

Conclusion

The data shows that people hold some familiar stereotypes.

Let the data speak for itself.

Page 83: Superficial data analysis

Q&A