SUPERFICIAL DATA ANALYSIS

Exploring Millions of Social Stereotypes

Presenters: Nguyen Dao Tan Bao, Cao Dinh Qui, Pham Huy Thanh

Instructor: Prof. Lothar Piepmeyer


How do stereotypes and our appearance influence the way we are perceived?

The answers were found by analyzing a large pool of data collected from a diverse group of people.

Let us tell you the story of FaceStat.com, by Brendan O’Connor & Lukas Biewald.

We love the story

How do we perceive AGE, GENDER, INTELLIGENCE, and ATTRACTIVENESS?

WHAT INSIGHT CAN WE EXTRACT from millions of anonymous opinions?

Collect the data

FaceStat runs on an SQL database.

Each user judgment is recorded and saved as a (face ID, attribute, judgment) triple.

Goal: explore the relationships between different types of perceived attributes.
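The triple storage can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table name, column names, and sample rows are assumptions, not FaceStat's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per (face ID, attribute, judgment) triple.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE judgments (face_id INTEGER, attribute TEXT, judgment TEXT)"
)

# Made-up sample rows, for illustration only.
rows = [
    (1, "age", "25"),
    (1, "age", "27"),
    (1, "attractive", "very"),
    (2, "age", "40"),
]
conn.executemany("INSERT INTO judgments VALUES (?, ?, ?)", rows)

# All judgments recorded for one attribute of one face, in insertion order.
ages_for_face_1 = [
    judgment
    for (judgment,) in conn.execute(
        "SELECT judgment FROM judgments "
        "WHERE face_id = ? AND attribute = ? ORDER BY rowid",
        (1, "age"),
    )
]
print(ages_for_face_1)
```

Storing judgments as text keeps free-form answers intact; converting them to numbers is left to a later cleaning step.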

Example question: “How old do I look?”

Look at the values of the age judgments, count how many times each value occurs, and order by this count.

We have 10 million rows of data that can be extracted from the database.

In each output row, the first field is the frequency count and the second field is the response value.
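A minimal sketch of this counting step, assuming the responses have already been extracted into a list. The sample values are made up, and the slides do not say which tool produced the original listing.

```python
from collections import Counter

# Made-up raw responses to "How old do I look?" (illustration only).
responses = ["25", "30", "25", "18", "30", "25", "twenty"]

# Count how often each response value occurs, most frequent first.
counts = Counter(responses).most_common()

# Print one row per value: frequency count first, response value second.
for value, count in counts:
    print(count, value)
```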

Clean up the data

Problems with the data for “How old do I look?”:

• Errors: stray \r\n line endings

• Outliers: rare values

• Format of user responses: text instead of numbers
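One possible way to handle the line-ending and text-instead-of-number problems is sketched below. The function name `parse_age` and the sample responses are hypothetical; outliers are deliberately left in at this stage.

```python
def parse_age(raw):
    """Return the response as an int, or None if it is not a plain number."""
    text = raw.strip()  # drops stray "\r\n" and surrounding spaces
    if text.isascii() and text.isdigit():
        return int(text)
    return None  # textual answers such as "thirty" are treated as missing


# Made-up raw responses, for illustration only.
raw_responses = ["25\r\n", " 31", "thirty", "", "250"]
ages = [parse_age(r) for r in raw_responses]
print(ages)
```

Returning None rather than guessing keeps unparseable answers visible as missing values.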

Challenges in “preprocessing data”:

• Mapping multiple-choice responses to numerical values: “very trustworthy” vs. “not to be trusted”

• Aggregating results from multiple people into a single description of a face

• Missing values
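The three challenges above can be sketched together. The numeric scale, the middle answer option, and the sample responses are all assumptions; only the two end labels come from the slides.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical numeric scale; only the two end labels appear in the slides.
TRUST_SCALE = {
    "not to be trusted": 1,
    "somewhat trustworthy": 2,  # assumed middle option
    "very trustworthy": 3,
}

# Made-up (face ID, response) pairs; None marks a missing value.
responses = [
    (1, "very trustworthy"),
    (1, "somewhat trustworthy"),
    (1, None),
    (2, "not to be trusted"),
]

# Collect the numeric scores per face, skipping missing or unknown responses.
scores = defaultdict(list)
for face_id, label in responses:
    if label in TRUST_SCALE:
        scores[face_id].append(TRUST_SCALE[label])

# One aggregate number per face: the mean of its scores.
per_face = {face_id: mean(values) for face_id, values in scores.items()}
print(per_face)
```

Using the mean is one choice among several; a median would be less sensitive to a few extreme judgments.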

Expected data

Toolkits for Data Analysis

Age correlations

Further Investigation

Distribution of age values: “outliers”

Remove the outliers: select rows with age less than 100
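The outlier filter amounts to a single selection. A minimal sketch with made-up values, where None stands for a missing age:

```python
# Made-up per-face age estimates; None marks a missing value.
ages = [25, 31, None, 250, 7, 100, 64]

# Keep rows with a known age below 100, as the slide describes.
cleaned = [age for age in ages if age is not None and age < 100]
print(cleaned)
```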

Figure 17-3: Initial histogram of face age data

Figure 17-4: Histogram of cleaned face age data

Age, Attractiveness and Gender

Figure 17-5: Scatterplot of attractiveness versus age, colored by gender (pink: female; blue: male).


Figure 17-6: Smoothed scatterplots for attractiveness versus age, one plot per gender.

“How does age affect attractiveness?”

• We compute 95% confidence intervals.

• We fit a loess curve to help visualize aggregate patterns in this noisy sequential data.
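The slides do not say how the intervals were computed. The sketch below assumes a normal approximation (mean ± 1.96 standard errors) for the mean attractiveness at each age, on made-up data; the loess fit itself is not reproduced here.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

# Made-up (age, attractiveness score) pairs, for illustration only.
observations = [(20, 3.0), (20, 4.0), (20, 5.0), (30, 2.0), (30, 4.0)]

by_age = defaultdict(list)
for age, score in observations:
    by_age[age].append(score)


def mean_ci95(values):
    """Mean and 95% interval, assuming a normal approximation (z = 1.96).

    Needs at least two values, because the sample standard deviation is used.
    """
    m = mean(values)
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


intervals = {age: mean_ci95(values) for age, values in sorted(by_age.items())}
for age, (m, low, high) in intervals.items():
    print(age, round(m, 2), round(low, 2), round(high, 2))
```

With only a handful of judgments per age, a t-based interval would be wider and arguably more appropriate.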

Figure 17-7: Smoothed scatterplots for attractiveness versus age, one plot per gender.

• Women are generally judged as more attractive than men across all ages except babies.

• Babies are found to be the most attractive, but attractiveness drops until around age 18, after which it rises and peaks around age 27. After that, attractiveness drops until around age 50.

Attribute Correlations

How about the other attributes?



Intelligence

Weight

Trustworthy

Outfit

Wealth

We could repeat the same steps for each attribute ...

Or ...

We can put everything into one big picture.


We use R to compute a Pearson correlation matrix.
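The slides use R for this step. As a language-neutral illustration, here is a Python sketch of a Pearson correlation matrix computed from the standard formula. The attribute scores are made up, and the function name `pearson` is only for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up per-face scores for three attributes (illustration only).
data = {
    "intelligence": [1.0, 2.0, 3.0, 4.0],
    "wealth": [2.0, 4.0, 6.0, 8.0],
    "weight": [4.0, 3.0, 2.0, 1.0],
}

# Correlation matrix as a nested dict: matrix[a][b] = r(a, b).
matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}
for name, row in matrix.items():
    print(name, [round(r, 2) for r in row.values()])
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 a weak one.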


Women are judged more intelligent than men.

Women are judged more likely to win a dog fight.

Dress size is weakly correlated with weight.


LOOKING AT THE TAGS

Describe me in one word: ………………………………………..

The answers are FREE-FORM TAGS, which are complicated to process:

Describe me in one word : Cute !

Describe me in one word : Pls, call me - 091231512

Describe me in one word : abc xyz aK&*$(#k,,fh..

Let’s use R to examine the tags!

THE TAGS

Most common tags?

Least common tags?

Can “cute” and “Cute” be merged?

“hot” and “HOT!!!” have different semantic content!

Unknown language!?
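The slides examine the tags in R. The Python sketch below illustrates the same questions on made-up tags: it merges tags that differ only in letter case or surrounding whitespace, and it deliberately keeps punctuation so that emphatic variants stay separate. Whether to merge case at all is a judgment call.

```python
from collections import Counter

# Made-up tags, for illustration only.
tags = ["cute", "Cute", "hot", "HOT!!!", "cute ", "nerd"]

# Merge tags that differ only in letter case or surrounding whitespace.
# Punctuation is kept, so "hot" and "hot!!!" remain separate tags.
normalized = [tag.strip().lower() for tag in tags]
counts = Counter(normalized)

most_common = counts.most_common(2)       # highest counts first
least_common = counts.most_common()[-2:]  # lowest counts last
print(most_common)
print(least_common)
```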

290,000 unique tags out of 2.4 million total.

The top 1,000 unique tags have 1.4 million occurrences, about 58% of the total.

How do the tags fit in with the rest of our data?

WHICH WORDS ARE GENDERED?

Which description tags are most characteristic of males or females?

GENDERED WORDS

Male or female?

Ex: handsome, makeup, shopping, gamer

How do we do it?

Count the words that occur most often for men or for women?

Score tags by the ratio of occurrences between genders:

How characteristic is a tag T for gender G?
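The exact scoring formula from the slide is not preserved in the text. The sketch below assumes one simple reading of "ratio of occurrences between genders": a tag's count for one gender divided by its count for the other, with 1 added to both counts as an assumed smoothing step. The sample data and the function name `score` are made up.

```python
from collections import Counter

# Made-up (gender, tag) pairs, for illustration only.
tagged = [
    ("male", "handsome"), ("male", "handsome"), ("male", "gamer"),
    ("male", "cute"),
    ("female", "makeup"), ("female", "makeup"), ("female", "cute"),
    ("female", "cute"), ("female", "shopping"),
]

counts = {"male": Counter(), "female": Counter()}
for gender, tag in tagged:
    counts[gender][tag] += 1


def score(tag, gender):
    """Ratio of a tag's count for one gender to its count for the other.

    Adding 1 to both counts is an assumed smoothing step that avoids
    division by zero for tags seen with only one gender.
    """
    other = "female" if gender == "male" else "male"
    return (counts[gender][tag] + 1) / (counts[other][tag] + 1)


all_tags = set(counts["male"]) | set(counts["female"])
top_male = max(all_tags, key=lambda tag: score(tag, "male"))
top_female = max(all_tags, key=lambda tag: score(tag, "female"))
print(top_male, top_female)
```

A ratio favors tags that are distinctive for a gender over tags that are merely frequent, which is why "cute" scores low for both genders here despite being common.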

Most characteristic tags for males and for females

What are the typical types of people in our data?

Cute

Loser

flirty

fratboy

Player

idiot

…

Data Mining

• Supervised learning (labelled data; has a target attribute): Classification, Regression, Decision Tree, …

• Unsupervised learning (unlabelled data; does NOT have a target attribute): Association Rules, Clustering, …

CLUSTERING

Definition: grouping together objects that are similar to each other

Applications:

• Marketing segmentation

• Business

• Healthcare

• Document retrieval

• Etc.

K-MEANS CLUSTERING

The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, where k < n.

http://en.wikipedia.org/wiki/Data_clustering

http://en.wikipedia.org/wiki/Partition_of_a_set

How does the k-means clustering algorithm work?

A simple example showing the k-means algorithm in action (using K=2)

Step 1: Initialization: we randomly choose two centroids (k=2), one for each cluster. In this case the 2 centroids are: m1=(1.0,1.0) and m2=(5.0,7.0).

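Step 2: We compute the Euclidean distance from each object to both centroids and assign each object to the nearer one.

Thus, we obtain two clusters containing:

{1,2,3} and {4,5,6,7}.

Their new centroids are: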

Step 3:

Now, using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.

Therefore, the new clusters are:

{1,2} and {3,4,5,6,7}

Next centroids are: m1=(1.25,1.5) and m2 = (3.9,5.1)

Step 4 :

The clusters obtained are:

{1,2} and {3,4,5,6,7}

Therefore, there is no change in the clusters.

Thus, the algorithm halts here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
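The four steps can be sketched in code. The table of objects is not preserved in the text, so the seven points below are an assumption: illustrative values chosen because they reproduce the clusters and the final centroids quoted in the steps. With these points, object 3 is exactly equidistant from both initial centroids, and the sketch breaks the tie in favor of the first cluster.

```python
# Assumed data: the slide's table did not survive, so these seven points are
# illustrative values chosen to reproduce the clusters quoted in the steps.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]


def sq_dist(p, q):
    """Squared Euclidean distance (same ordering as the distance itself)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for each point (ties go to the first)."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def update(points, labels, k):
    """Mean of the points assigned to each cluster (assumes none is empty)."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centroids


centroids = [(1.0, 1.0), (5.0, 7.0)]  # m1 and m2 from Step 1
labels = None
while True:
    new_labels = assign(points, centroids)
    if new_labels == labels:  # no object changed cluster: stop
        break
    labels = new_labels
    centroids = update(points, labels, k=2)

# Report clusters with 1-based object numbers, as on the slides.
clusters = [[i + 1 for i, label in enumerate(labels) if label == j]
            for j in range(2)]
print(clusters, centroids)
```

The loop alternates the assignment and update steps until no object changes cluster, which is the stopping condition described in Step 4.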

PLOT

Per-face data: K=6 clusters and 8 attributes

Clusters: blue, green, red, turquoise, orange, purple

Conclusion

The data shows people hold some familiar stereotypes.

Let the data speak for itself.

Q&A
