SUPERFICIAL DATA ANALYSIS: Exploring Millions of Social Stereotypes
Presenters: Nguyen Dao Tan Bao, Cao Dinh Qui, Pham Huy Thanh
Instructor: Prof. Lothar Piepmeyer
Nov 01, 2014
How do stereotypes and our appearance influence the way we are perceived?
The answers come from analyzing a large pool of data collected from a diverse group of people.
Let us tell you the story about
FaceStat.com
from Brendan O’Connor & Lukas Biewald
We love the story
How do we perceive AGE, GENDER, INTELLIGENCE, AND ATTRACTIVENESS ?
WHAT INSIGHT CAN WE EXTRACT from millions of anonymous opinions?
Collect the data
FaceStat runs on an SQL database.
Each user judgment is saved as a (face ID, attribute, judgment) triple.
The goal is to explore the relationships between different types of perceived attributes.
Example question :
“How old do I look?”
Look at the values of the age judgments, count how many times each value occurs, and order the values by this count.
About 10 million rows of data can be extracted from the database.
In the resulting listing, the first number is the frequency count and the second string is the response value.
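The counting step above can be sketched as a single grouped query. This is a minimal sketch, not FaceStat's actual code: the table name `judgments`, its column names, and the sample rows are assumptions, since the slides only say that judgments are stored as (face ID, attribute, judgment) triples in an SQL database.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE judgments (face_id INTEGER, attribute TEXT, judgment TEXT)"
)
conn.executemany(
    "INSERT INTO judgments VALUES (?, ?, ?)",
    [
        (1, "age", "25"),
        (2, "age", "25"),
        (3, "age", "30"),
        (4, "age", "25"),
        (5, "age", "30"),
        (6, "age", "old"),
        (1, "gender", "female"),
    ],
)


def value_counts(conn, attribute):
    """Return (count, value) rows for one attribute, most frequent first."""
    query = """
        SELECT COUNT(*) AS n, judgment
        FROM judgments
        WHERE attribute = ?
        GROUP BY judgment
        ORDER BY n DESC, judgment
    """
    return conn.execute(query, (attribute,)).fetchall()


age_counts = value_counts(conn, "age")
for n, value in age_counts:
    print(n, value)
```

Each printed line has the same shape as the listing described above: the frequency count first, then the response value.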
Clean up the data
Problems in the data:
“How old do I look?”
Errors: stray \r\n line endings
Outliers: rare values
Format of user responses: text instead of numbers
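A minimal sketch of how these three problems might be handled for the age question. The function name `clean_age` and the sample responses are hypothetical; the cutoff of 100 is the one the slides use later when removing outliers.

```python
def clean_age(raw, max_age=100):
    """Return the response as an int, or None if it cannot be used."""
    text = raw.strip()  # drops stray "\r\n" and surrounding spaces
    if not (text.isascii() and text.isdigit()):
        return None  # free text such as "old" is rejected
    age = int(text)
    return age if age < max_age else None  # implausible ages are dropped


# Hypothetical raw responses showing each kind of problem.
raw_responses = ["25\r\n", " 30", "old", "250", "18"]
cleaned = [clean_age(r) for r in raw_responses]
print(cleaned)
```

Returning `None` instead of raising keeps the bad rows visible, so they can be counted before they are discarded.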
Challenges in “preprocessing data”:
Mapping multiple-choice responses to numerical values: “very trustworthy” vs “not to be trusted”
Aggregating results from multiple people into a single description of a face
Missing values
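The challenges above can be illustrated with a short sketch. The numeric scale is an assumption: the slides name only the two end points, so the middle label `neutral` and all scores are hypothetical, as is the choice of the mean as the aggregate.

```python
from statistics import mean

# Hypothetical scale; only the two end points appear in the slides.
TRUST_SCALE = {
    "not to be trusted": 1,
    "neutral": 2,  # hypothetical middle choice
    "very trustworthy": 3,
}


def aggregate(responses, scale):
    """Average the mapped judgments per face; unmapped responses are skipped."""
    per_face = {}
    for face_id, response in responses:
        score = scale.get(response.strip().lower())
        if score is not None:
            per_face.setdefault(face_id, []).append(score)
    return {face_id: mean(scores) for face_id, scores in per_face.items()}


# Hypothetical (face ID, response) pairs, including unusable responses.
responses = [
    (1, "very trustworthy"),
    (1, "neutral"),
    (2, "not to be trusted"),
    (2, "???"),
    (3, ""),
]
face_trust = aggregate(responses, TRUST_SCALE)
print(face_trust)
```

Face 3 has no usable judgment, so it is absent from the result; that absence is the missing value that later steps have to cope with.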
Expected data
Toolkits for Data Analysis
Age correlations
Further investigation
Distribution of age values: “outliers”
Remove the outliers: select rows with age less than 100
Figure 17-3: Initial histogram of face age data
Figure 17-4: Histogram of cleaned face age data
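A minimal sketch of the outlier filter and a text histogram of what remains. The sample ages and the bin width of 10 are assumptions for illustration; only the "age less than 100" rule comes from the slides.

```python
from collections import Counter

BIN_WIDTH = 10  # hypothetical bin size
MAX_AGE = 100   # rule from the slides: keep rows with age less than 100


def age_histogram(ages):
    """Count ages per bin after dropping values of MAX_AGE or more."""
    kept = [a for a in ages if a < MAX_AGE]
    bins = Counter((a // BIN_WIDTH) * BIN_WIDTH for a in kept)
    return dict(sorted(bins.items()))


# Hypothetical perceived ages, including two obvious outliers.
ages = [3, 18, 22, 25, 27, 31, 45, 120, 999]
hist = age_histogram(ages)
for start, count in hist.items():
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} {'#' * count}")
```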
Age, Attractiveness and Gender
Figure 17-5: Scatterplot of attractiveness versus age, colored by gender (pink: female, blue: male).
Figure 17-6: Smoothed scatterplots for attractiveness versus age, one plot per gender.
“How does age affect attractiveness?”
• We compute 95% confidence intervals.
• We fit a loess curve to help visualize aggregate patterns in this noisy sequential data.
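A sketch of the confidence-interval step, assuming one interval per age computed with the normal approximation (mean ± 1.96 standard errors); the slides do not say how the intervals were built, and the sample rows are made up. The loess fit is not reproduced here because it needs a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, z=1.96):
    """Return (mean, low, high) using a normal-approximation 95% interval."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


def attractiveness_by_age(rows):
    """Map each age to (mean, low, high) over its attractiveness scores."""
    groups = {}
    for age, score in rows:
        groups.setdefault(age, []).append(score)
    # An interval needs at least two observations, so single-row ages are skipped.
    return {
        age: mean_with_ci(scores)
        for age, scores in sorted(groups.items())
        if len(scores) > 1
    }


# Hypothetical (age, attractiveness) pairs.
rows = [(20, 2.0), (20, 3.0), (20, 4.0), (30, 1.0), (30, 3.0), (40, 2.5)]
by_age = attractiveness_by_age(rows)
for age, (m, low, high) in by_age.items():
    print(f"age {age}: mean {m:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```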
Figure 17-7: Smoothed scatterplots for attractiveness versus age, one plot per gender.
• Women are generally judged as more attractive than men at all ages except for babies.
• Babies are found to be most attractive, but attractiveness drops until around age 18, after which it rises and peaks around age 27. After that, attractiveness drops until around age 50.
Attributes Correlations
How about the others ?
Intelligence
Weight
Trustworthy
Outfit
Wealth
We could repeat the same steps for each attribute ...
Or we can put everything into one big picture.
We use the R language to build a Pearson correlation matrix.
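The presenters built the matrix in R. As an illustration of what it contains, here is a sketch in Python that computes Pearson's r for every pair of attributes. The attribute values are made up, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict with r for every pair of named columns."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names} for a in names
    }


# Hypothetical per-face averages, one list entry per face.
faces = {
    "age": [20, 30, 40, 50],
    "attractiveness": [4, 3, 2, 1],
    "intelligence": [2, 3, 3, 4],
}
matrix = correlation_matrix(faces)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one, which is how a statement such as "weakly correlated" is read off the matrix.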
Women are judged more intelligent than men
Women are judged more likely to win a dog fight
Dress size is weakly correlated with weight
LOOKING AT THE TAGS
Describe me in one word: ………FREE FORM TAGS……..
Free-form tags are complicated to process:
Describe me in one word : Cute !
Describe me in one word : Pls, call me - 091231512
Describe me in one word : abc xyz aK&*$(#k,,fh..
Let’s use the R language to examine the tags!
THE TAGS
Most common tags?
Least common tags?
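The presenters examined the tags in R. The same two questions can be sketched in Python with a frequency counter; the tag list is a made-up sample.

```python
from collections import Counter

# Hypothetical sample of free-form tags.
tags = ["cute", "hot", "cute", "nice", "cute", "hot", "fratboy"]

counts = Counter(tags)
most_common = counts.most_common(2)          # the two most frequent tags
least_common = counts.most_common()[:-3:-1]  # the two least frequent tags

print("most common:", most_common)
print("least common:", least_common)
```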
Can “cute” and “Cute” be merged?
“hot” and “HOT!!!” have different semantic content!
Unknown languages!?
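One conservative way to act on these observations, sketched under the assumption that only case and surrounding whitespace are safe to normalize. The function name `normalize` and the sample tags are hypothetical.

```python
from collections import Counter


def normalize(tag):
    """Trim whitespace and lowercase; punctuation is deliberately kept."""
    return tag.strip().lower()


# Hypothetical raw tags.
raw_tags = ["cute", "Cute", " CUTE ", "hot", "HOT!!!"]
merged = Counter(normalize(t) for t in raw_tags)
print(merged)
```

Keeping punctuation means “hot” and “HOT!!!” stay separate, in line with the point that the emphasis carries meaning. Tags in unknown languages pass through unchanged.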
290,000 unique tags out of 2.4 million total.
The top 1,000 unique tags have 1.4 million occurrences.
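Put as a share, 1.4 million of 2.4 million occurrences is roughly 58% of all tag uses coming from the top 1,000 tags. A sketch of how such a coverage figure can be computed; the function name `top_k_coverage` and the sample are hypothetical.

```python
from collections import Counter


def top_k_coverage(tags, k):
    """Fraction of all tag occurrences that belong to the k most common tags."""
    counts = Counter(tags)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(tags)


# Hypothetical sample: 4 unique tags, 10 occurrences in total.
sample = ["cute"] * 5 + ["hot"] * 3 + ["nice", "gamer"]
print(top_k_coverage(sample, 2))

# The figures quoted on the slide.
print(round(1_400_000 / 2_400_000, 2))
```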
How do the tags fit in with the rest of our data?
WHICH WORDS ARE GENDERED?
GENDERED WORDS
Which description tags are most characteristic of male or female?
Ex: handsome, makeup, shopping, gamer
How do we do it?
Count the words that occur most often for men or for women?
Score tags by the ratio of occurrences between genders: how characteristic a tag T is for gender G.
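The exact scoring formula did not survive in these slides, so the following is only one plausible reading of "ratio of occurrences between genders": the rate at which tag T is used for gender G divided by its rate for everyone else. The add-one smoothing, the function name `gender_scores`, and the sample data are all assumptions.

```python
from collections import Counter


def gender_scores(tagged_faces, gender, smoothing=1.0):
    """Score each tag by its smoothed rate for `gender` over its rate elsewhere."""
    inside, outside = Counter(), Counter()
    for g, tag in tagged_faces:
        (inside if g == gender else outside)[tag] += 1
    vocab = set(inside) | set(outside)
    total_in = sum(inside.values()) + smoothing * len(vocab)
    total_out = sum(outside.values()) + smoothing * len(vocab)
    return {
        tag: ((inside[tag] + smoothing) / total_in)
        / ((outside[tag] + smoothing) / total_out)
        for tag in vocab
    }


# Hypothetical (perceived gender, tag) pairs.
tagged = [
    ("male", "handsome"), ("male", "handsome"), ("male", "gamer"),
    ("male", "cute"),
    ("female", "makeup"), ("female", "makeup"), ("female", "shopping"),
    ("female", "cute"),
]
male_scores = gender_scores(tagged, "male")
for tag, score in sorted(male_scores.items(), key=lambda kv: -kv[1]):
    print(f"{tag:10s} {score:.2f}")
```

A score well above 1 marks a tag as characteristic of the chosen gender, a score near 1 as neutral, and a score well below 1 as characteristic of the other group. The smoothing keeps tags that never occur for one group from producing a division by zero.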
[Figures: most characteristic tags for male; most characteristic tags for female]
What are the typical types of people in our data?
Cute, Loser, flirty, fratboy, Player, idiot, …
Data Mining
• Supervised learning: labelled data, has a target attribute (Classification, Regression, Decision Tree, …)
• Unsupervised learning: unlabelled data, does not have a target attribute (Association Rules, Clustering, …)
CLUSTERING
Definition: grouping together objects that are similar to each other
Applications:
• Marketing segmentation
• Business
• Healthcare
• Document retrieval
• Etc.
K-MEANS CLUSTERING
The k-means algorithm clusters n objects into k partitions based on their attributes, where k < n.
How does the k-means clustering algorithm work?
A simple example showing the k-means algorithm (using k=2)
Step 1: Initialization. We randomly choose two centroids (k=2) for the two clusters. In this case the two centroids are m1=(1.0,1.0) and m2=(5.0,7.0).
Step 2: Each object is assigned to its nearest centroid. Thus, we obtain two clusters containing {1,2,3} and {4,5,6,7}, and their centroids are recomputed.
Step 3: Using these new centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table. The new clusters are {1,2} and {3,4,5,6,7}. The next centroids are m1=(1.25,1.5) and m2=(3.9,5.1).
Step 4: The clusters obtained are again {1,2} and {3,4,5,6,7}. There is no change in the clusters, so the algorithm halts, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
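The steps above can be sketched in a few lines. The data table did not survive in this transcript, so the seven points below are an assumption: a hypothetical data set chosen because, with the initial centroids from Step 1, it passes through the same clusters and reaches the final centroids quoted in Step 3. Ties are broken toward the lower-numbered cluster.

```python
def sq_dist(p, q):
    """Squared Euclidean distance (same ordering as the plain distance)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until nothing changes.

    This sketch assumes no cluster ever becomes empty.
    """
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster, so the algorithm halts
        assignment = new_assignment
        centroids = [
            centroid([p for p, a in zip(points, assignment) if a == j])
            for j in range(k)
        ]
    # Report clusters with 1-based object numbers, as in the slides.
    clusters = [
        [i + 1 for i, a in enumerate(assignment) if a == j] for j in range(k)
    ]
    return clusters, centroids


# Hypothetical objects 1..7 (not taken from the slides).
points = [
    (1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
    (3.5, 5.0), (4.5, 5.0), (3.5, 4.5),
]
clusters, final_centroids = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
print(clusters)
print(final_centroids)
```

Squared distances are used for the comparison because they give the same nearest centroid as Euclidean distances without the square root.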
PLOT
Per-face data
K=6 clusters and 8 attributes
Blue cluster
Green cluster
Red cluster
Turquoise cluster
Orange cluster
Purple cluster
Conclusion
The data shows people hold some familiar stereotypes.
Let the data speak for itself.
Q&A