Author Gender Identification from Text
By: El Hebri Khiari 200790830
COE 589 – Digital Forensics
Due: Tuesday 25th September 2012

Feb 23, 2016

Transcript
Page 1: Author Gender Identification from Text

Author Gender Identification from Text

By: El Hebri Khiari 200790830
COE 589 – Digital Forensics
Due: Tuesday 25th September 2012

Page 2: Author Gender Identification from Text

Outline
• Introduction & Motivation
• Authorship Attribution
• Detecting Genders
• Contribution(s)
• Problem Formulation
• Data Pre-Processing
    o Reuters Newsgroup Dataset
    o Enron Email Dataset
• Feature Selection & Extraction
• Classification Techniques
• Experimental Results
• Tool Results
• Conclusion

Page 3: Author Gender Identification from Text

Introduction & Motivation
• Text is the most prevalent medium on the Internet
• Applications
    o Twitter
    o Craigslist
    o Facebook
    o Email
    o Blogs
    o Chat rooms
• Statistics
    o 2008: 33.1% increase in online crime
    o October 2009: 1.69 billion Internet users
• Motivations [??]
    o Anonymity
    o Faking gender
    o “MySpace mom”

Page 4: Author Gender Identification from Text

Introduction & Motivation cont.
• Question
    – “Given a short text document, can we identify if the author is a man or a woman?”

Page 5: Author Gender Identification from Text

Authorship Attribution
• Features
    o Stylistic tendency → Stylometric analysis
    o Over 1000 features
    o Author’s state of mind
• Statistical methods
    o Word-length distribution
    o Bayesian classifier
    o Principal Component Analysis
    o Cluster analysis
• Machine Learning
    o Decision Tree
    o Neural Networks
    o Support Vector Machine (SVM)

Page 6: Author Gender Identification from Text

Authorship Attribution cont.
• Different problem
    o Abstraction
    o Length of messages
    o Special linguistic elements (emoticons)
    o Time constraints

Page 7: Author Gender Identification from Text

Detecting Genders
• Socially-constructed gender
• Fundamental questions
    o “Do men & women inherently use different classes of language styles?”
    o “What are reliable linguistic features that indicate gender?”
• Robin Lakoff (1975)
    o Lexical, syntactic & pragmatic features
    o Specialized vocabulary, expletives, etc.

Page 8: Author Gender Identification from Text

Detecting Genders cont.
• Mary Talbot (1998)
    o Influence of social divisions
• Mulac et al. (1990), Mulac & Lundell (1994)
    o Students’ impromptu essays
    o Descriptions of photographs
    o Dyadic interactions between strangers
    o Written communication & face-to-face interaction

Page 9: Author Gender Identification from Text

Contribution(s)
• Little work on gender identification (GI) [??]
• Propose
    o Robust classifier
        - Based on content-free text messages
        - Internet text messages
    o Feature types
• Design
    o Set of measures
    o Classifiers & parameter optimization

Page 10: Author Gender Identification from Text

Problem Formulation
• Binary problem
    o Class1 if author of e is male
    o Class2 if author of e is female
• Set of features
    o Constant for same gender
    o d-dimensional vector

Page 11: Author Gender Identification from Text

Problem Formulation cont.(1)
• Classifier
    o Learn a classifier $y = f(x)$ from a set of training examples
        $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$
    o Let $X = \{x_i,\ i = 1, 2, \dots, N\}$, where $x_i$ is a d-dimensional vector
    o Let $Y = \{y_i,\ i = 1, 2, \dots, N\}$, where $y_i \in \{+1, -1\}$ indicates class1 ($-1$) or class2 ($+1$)

Page 12: Author Gender Identification from Text


Problem Formulation cont.(2)

Page 13: Author Gender Identification from Text

Dataset Pre-processing
• Two extremes
    o Newsgroup messages
        - Reuters newsgroup dataset
    o Private emails
        - Enron email dataset

Page 14: Author Gender Identification from Text

Dataset Pre-processing cont.(1)
• Reuters newsgroup dataset
    o Stories by Reuters journalists, 1996 – 1997
    o A few hundred to a thousand words each
    o Discard neutral names
    o Remove unnecessary info & XML formatting
    o Limiting quotes, 0.002 per character
    o >200 and <1000 words
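The length and quote limits above can be sketched as a simple filter. This is a minimal illustration, not the pipeline used in the study: the function names, the whitespace-based word count, and the choice of which characters count as quotes are all assumptions; only the three thresholds come from the slide.

```python
import re

MIN_WORDS = 200               # slide: ">200 words"
MAX_WORDS = 1000              # slide: "<1000 words"
MAX_QUOTES_PER_CHAR = 0.002   # slide: "0.002 per character"

# Assumption: straight and curly double quotes are the characters being limited.
QUOTE_CHARS = '"\u201c\u201d'


def word_count(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(re.findall(r"\S+", text))


def keep_story(text):
    """Return True if a story passes the length and quote-density limits."""
    n_chars = len(text)
    if n_chars == 0:
        return False
    if not (MIN_WORDS < word_count(text) < MAX_WORDS):
        return False
    n_quotes = sum(text.count(q) for q in QUOTE_CHARS)
    return n_quotes / n_chars <= MAX_QUOTES_PER_CHAR
```

For example, a 300-word story with no quotation marks is kept, while a 100-word story or one that is mostly quoted speech is dropped.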

Page 15: Author Gender Identification from Text

Dataset Pre-processing cont.(2)
• Enron email dataset
    o Emails made public by the Federal Energy Regulatory Commission
    o Integrity problems → some emails removed
    o Invalid emails
    o Final set
        - 517,431 emails
        - 150 users, 3.5 years
        - Plain text, no attachments
    o Removed headers & reply texts
    o Removed duplicated emails
    o Removed ultra-short emails
    o > 50 and < 100 words
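The cleaning steps for the email set could look roughly like the sketch below. It is only an illustration under stated assumptions: quoted replies are taken to be lines starting with ">" or text after an "-----Original Message-----" marker, duplicates are detected by hashing the case-folded, whitespace-normalized body, and `min_words=50` is one reading of the slide's word limits. None of these details are confirmed by the slides.

```python
import hashlib


def strip_reply_text(body):
    """Drop quoted reply lines and anything after an original-message marker.

    Both conventions are assumptions about how reply text appears.
    """
    kept = []
    for line in body.splitlines():
        if "-----Original Message-----" in line:
            break
        if line.lstrip().startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def clean_emails(bodies, min_words=50):
    """Remove reply text, ultra-short emails, and duplicated emails."""
    seen = set()
    cleaned = []
    for body in bodies:
        text = strip_reply_text(body)
        if len(text.split()) <= min_words:
            continue  # ultra-short email
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an email already kept
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```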

Page 16: Author Gender Identification from Text

Feature Set Selection
• Question
    o “What are good linguistic features that indicate gender?”
• Human psychology & extensive experimentation
    o Character-based
    o Word-based
    o Syntactic
    o Structure-based
    o Function words
• Total of 545 features

Page 17: Author Gender Identification from Text

Feature Set Selection cont.(1)
• Character-based features
    o 29 stylometric features
    o Widely adopted in authorship attribution
    o Examples
        - Number of white space characters
        - Number of special characters
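The two example features can be computed in a few lines. This sketch covers only those two of the 29 character-based features; treating `string.punctuation` as the set of "special characters" is an assumption, as are the function and key names.

```python
import string


def character_features(text):
    """Count whitespace and special characters in a message."""
    total = len(text)
    n_white = sum(1 for ch in text if ch.isspace())
    # Assumption: "special characters" means ASCII punctuation.
    n_special = sum(1 for ch in text if ch in string.punctuation)
    return {
        "total_chars": total,
        "whitespace": n_white,
        "special": n_special,
        # Ratios make messages of different lengths comparable.
        "whitespace_ratio": n_white / total if total else 0.0,
        "special_ratio": n_special / total if total else 0.0,
    }
```

For the message `"Hi there! :)"` this gives 12 characters, 2 white space characters and 3 special characters.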

Page 18: Author Gender Identification from Text

Feature Set Selection cont.(2)
• Word-based features
    o 33 statistical metrics
        - Vocabulary richness
        - Yule’s K measure
        - Entropy measure
    o 68 psycho-linguistic features
        - Linguistic Inquiry and Word Count (LIWC)
    o Individuals benefiting from writing
        - Positive & negative emotional words
        - Cognitive words (cause, know)
        - Switch use of pronouns
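Three of the statistical metrics named above have standard definitions and can be sketched directly. Yule's K is computed as 10^4 (Σ i² V_i − N) / N², where N is the number of word tokens and V_i the number of distinct words occurring exactly i times; entropy is the Shannon entropy of the word distribution; the type-token ratio stands in for vocabulary richness. The tokenizer and function names are assumptions, and the LIWC features are not reproduced here because they depend on the LIWC dictionary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simple assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_statistics(text):
    """Return type-token ratio, Yule's K and word entropy for a message."""
    words = tokenize(text)
    n = len(words)
    if n == 0:
        return {"type_token_ratio": 0.0, "yules_k": 0.0, "entropy": 0.0}
    freqs = Counter(words)               # word -> number of occurrences
    spectrum = Counter(freqs.values())   # i -> V_i (words occurring i times)
    sum_i2_vi = sum(i * i * v for i, v in spectrum.items())
    yules_k = 1e4 * (sum_i2_vi - n) / (n * n)
    entropy = -sum((c / n) * math.log2(c / n) for c in freqs.values())
    return {
        "type_token_ratio": len(freqs) / n,
        "yules_k": yules_k,
        "entropy": entropy,
    }
```

For the toy input `"a b a b"` the result is a type-token ratio of 0.5, Yule's K of 2500 and an entropy of 1 bit.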

Page 19: Author Gender Identification from Text


Page 20: Author Gender Identification from Text

Feature Set Selection cont.(3)
• Syntactic features
    o Sentence level
    o Regular and informal punctuation
    o Mulac (1998)
        - Women use more question marks

Page 21: Author Gender Identification from Text

Feature Set Selection cont.(4)
• Structure-based features
    o Layout
        - Paragraph length
        - Use of greetings
    o Big influence in online documents

Page 22: Author Gender Identification from Text

Feature Set Selection cont.(5)
• Function words
    o Ambiguous meaning
    o Grammatical relationships
    o Different set from word-based
        - Important role
    o 9 gender-linked features
• Women use emotionally-intensive & affective adjectives
• Men express ‘independence’ → first-person singular pronouns
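Function-word features are usually relative frequencies of a fixed word list. The sketch below shows the idea with a deliberately tiny, illustrative list; the actual function-word set and the 9 gender-linked features used in the study are not given on the slide, so `FUNCTION_WORDS`, `FIRST_PERSON_SINGULAR` and the feature names are assumptions.

```python
import re
from collections import Counter

# Illustrative subsets only; the study's word lists are not reproduced here.
FUNCTION_WORDS = ("a", "an", "the", "and", "but", "of", "in", "on", "to", "with")
FIRST_PERSON_SINGULAR = ("i", "me", "my", "mine", "myself")


def function_word_features(text):
    """Relative frequency of each function word and of first-person singular pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    features = {"fw_" + w: counts[w] / n for w in FUNCTION_WORDS}
    features["first_person_singular"] = (
        sum(counts[w] for w in FIRST_PERSON_SINGULAR) / n
    )
    return features
```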

Page 23: Author Gender Identification from Text


Page 24: Author Gender Identification from Text

Automatic Extraction
• Normalization
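The slide does not say which normalization scheme was used. One common choice, shown here purely as an assumed example, is min-max scaling of every feature column to the range [0, 1] so that counts and ratios end up on comparable scales.

```python
def min_max_normalize(rows):
    """Scale each feature column of a list of feature vectors to [0, 1].

    Assumption: min-max scaling; constant columns are mapped to 0.0.
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized
```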

Page 25: Author Gender Identification from Text

Classification Techniques
• Three classifiers
    o Bayesian-based logistic regression
    o AdaBoost decision tree
    o Support Vector Machine (SVM)

Page 26: Author Gender Identification from Text

Classification Techniques cont.(1)
• Bayesian-based logistic regression
    o Probability
    o Threshold set to 0.5
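The probability formula itself did not survive in the transcript. A minimal sketch of the standard logistic model, with the 0.5 threshold from the slide, is given below. The weight vector name `beta` and the function names are assumptions; the labels follow the problem formulation, where +1 is class2 and -1 is class1.

```python
import math


def probability_positive(beta, x):
    """P(y = +1 | x) under a logistic model with weight vector beta."""
    z = sum(b * v for b, v in zip(beta, x))
    # Numerically stable form of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(beta, x, threshold=0.5):
    """Return +1 (class2) if the probability reaches the threshold, else -1 (class1)."""
    return 1 if probability_positive(beta, x) >= threshold else -1
```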

Page 27: Author Gender Identification from Text

Classification Techniques cont.(2)
    o Avoid overfitting
    o Assume each regression coefficient follows a Normal distribution
    o Mean = 0, with its own variance
    o Assume each variance follows an exponential distribution
    o Transform into Laplace distribution
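The symbols on this slide were lost in extraction. As a hedged reconstruction, the standard scale-mixture argument is written out below with assumed notation: $\beta_j$ for a regression coefficient, $\tau_j$ for its prior variance, and $\gamma$ for the hyperparameter of the exponential distribution. Integrating out the variance turns the Normal prior into a Laplace (double exponential) prior.

```latex
\begin{align*}
\beta_j \mid \tau_j &\sim \mathcal{N}(0, \tau_j), \\
p(\tau_j \mid \gamma) &= \frac{\gamma}{2} \exp\!\left(-\frac{\gamma}{2}\,\tau_j\right), \qquad \tau_j > 0, \\
p(\beta_j \mid \gamma) &= \int_0^{\infty} \mathcal{N}(\beta_j;\, 0, \tau_j)\; p(\tau_j \mid \gamma)\; d\tau_j
  = \frac{\sqrt{\gamma}}{2} \exp\!\left(-\sqrt{\gamma}\,\lvert \beta_j \rvert\right).
\end{align*}
```

The Laplace prior concentrates mass at zero, which pushes uninformative coefficients toward zero and is what limits overfitting.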

Page 28: Author Gender Identification from Text

Classification Techniques cont.(3)
    o Assume the components of the coefficient vector are independent
    o Overall prior of the coefficient vector
    o Posterior density given dataset D

Page 29: Author Gender Identification from Text

Classification Techniques cont.(4)
    o Use log posterior
    o –l(·) is a convex function to be minimized → suitable for optimization
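The formulas for the prior, posterior and log posterior are missing from the transcript. Under assumed notation ($\beta$ for the coefficient vector, $\lambda$ for the Laplace rate, and $x_i$, $y_i \in \{+1, -1\}$, $N$, $d$ as in the problem formulation), the usual form for logistic regression with independent Laplace priors is:

```latex
\begin{align*}
p(\beta) &= \prod_{j=1}^{d} \frac{\lambda}{2} \exp\!\left(-\lambda \lvert \beta_j \rvert\right), \\
p(\beta \mid D) &\propto p(\beta) \prod_{i=1}^{N} \frac{1}{1 + \exp\!\left(-y_i\, \beta^{\top} x_i\right)}, \\
l(\beta) &= -\sum_{i=1}^{N} \ln\!\left(1 + \exp\!\left(-y_i\, \beta^{\top} x_i\right)\right)
            - \lambda \sum_{j=1}^{d} \lvert \beta_j \rvert + \text{const}.
\end{align*}
```

Both terms of $-l(\beta)$ are convex in $\beta$ (the logistic loss and the $\ell_1$ norm), so their sum is convex and any local minimum is global.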

Page 30: Author Gender Identification from Text

Classification Techniques cont.(5)

• Decision Tree
    o Flowchart-like tree structure
    o Attribute → Internal node
    o Outcome → Branch
    o Class → Terminal node
    o High variance → Overfitting
• AdaBoost
    o Solid theoretical background
    o Simple
    o Accurate predictions
    o Proven to be successful

Page 31: Author Gender Identification from Text

Classification Techniques cont.(6)

    o Assign equal weights to all training examples
    o Weights with distribution D_t at the t-th round
    o Generate weak learner h_t : X → Y
    o Test h_t, new weight distribution D_{t+1}
    o Repeat T times
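The boosting loop above can be sketched with one-level decision trees (stumps) as weak learners. This is a generic discrete AdaBoost illustration, not the configuration used in the experiments: the stump learner, the function names and the default `T=10` are assumptions.

```python
import math


def train_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] >= thr else -polarity


def adaboost(X, y, T=10):
    """Train T weighted stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # equal weights: distribution D_1
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)       # weak learner h_t
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified examples: distribution D_{t+1}.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

A call such as `adaboost([[1.0], [2.0], [3.0], [4.0]], [-1, -1, 1, 1], T=3)` returns three weighted stumps whose combined vote separates the two classes.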

Page 32: Author Gender Identification from Text

Classification Techniques cont.(7)
• Support Vector Machine
    o Linearly separable classes
    o Optimal
    o Linearly inseparable

Page 33: Author Gender Identification from Text

Classification Techniques cont.(8)
    o Non-linear problem
    o Use kernel trick
        - Linear
        - Polynomial
        - Radial basis
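The three kernels named above have standard textbook forms, sketched below. The parameter names and defaults (`degree`, `coef0`, `gamma`) are assumptions chosen for illustration; the slides do not give the values used in the experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, coef0=1.0):
    """K(u, v) = (u . v + coef0) ** degree"""
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Each function replaces the inner product in the SVM decision rule, so a linear separator in the implicit feature space corresponds to a non-linear boundary in the original feature space.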

Page 34: Author Gender Identification from Text

Experimental Results
• Feature extraction → Python
• Classifiers → MATLAB
• Each experiment repeated 10 times

Page 35: Author Gender Identification from Text

Experimental Results cont.(1)
• SVM outperforms the other classifiers (76.75% & 82.23%)
• Sharp improvements in AdaBoost
• Small changes in Bayesian logistic regression

Page 36: Author Gender Identification from Text

Experimental Results cont.(2)
• Impact of parameters
    o >50 words
    o >100 words
    o >200 words

Page 37: Author Gender Identification from Text

Experimental Results cont.(3)
• Significance of feature sets
    o >100 words
    o One feature at a time

Page 38: Author Gender Identification from Text

Experimental Results cont.(4)
• Optimization
    o 5% feature size reduction → 157 out of 545
    o Extraction time reduced from 3.77 to 1.35 seconds
    o 3.03% drop in accuracy

Page 39: Author Gender Identification from Text

Tool Results
• male 64.46%
• male 75.83%
• male 59.89%
• neutral 96.98% ??
• male 58.31%
• male 72.60%
• male 63.30%
• male 57.57%
• male 73.89%
• male 59.07%
• Actual Results: 5 male out of 10

Page 40: Author Gender Identification from Text

Conclusion
• Differences do exist between genders
• SVM outperforms
• Significant features [??]
    o Word-based features
    o Function words
    o Structural features
• Increase data set → better accuracy