Twitter Author Prediction from Tweets using Bayesian Network Hendy Irawan 23214344 TMDG 9 – Electrical Engineering - STEI ITB

Twitter Author Prediction from Tweets using Bayesian Network

Jul 20, 2015


Hendy Irawan
Transcript
Page 1: Twitter Author Prediction from Tweets using Bayesian Network

Twitter Author Prediction from Tweets using Bayesian Network

Hendy Irawan

23214344

TMDG 9 – Electrical Engineering - STEI ITB

Page 2: Twitter Author Prediction from Tweets using Bayesian Network

Can We Predict the Author from a Tweet?

Most authors have a distinct writing style

... And unique topics to talk about

... And a signature distribution of words used to tweet

Can we train a Bayesian network so that the occurrence of words in a tweet can be used to infer the author of that tweet?

In summary: YES!

Disclaimer: Accuracy varies

In a test with @dakwatuna vs @farhatabbaslaw (very different tweet topics), 100% prediction accuracy was achieved
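
The question on this slide amounts to a posterior computation. As a sketch only, assume the author node A is the single parent of one binary node per vocabulary word (a naive-Bayes-shaped network; the transcript does not show the actual graph, so this structure is an assumption). Given the observed word states w_1, ..., w_n (found or missing), the posterior over authors would then be:

```latex
P(A = a \mid w_1, \dots, w_n)
  = \frac{P(a) \prod_{i=1}^{n} P(w_i \mid a)}
         {\sum_{a'} P(a') \prod_{i=1}^{n} P(w_i \mid a')}
```

Under that assumption, the predicted author is the value of a with the highest posterior.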

Page 3: Twitter Author Prediction from Tweets using Bayesian Network

Analysis & Implementation Plan

Visualize Word Distribution in Tweets with Word Clouds

Using R Statistical Language in RStudio

Implement in Java

Natural Language Preprocessing

Train Bayesian Network

Predict Tweet Author

Page 4: Twitter Author Prediction from Tweets using Bayesian Network

Visualize Word Distribution in Tweets with Word Clouds

Using R Statistical Language in RStudio

All documentation and sources (open source) available at: http://ceefour.github.io/r-tutorials/

Install R Packages

libcurl4-openssl-dev (system library), twitteR, httpuv, tm, wordcloud, RColorBrewer

Setup Twitter OAuth

Grab Data

Prepare Stop Words

Make A Corpus

Word Cloud

Page 5: Twitter Author Prediction from Tweets using Bayesian Network

1. Install R Packages

Page 6: Twitter Author Prediction from Tweets using Bayesian Network

2. Setup Twitter OAuth

Page 7: Twitter Author Prediction from Tweets using Bayesian Network

3. Grab Data

Page 8: Twitter Author Prediction from Tweets using Bayesian Network

4. Prepare Stop Words

Page 9: Twitter Author Prediction from Tweets using Bayesian Network

5. Make A Corpus

Page 10: Twitter Author Prediction from Tweets using Bayesian Network

6. Visualize Word Cloud: @dakwatuna

Page 11: Twitter Author Prediction from Tweets using Bayesian Network

Word Clouds (2)

@suaradotcom @kompascom

Page 12: Twitter Author Prediction from Tweets using Bayesian Network

Word Clouds (3)

@VIVAnews @liputan6dotcom

Page 13: Twitter Author Prediction from Tweets using Bayesian Network

Word Clouds (4)

@pkspiyungan @MTlovenhoney

Page 14: Twitter Author Prediction from Tweets using Bayesian Network

Word Clouds (5)

@hidcom @farhatabbaslaw

Page 15: Twitter Author Prediction from Tweets using Bayesian Network

Java Implementation

Natural Language Preprocessing

Read tweets from CSV

Lower case

Remove http(s) links

Remove punctuation symbols

Remove numbers

Canonicalize different word forms

Remove stop words

Train Bayesian Network

Predict Tweet Author

Initial experiments and dataset validation available at: http://ceefour.github.io/r-tutorials/

Java application source code (open source) available on GitHub at: https://github.com/lumenitb/nlu-sentiment

Page 16: Twitter Author Prediction from Tweets using Bayesian Network

1. Read Tweets from CSV

/**
 * Read CSV file {@code f} and put its contents into {@link #rows},
 * {@link #texts}, and {@link #origTexts}.
 * @param f
 */
public void readCsv(File f) {
    try (final CSVReader csv = new CSVReader(new FileReader(f))) {
        headerNames = csv.readNext(); // header
        rows = csv.readAll();
        texts = rows.stream().map(it -> Maps.immutableEntry(it[0], it[1]))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
        origTexts = ImmutableMap.copyOf(texts);
    } catch (Exception e) {
        throw new RuntimeException("Cannot read " + f, e);
    }
}

Page 17: Twitter Author Prediction from Tweets using Bayesian Network

2. Lower Case

/**
 * Lower case all texts.
 */
public void lowerCaseAll() {
    texts = Maps.transformValues(texts, String::toLowerCase);
}

Page 18: Twitter Author Prediction from Tweets using Bayesian Network

3. Remove Links

/**
 * Remove http(s) links from texts.
 */
public void removeLinks() {
    texts = Maps.transformValues(texts, it -> it.replaceAll("http(s?):\\/\\/(\\S+)", " "));
}

Page 19: Twitter Author Prediction from Tweets using Bayesian Network

4. Remove Punctuation Symbols

/**
 * Remove punctuation symbols from texts.
 */
public void removePunctuation() {
    texts = Maps.transformValues(texts, it -> it.replaceAll("[^a-zA-Z0-9]+", " "));
}

Page 20: Twitter Author Prediction from Tweets using Bayesian Network

5. Remove Numbers

/**
 * Remove numbers from texts.
 */
public void removeNumbers() {
    texts = Maps.transformValues(texts, it -> it.replaceAll("[0-9]+", ""));
}

Page 21: Twitter Author Prediction from Tweets using Bayesian Network

6. Canonicalize Words

/**
 * Canonicalize different word forms using {@link #CANONICAL_WORDS}.
 */
public void canonicalizeWords() {
    log.info("Canonicalize {} words for {} texts: {}",
        CANONICAL_WORDS.size(), texts.size(), CANONICAL_WORDS);
    // Java replacement strings reference capture groups as $1, $2
    // (a backslash there only escapes the next character).
    CANONICAL_WORDS.entries().forEach(entry ->
        texts = Maps.transformValues(texts, it -> it.replaceAll(
            "(\\W|^)" + Pattern.quote(entry.getValue()) + "(\\W|$)",
            "$1" + entry.getKey() + "$2")));
}

// Define contents of CANONICAL_WORDS (canonical form first, then its variants)
final ImmutableMultimap.Builder<String, String> mmb = ImmutableMultimap.builder();
mmb.putAll("yang", "yg", "yng");
mmb.putAll("dengan", "dg", "dgn");
mmb.putAll("saya", "sy");
mmb.putAll("punya", "pny");
mmb.putAll("ya", "iya");
mmb.putAll("tidak", "tak", "tdk");
mmb.putAll("jangan", "jgn", "jngn");
mmb.putAll("jika", "jika", "bila");
mmb.putAll("sudah", "udah", "sdh", "dah", "telah", "tlh");
mmb.putAll("hanya", "hny");
mmb.putAll("banyak", "byk", "bnyk");
mmb.putAll("juga", "jg");
mmb.putAll("mereka", "mrk", "mereka");
mmb.putAll("gue", "gw", "gwe", "gua", "gwa");
mmb.putAll("sebagai", "sbg", "sbgai");
mmb.putAll("silaturahim", "silaturrahim", "silaturahmi", "silaturrahmi");
mmb.putAll("shalat", "sholat", "salat", "solat");
mmb.putAll("harus", "hrs");
mmb.putAll("oleh", "olh");
mmb.putAll("tentang", "ttg", "tntg");
mmb.putAll("dalam", "dlm");
mmb.putAll("banget", "bngt", "bgt", "bingit", "bingits");
CANONICAL_WORDS = mmb.build();
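
One detail worth checking in whole-word replacements like this: Java's String.replaceAll expects capture-group references in the replacement string to be written as $1, $2, while a backslash in the replacement only escapes the character that follows it. The sketch below contrasts the two styles on a single string; the class and method names are hypothetical and not part of the original application.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementGroupDemo {
    // Replace the whole word `from` with `to`, keeping the delimiters
    // captured in groups 1 and 2 by referencing them as $1 and $2.
    static String dollarStyle(String text, String from, String to) {
        String regex = "(\\W|^)" + Pattern.quote(from) + "(\\W|$)";
        return text.replaceAll(regex, "$1" + Matcher.quoteReplacement(to) + "$2");
    }

    // Same pattern, but with backslash-style references in the replacement.
    // In Java these are NOT group references: "\\1" yields a literal "1".
    static String backslashStyle(String text, String from, String to) {
        String regex = "(\\W|^)" + Pattern.quote(from) + "(\\W|$)";
        return text.replaceAll(regex, "\\1" + to + "\\2");
    }

    public static void main(String[] args) {
        System.out.println(dollarStyle("sy yg punya", "yg", "yang"));    // sy yang punya
        System.out.println(backslashStyle("sy yg punya", "yg", "yang")); // sy1yang2punya
    }
}
```

Note that this pattern consumes the delimiter on each side of a match, so two target words separated by a single space are not both replaced in one pass; a word-boundary pattern such as \b would avoid that.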

Page 22: Twitter Author Prediction from Tweets using Bayesian Network

7. Remove Stop Words

/**
 * Remove stop words using {@link #STOP_WORDS_ID} and {@code additions}.
 * @param additions
 */
public void removeStopWords(String... additions) {
    final Sets.SetView<String> stopWords =
        Sets.union(STOP_WORDS_ID, ImmutableSet.copyOf(additions));
    log.info("Removing {} stop words for {} texts: {}",
        stopWords.size(), texts.size(), stopWords);
    // Keep the surrounding delimiters; Java references groups as $1, $2.
    stopWords.forEach(stopWord ->
        texts = Maps.transformValues(texts, it -> it.replaceAll(
            "(\\W|^)" + Pattern.quote(stopWord) + "(\\W|$)", "$1$2"))
    );
}

/**
 * Indonesian stop words.
 */
public static final Set<String> STOP_WORDS_ID = ImmutableSet.of(
    "di", "ke", "ini", "dengan", "untuk", "yang", "tak", "tidak", "gak",
    "dari", "dan", "atau", "bisa", "kita", "ada", "itu",
    "akan", "jadi", "menjadi", "tetap", "per", "bagi", "saat",
    "tapi", "bukan", "adalah", "pula", "aja", "saja",
    "kalo", "kalau", "karena", "pada", "kepada", "terhadap",
    "amp", // &amp;
    "rt" // RT:
);

Page 23: Twitter Author Prediction from Tweets using Bayesian Network

8. Split Text into Words

/**
 * Split texts into {@link #words}.
 */
public void splitWords() {
    Splitter whitespace = Splitter.on(Pattern.compile("\\s+"))
        .omitEmptyStrings().trimResults();
    words = Maps.transformValues(texts, it -> whitespace.splitToList(it));
}
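
To see the preprocessing steps act together, here is a minimal sketch that runs them on a single tweet using only the JDK. It is an illustration under assumptions, not the application's code: the class name, the shortened stop-word set, the single canonical form ("yg" to "yang"), and the sample tweet are all hypothetical, and stop words are filtered after splitting rather than by regex.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class TweetPreprocessSketch {
    // Shortened, illustrative stop-word set (the slides list many more).
    static final Set<String> STOP_WORDS = Set.of("di", "ke", "yang", "dan", "rt");

    static List<String> preprocess(String tweet) {
        String text = tweet.toLowerCase(Locale.ROOT);   // lower case
        text = text.replaceAll("https?://\\S+", " ");   // remove http(s) links
        text = text.replaceAll("[^a-z0-9]+", " ");      // remove punctuation symbols
        text = text.replaceAll("[0-9]+", "");           // remove numbers
        text = text.replaceAll("\\byg\\b", "yang");     // canonicalize one word form
        return Arrays.stream(text.trim().split("\\s+")) // split into words
                .filter(w -> !w.isEmpty())
                .filter(w -> !STOP_WORDS.contains(w))   // remove stop words
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String tweet = "RT @dakwatuna: Berita nasional yg baru di http://t.co/abc123 2015!";
        System.out.println(preprocess(tweet)); // [dakwatuna, berita, nasional, baru]
    }
}
```

The screen name survives as a token here, which suggests why the training step passes the screen name as an additional stop word.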

Page 24: Twitter Author Prediction from Tweets using Bayesian Network

Train Bayesian Network

[Figures: BN graph model; prior probabilities]

Page 25: Twitter Author Prediction from Tweets using Bayesian Network

Train Bayesian Network: Java (1)

/**
 * Creates a {@link SentimentAnalyzer} then analyzes the file {@code f},
 * with limiting words to {@code wordLimit} (based on top word frequency),
 * and additional stop words of {@code moreStopWords} (base stop words
 * are {@link SentimentAnalyzer#STOP_WORDS_ID}).
 * @param f
 * @param wordLimit
 * @param moreStopWords
 * @return
 */
protected SentimentAnalyzer analyze(File f, int wordLimit, Set<String> moreStopWords) {
    final SentimentAnalyzer sentimentAnalyzer = new SentimentAnalyzer();
    sentimentAnalyzer.readCsv(f);
    sentimentAnalyzer.lowerCaseAll();
    sentimentAnalyzer.removeLinks();
    sentimentAnalyzer.removePunctuation();
    sentimentAnalyzer.removeNumbers();
    sentimentAnalyzer.canonicalizeWords();
    sentimentAnalyzer.removeStopWords(moreStopWords.toArray(new String[] {}));
    log.info("Preprocessed text: {}",
        sentimentAnalyzer.texts.entrySet().stream().limit(10)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));

    sentimentAnalyzer.splitWords();
    log.info("Words: {}",
        sentimentAnalyzer.words.entrySet().stream().limit(10)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue)));

    final ImmutableMultiset<String> wordMultiset = Multisets.copyHighestCountFirst(
        HashMultiset.create(
            sentimentAnalyzer.words.values().stream()
                .flatMap(it -> it.stream()).collect(Collectors.toList())));
    final Map<String, Integer> wordCounts = new LinkedHashMap<>();
    // only the N most used words
    wordMultiset.elementSet().stream().limit(wordLimit)
        .forEach(it -> wordCounts.put(it, wordMultiset.count(it)));
    log.info("Word counts (orig): {}", wordCounts);

    // Normalize the twitterUser "vector" to length 1.0
    // Note that this "vector" is actually user-specific, i.e. it's not a user-independent vector
    long origSumSqrs = 0;
    for (final Integer it : wordCounts.values()) {
        origSumSqrs += it * it;
    }
    double origLength = Math.sqrt(origSumSqrs);
    final Map<String, Double> normWordCounts =
        Maps.transformValues(wordCounts, it -> it / origLength);
    log.info("Word counts (normalized): {}", normWordCounts);

    sentimentAnalyzer.normWordCounts = normWordCounts;
    return sentimentAnalyzer;
}
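
The normalization at the end of this method divides each word count by the Euclidean length of the count vector, so the squares of the resulting values sum to 1.0. A minimal, JDK-only sketch of that arithmetic follows; the class name and the two sample counts are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class UnitLengthSketch {
    // Divide every count by sqrt(sum of squared counts).
    static Map<String, Double> normalize(Map<String, Integer> counts) {
        long sumSquares = 0;
        for (int c : counts.values()) {
            sumSquares += (long) c * c;
        }
        double length = Math.sqrt(sumSquares);
        Map<String, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            normalized.put(e.getKey(), e.getValue() / length);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("nasional", 3); // hypothetical counts
        counts.put("islam", 4);
        // Length is sqrt(3*3 + 4*4) = 5
        System.out.println(normalize(counts)); // {nasional=0.6, islam=0.8}
    }
}
```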

Page 26: Twitter Author Prediction from Tweets using Bayesian Network

Train Bayesian Network: Java (2)

/**
 * Train Bayesian network {@code bn}, with help of {@link #analyze(File, int, Set)}.
 * @param bn
 * @param f
 * @param screenName
 * @return
 */
protected SentimentAnalyzer train(BayesianNetwork bn, File f, String screenName) {
    final SentimentAnalyzer analyzer = analyze(f, 100, ImmutableSet.of(screenName));
    allWords.addAll(analyzer.normWordCounts.keySet());
    for (final Map.Entry<String, Double> entry : analyzer.normWordCounts.entrySet()) {
        wordNormLengthByScreenName.put(screenName + "/" + entry.getKey(), entry.getValue());
    }
    return analyzer;
}

Page 27: Twitter Author Prediction from Tweets using Bayesian Network

Predict Twitter Author: “nasional” found

“nasional” found -> 85.37% probability of @dakwatuna

“nasional” found, “olga” missing -> 89.29% probability of @dakwatuna

Page 28: Twitter Author Prediction from Tweets using Bayesian Network

Predict Twitter Author: “olga” found

@dakwatuna never tweets about “olga”

Not even once

Therefore, BN assumes 100% probability that @farhatabbaslaw is the author
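
The 100% figure follows directly from Bayes' rule: if one author's likelihood for an observed word is zero, that author's share of the posterior is zero, whatever the prior. The sketch below shows this for two authors and one word; the class name and all probability values are hypothetical and are not taken from the trained network.

```java
class ZeroLikelihoodSketch {
    // Posterior P(author1 | word found) for a two-author problem.
    // Returns NaN if neither author can produce the word (0 / 0).
    static double posteriorAuthor1(double prior1, double pWordGivenA1, double pWordGivenA2) {
        double joint1 = prior1 * pWordGivenA1;
        double joint2 = (1.0 - prior1) * pWordGivenA2;
        return joint1 / (joint1 + joint2);
    }

    public static void main(String[] args) {
        // Hypothetical values: author 1 uses the word in 2% of tweets,
        // author 2 never used it in the training tweets.
        double p = posteriorAuthor1(0.5, 0.02, 0.0);
        System.out.println(p); // 1.0
    }
}
```

Any nonzero likelihood for the second author would give a posterior below 1.0, so the certainty here reflects the training sample rather than a guarantee about future tweets.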

Page 29: Twitter Author Prediction from Tweets using Bayesian Network

Predict Twitter Author

Initial corpus:

@dakwatuna: 3200 tweets

@farhatabbaslaw: 3172 tweets

Split into:

@dakwatuna: 1000 training tweets, 2200 test tweets

@farhatabbaslaw: 1000 training tweets, 2172 test tweets
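
A split like this can be sketched in a few lines. The transcript does not say how the 1000 training tweets were chosen, so taking the first 1000 in file order is an assumption here, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class TrainTestSplitSketch {
    // Returns [training, test]: the first trainSize items, then the rest.
    static <T> List<List<T>> split(List<T> items, int trainSize) {
        int cut = Math.min(trainSize, items.size());
        return List.of(items.subList(0, cut), items.subList(cut, items.size()));
    }

    public static void main(String[] args) {
        // Stand-in for 3200 tweets: just their indices.
        List<Integer> tweets = IntStream.range(0, 3200).boxed().collect(Collectors.toList());
        List<List<Integer>> parts = split(tweets, 1000);
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " test");
        // 1000 training, 2200 test
    }
}
```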

Page 30: Twitter Author Prediction from Tweets using Bayesian Network

Twitter Author Prediction Test: @dakwatuna

Classification of 2200 tweets took 7855 ms

~ 3.57 ms per tweet classification

100% accuracy of prediction

Page 31: Twitter Author Prediction from Tweets using Bayesian Network

Twitter Author Prediction Test: @farhatabbaslaw

Classification of 2172 tweets took 7353 ms

~ 3.38 ms per tweet classification

100% accuracy of prediction
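
The accuracy and per-tweet timing reported for these tests can be measured with a loop of roughly the following shape. This is a sketch under assumptions: the classifier is passed in as a plain function, and the keyword rule used in main is a stand-in for illustration, not the Bayesian network from the slides.

```java
import java.util.List;
import java.util.function.Function;

class AccuracySketch {
    // Fraction of tweets for which the classifier returns the expected author.
    static double accuracy(List<String> tweets, String expectedAuthor,
                           Function<String, String> classifier) {
        if (tweets.isEmpty()) {
            return Double.NaN;
        }
        long correct = tweets.stream()
                .filter(t -> expectedAuthor.equals(classifier.apply(t)))
                .count();
        return (double) correct / tweets.size();
    }

    public static void main(String[] args) {
        // Stand-in classifier (hypothetical keyword rule).
        Function<String, String> toy =
                t -> t.contains("nasional") ? "dakwatuna" : "farhatabbaslaw";
        List<String> test = List.of("berita nasional", "kabar nasional", "olga hari ini");

        long start = System.nanoTime();
        double acc = accuracy(test, "dakwatuna", toy); // 2 of 3 correct
        double msPerTweet = (System.nanoTime() - start) / 1e6 / test.size();

        System.out.println("accuracy = " + acc);
        System.out.println("ms per tweet = " + msPerTweet);
    }
}
```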

Page 32: Twitter Author Prediction from Tweets using Bayesian Network

Conclusion

Initial results are promising

Bayesian networks are able to predict the tweet author with “very good” accuracy

Note that accuracy depends largely on:

Twitter author’s writing style

Twitter author’s topics of interest

Twitter author’s distribution of words

In other words, two different authors with similar writing styles or topics will have a greater chance of “false positive” predictions