
Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

Apr 13, 2017

Transcript
Page 1: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

TOPIC MODELING APPLIED TO TWITTER FEEDS.

Alexis Perrier

Data & Software, Berklee College of Music, Boston

Data Science contributor

@alexip

@BerkleeOnline

@ODSC

Page 2: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

Part I: Topic Modeling

- Nature and applications
- Algorithms and libraries

Part II: Project: Twitter followers

- Methods
- Problems
- Visualization

Page 3: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

Sôrry for the accents and anglicisms

Page 4: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

A quick, high-level view of a large set of documents

Unsupervised technique

- 1 document ⇔ several topics
- 1 topic ⇔ a set of words
- The proportion of each topic varies from one document to the next

Page 5: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

SEMANTIC ANALYSIS OF DOCUMENT COLLECTIONS

Various corpora

- Literature
- Newspapers
- Official documents
- Online content
- Social networks, forums, ...

Coupled with external variables

- Evolution over time
- Authors, speakers

Page 6: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

ALGORITHMS

Page 7: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

MAIN ALGORITHMS

Vector-space approach

- Latent Semantic Analysis (LSA)

Probabilistic, Bayesian approach

- Latent Dirichlet Allocation (LDA)
- Structural Topic Modeling (STM), pLSA, hLDA, ...

Neural network approach

- convnets, ...

Page 8: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

LATENT SEMANTIC ANALYSIS - LSA

- TF-IDF: relative word frequency => vectorization
- Document / word-frequency matrix
- Dimensionality reduction
- Singular Value Decomposition (SVD)

aka Latent Semantic Indexing (LSI)
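The LSA steps listed above (TF-IDF weighting, document / term matrix, truncated SVD) can be sketched as follows. This is a minimal illustration on a made-up four-document corpus, assuming NumPy is available; the corpus, the choice of k = 2 and the helper name `cosine` are assumptions for the example, not material from the talk.

```python
import numpy as np

# Toy corpus (hypothetical): two documents about code, two about music.
docs = [
    "python code software code",
    "software python docker",
    "music guitar song",
    "song music piano music",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: j for j, w in enumerate(vocab)}

# Document / term matrix of raw counts.
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(tokenized):
    for w in doc:
        counts[i, index[w]] += 1

# TF-IDF weighting: relative term frequency times log inverse document frequency.
tf = counts / counts.sum(axis=1, keepdims=True)
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df)
tfidf = tf * idf

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_coords = U[:, :k] * s[:k]   # documents in the k-dimensional latent space
term_loadings = Vt[:k]          # one row of word weights per latent dimension


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(cosine(doc_coords[0], doc_coords[1]))  # two "code" documents
print(cosine(doc_coords[0], doc_coords[2]))  # a "code" and a "music" document
```

In the reduced space, documents that share vocabulary end up close together even when their raw TF-IDF vectors overlap only partially, which is the point of the dimensionality reduction.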

Page 9: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

LATENT DIRICHLET ALLOCATION

A topic is a probability distribution over the words of a given vocabulary.

LDA: the topic distribution follows a Dirichlet distribution.

- K: number of topics
- α: number of topics per document (Dirichlet prior on the per-document topic proportions)
- β: number of words per topic (Dirichlet prior on the per-topic word distributions)

Details: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

Bayesian inference, Gibbs sampling, Chinese restaurant process
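To make the role of the Dirichlet distribution concrete, here is a minimal sketch of the generative story LDA assumes for one document, using only the standard library. The two topics, the value of α and the function names are hypothetical; in the full model the topics themselves are also drawn from a Dirichlet distribution (parameter β), whereas this sketch takes them as given.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw the document's topic proportions: theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick a word according to that topic's word distribution.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Two hypothetical topics, each a probability distribution over words.
topics = [
    {"python": 0.5, "docker": 0.3, "code": 0.2},
    {"casino": 0.6, "bonus": 0.3, "slots": 0.1},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=10, rng=rng)
print(theta, words)
```

Inference (for example Gibbs sampling) runs this story backwards: given only the words, it estimates the topics and the per-document proportions. A small α tends to concentrate each document on a few topics.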

Page 10: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

LIBRARIES

Page 11: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

LIBRARIES

Python libraries

- Gensim - Topic Modelling for Humans
- LDA Python library

R packages

a. lsa package
b. lda package
c. topicmodels package
d. stm package

Java libraries: S-Space Package, MALLET

C/C++ libraries: lda-c, hlda, ctm-c, hdp

Page 13: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

STEPS:

1. Build the corpus
2. Apply the models
3. Interpret => Perplexity!
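Perplexity in step 3 is also the name of a standard score for topic models: the exponential of the negative average log-likelihood per word, where lower is better. A minimal sketch, assuming the topics and the per-document topic proportions are already estimated; the function name, argument layout and the small probability floor are assumptions made for illustration.

```python
import math


def perplexity(docs, thetas, topics, floor=1e-12):
    """Perplexity of tokenized documents under a fitted topic model.

    docs:   list of token lists
    thetas: one list of topic proportions per document
    topics: list of {word: probability} dicts, one per topic
    floor:  lower bound on word probabilities, to avoid log(0) on unseen words
    """
    log_likelihood = 0.0
    n_tokens = 0
    for words, theta in zip(docs, thetas):
        for w in words:
            # Probability of the word: mixture of the topics' word probabilities.
            p = sum(t * topic.get(w, 0.0) for t, topic in zip(theta, topics))
            log_likelihood += math.log(max(p, floor))
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


# A single topic that is uniform over four words.
uniform = [{"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}]
print(perplexity([["a", "b", "c"]], [[1.0]], uniform))
```

A model that spreads probability uniformly over V words has perplexity V, so the score can be read as the effective number of words the model hesitates between.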

Page 14: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

BUILD THE CORPUS

1. Get the timelines of the 700 followers of @alexip (Twython)

One document corresponds to one timeline

2. Vectorize the document

- bag-of-words
- Timelines in English: lang = 'en' + langid
- NLTK: tokenize, stopwords, stemming, POS

3. TF-IDF

- Create a dictionary of words
- Vectorize the documents with TF-IDF
- Gensim, NLTK, Scikit, ....
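Steps 2 and 3 above can be sketched with the standard library alone. The talk relies on NLTK, langid and Gensim for this; the version below is a simplified stand-in with a tiny hand-written stopword list, no stemming, no POS tagging and no language detection, and the names `tokenize` and `build_tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far shorter than a real one).
STOPWORDS = {"the", "a", "and", "to", "of", "is", "in", "for", "at", "with"}


def tokenize(text):
    """Lowercase, drop URLs and @mentions, keep alphabetic tokens, remove stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]


def build_tfidf(timelines):
    """Return (dictionary, vectors) with one sparse {word_id: weight} dict per timeline."""
    bags = [Counter(tokenize(t)) for t in timelines]  # bag-of-words per document
    dictionary = {w: i for i, w in enumerate(sorted(set().union(*bags)))}
    n_docs = len(bags)
    doc_freq = Counter(w for bag in bags for w in bag)  # documents containing each word
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            dictionary[w]: (count / total) * math.log(n_docs / doc_freq[w])
            for w, count in bag.items()
        })
    return dictionary, vectors


# One string per follower: all tweets of a timeline concatenated (made-up data).
timelines = [
    "Shipping the new #python app with docker https://t.co/x",
    "Docker containers for python",
    "Free casino bonus and slots @someone",
]
dictionary, vectors = build_tfidf(timelines)
print(vectors[2])
```

With this weighting, a word that appears in every timeline gets a weight of zero, which is why TF-IDF already removes part of the noise before any model is applied.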

Page 15: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

1) APPLY LSA

Results that are difficult to interpret, to say the least

Page 16: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

2) APPLY LDA

Clearly better

u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',
u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',
u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',
u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',
u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',
u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',
u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',
u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',
u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',
u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider'

What are the topics?
How many topics?
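To work with topic strings in the `weight*word + weight*word` form shown above, a small parser helps when comparing or naming topics. This is a sketch with hypothetical helper names (`parse_topic`, `top_words`); it also strips double quotes, since some library versions print the words quoted.

```python
def parse_topic(topic_string):
    """Turn '0.055*app + 0.045*team' into [(0.055, 'app'), (0.045, 'team')]."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((float(weight), word.strip().strip('"')))
    return pairs


def top_words(topic_string, n=3):
    """Return the n highest-weighted words of a topic."""
    ranked = sorted(parse_topic(topic_string), key=lambda pair: pair[0], reverse=True)
    return [word for _, word in ranked[:n]]


# Beginning of the fifth topic listed above.
topic = "0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement"
print(top_words(topic))
```

Listing the top few words per topic side by side is usually the first step in answering "what are the topics?" by hand.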

Page 17: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

BACK TO THE CORPUS

- Clean the documents
- Extend the stopword list by hand
- Identify anomalies: bots, retweets, hashtags, ...
- Keep only the feeds that have tweeted recently.

245 timelines
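The cleaning rules above can be sketched as a simple filter. The tweet layout (a dict with 'text' and 'created_at'), the 90-day recency window and the 80% retweet threshold are all assumptions chosen for illustration; the talk does not state the exact criteria that reduced the corpus to 245 timelines.

```python
from datetime import datetime, timedelta


def clean_timelines(timelines, now, max_age_days=90, max_retweet_share=0.8):
    """Keep recently active timelines and drop retweets.

    `timelines` maps a screen name to a list of tweets; each tweet is a
    hypothetical dict with 'text' (str) and 'created_at' (datetime).
    Returns one concatenated document per kept timeline.
    """
    cutoff = now - timedelta(days=max_age_days)
    kept = {}
    for user, tweets in timelines.items():
        if not tweets:
            continue
        # Drop feeds that have not tweeted recently.
        if max(t["created_at"] for t in tweets) < cutoff:
            continue
        originals = [t for t in tweets if not t["text"].startswith("RT @")]
        # Feeds that almost only retweet look like bots: drop them.
        if 1 - len(originals) / len(tweets) > max_retweet_share:
            continue
        kept[user] = " ".join(t["text"] for t in originals)
    return kept


now = datetime(2017, 4, 13)
sample = {
    "active": [{"text": "hello python", "created_at": datetime(2017, 4, 1)}],
    "stale": [{"text": "old news", "created_at": datetime(2015, 1, 1)}],
    "rtbot": [{"text": "RT @x: spam", "created_at": datetime(2017, 4, 10)}],
}
print(clean_timelines(sample, now))
```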

Visualization - LDAvis

Page 18: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

3) STRUCTURAL TOPIC MODELING

- NLP: tokenization, stemming, stop-words, ...
- Naming the topics: several groups of words per topic (exclusivity, frequency)
- Optimal number of topics: grid search + scoring
- Influence of external variables
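The idea of naming a topic from both frequent and exclusive words can be illustrated with a deliberately simplified score: the share of a word's total probability mass that falls in one topic. This is not the stm package's own scoring, which combines frequency and exclusivity differently; the topics and function names below are hypothetical.

```python
def exclusivity(topics, k, word):
    """Share of `word`'s probability mass (summed over topics) that belongs to topic k."""
    total = sum(t.get(word, 0.0) for t in topics)
    return topics[k].get(word, 0.0) / total if total else 0.0


def label_topic(topics, k, n=2):
    """Return the n most frequent and the n most exclusive words of topic k."""
    words = list(topics[k])
    by_frequency = sorted(words, key=lambda w: topics[k][w], reverse=True)[:n]
    by_exclusivity = sorted(words, key=lambda w: exclusivity(topics, k, w), reverse=True)[:n]
    return by_frequency, by_exclusivity


# Two made-up topics sharing the generic word "check".
topics = [
    {"check": 0.4, "docker": 0.35, "containers": 0.25},
    {"check": 0.5, "casino": 0.3, "bonus": 0.2},
]
freq, excl = label_topic(topics, 0)
print(freq, excl)
```

A word such as "check" ranks high by frequency in both topics but low by exclusivity, so looking at both lists gives a more distinctive name for each topic than frequency alone.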

Page 19: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier
Page 20: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

STM: PRESIDENTIAL DEBATES

- US primaries
- 6 debates: 2 Democratic, 4 Republican
- 1 document = one speaker during one debate

Visualization - stmBrowser

Page 22: Topic modeling of Twitter followers - Paris Machine Learning meetup - Alex Perrier

Code & Data & Viz:

- https://github.com/alexperrier/datatalks/tree/master/twitter
- https://github.com/alexperrier/datatalks/tree/master/debates
- http://nbviewer.jupyter.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis_V2.ipynb
- http://alexperrier.github.io/stm-visualization/index.html

Ref:

- topic modeling: http://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
- lda: http://ai.stanford.edu/~ang/papers/nips01-lda.pdf
- pyLDAvis: https://github.com/bmabey/pyLDAvis
- stm: http://scholar.princeton.edu/files/bstewart/files/stmnips2013.pdf
- stm R: http://structuraltopicmodel.com/
- stmBrowser: https://github.com/mroberts/stmBrowser