Page 1
Raghava Mukkamala and Ravi Vatrapu
Centre for Business Data Analytics (bda.cbs.dk), Department of IT Management, Copenhagen Business School
Phone: +45-4185-2299 Email: [email protected]
Web: http://www.cbs.dk/en/staff/rrmitm
Pre-ICIS Workshop on Text Mining as a Strategy of Inquiry in Information Systems Research, Sunday, 11 December 2016, Dublin, Ireland
Page 2
Motivation
• Automated text analysis has gained prominence because it can substantially reduce the cost of analyzing large volumes of text
• New Big Social Data Analytics techniques increasingly use text analysis as part of their mainstream analysis
  • due to the ubiquitous use of social media platforms by millions of users
  • huge amounts of content (including text) accumulate continuously
• To find out users' opinions, emotions, etc. from text documents
Page 3
Big Social Data Analytics – Overall Methodology
[IEEE BigData Congress 2014], [IEEE Access 2016]
Page 4
Text Classification
• Classification: assigning text documents* to predefined categories
• Category: a set of labels modeling a domain-specific concept
  • E.g. Sentiment or Emotion¹ analysis
  • Social Influence²: Reciprocity, Commitment and Consistency, Social Proof, Authority, Liking, Scarcity
• Classifying documents into known categories:
  • Dictionary methods (or lexicon-based models)
  • Supervised learning methods
• Discovering (unknown) categories and topics
  • Topic modeling
1) Ekman, Paul. "An argument for basic emotions." Cognition & Emotion 6.3-4 (1992): 169-200.
2) Cialdini, Robert B. Influence. Vol. 3. A. Michel, 1987.
* Text document: a unit of text, which could be one or more words/sentences/paragraphs
Page 5
Dictionary Methods
• Dictionary methods:
  • Use the rate/proportion at which keywords appear in documents
  • Use a list of words with scores to determine the document's category label
    • boring: -1, disgust: -3, inspire: +2, masterpiece: +5
• Limited to categories for which dictionaries are available (Sentiment, Emotion, etc.)
• Domain specific, i.e. accuracy depends on the domain from which the words are taken
  • Usage of the word {crude} in "crude oil" vs. "a crude joke"
• Validating dictionaries is hard (when we use them, we don't know in advance how much accuracy we will get)
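The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the four scored words come from the slide, while the tokenizer, the length normalization, and the positive/negative/neutral thresholds are assumptions made for the example.

```python
import re

# Toy lexicon: the four scored words are taken from the slide.
LEXICON = {"boring": -1, "disgust": -3, "inspire": 2, "masterpiece": 5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def dictionary_score(text, lexicon=LEXICON):
    """Sum the scores of known words, normalized by document length."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0) for token in tokens) / len(tokens)


def label(text, lexicon=LEXICON):
    """Map the score to a category label (thresholds at zero are an assumption)."""
    score = dictionary_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(label("A masterpiece, if a bit boring"))
print(label("crude oil prices"))
```

Words missing from the lexicon contribute nothing, which is why a phrase such as "crude oil prices" is labeled neutral here: the result depends entirely on which domain the dictionary was built for.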
Page 6
Supervised Learning Methods
• Supervised text classification
  • Uses training sets (documents with labels) manually coded by domain experts
  • Can be used with any domain-specific model or category
  • More accurate results than dictionary-based approaches
  • Validation is easy, as these methods provide performance measures
  • Drawback: preparing training sets can be expensive
• Applications: spam detection, age/gender identification, language identification, sentiment analysis, and so on
Page 7
Ambiguity makes NLP hard
Real newspaper headlines
• Teacher Strikes Idle Kids
  • #1 The teacher is on strike, which idles the kids.
  • #2 A teacher strikes kids who are idle.
• Ban on Nude Dancing on Governor's Desk
  • #1 Ban on [Nude Dancing on Governor's Desk]
  • #2 [Ban on Nude Dancing] on Governor's Desk
• If the text contains ambiguity, classification accuracy may vary
Dan Jurafsky and Christopher Manning. Natural Language Processing (Coursera, Stanford University). https://www.coursera.org/course/nlp
Page 9
MUTATO: Architecture
[Figure: architecture of MUTATO, a multi-dimensional text classification tool built with the Natural Language Toolkit (NLTK), Gensim, Python, and ASP.Net. Recoverable labels from the diagram:]
• Input: text corpus (social data, documents, etc.; the diagram notes "Global Perspective documents: 21"), text extraction, text preprocessing
• Text mining / topic modeling: keyword analysis (keyword counts, most prominent categories), word frequency analysis (most frequent words, frequency distributions), collocation analysis (bigrams, trigrams, and n-grams), topic modeling (discovering topics and categories)
• Text classification: domain experts code a training set with labels, a classifier is trained, and the output is multi-label, domain-specific classified texts
• Performance measures: classification performance (accuracy, precision, recall, F-measure) and inter-coder agreement (inter-rater agreement, Cohen's kappa)
LDIC 2016 (Best Paper Nomination)
Page 10
Text Mining (Unsupervised)
• Keyword Analysis: keyword counts using the Natural Language Toolkit (NLTK)
• Word Frequency Analysis: frequently occurring words in a given text corpus, obtained from the term-document matrix (e.g. the top 100 most frequent words)
• Collocation Analysis: collocations are multi-word expressions that commonly co-occur in the documents
  • Provides insights about documents through bigrams, trigrams, and n-grams of words that co-occur in the documents
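As a rough illustration of word frequency and collocation analysis, the sketch below counts words and adjacent word pairs with the Python standard library only. It is an assumption-laden stand-in: the tool itself is described as using NLTK and a term-document matrix, the tokenizer and sample documents are invented for the example, and raw n-gram counts are a simpler measure than the association scores a collocation finder would normally apply.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(docs, top_n=100):
    """Return the top_n most frequent words across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts.most_common(top_n)


def ngram_counts(docs, n=2, top_n=10):
    """Count n-grams of adjacent tokens (n=2 gives bigrams, n=3 trigrams)."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        # zip over n shifted copies of the token list yields sliding windows
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(top_n)


# Hypothetical sample corpus, used only to exercise the functions.
docs = ["Crude oil prices rise", "Crude oil exports fall", "A crude joke"]
print(word_frequencies(docs, top_n=3))
print(ngram_counts(docs, n=2, top_n=3))
```

N-grams are counted within each document, so no pair spans a document boundary.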
Page 11
Topic Modeling (Unsupervised)
• To identify/discover topics and information patterns in text
• Clustering techniques group words based on similarity distances
• Tool based on the Gensim¹ library + Python
1) Gensim, topic modeling for humans. https://radimrehurek.com/gensim/
Page 12
Text Classification
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents {(d1, c1), …, (dm, cm)}
• Output:
  • a learned (trained) classifier γ: d → c, where c ∈ C
• Classifiers using the Naïve Bayes algorithm
  • Alternatives: logistic regression, support-vector machines, neural networks
• Naïve Bayes classifier
  • Based on Bayes' rule of conditional probabilities
  • Bag-of-words approach
  • Requires training sets manually coded by domain experts
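The classifier γ described above can be sketched as a small multinomial Naïve Bayes model in plain Python. This is a minimal sketch under stated assumptions, not the tool's code: the slides do not say which Naïve Bayes variant or smoothing is used, so add-one (Laplace) smoothing, the tokenizer, and the tiny training set are all choices made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words tokenizer: lowercase alphabetic tokens (assumed)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, labeled_docs):
        """Learn class priors and per-class word counts from (text, label) pairs."""
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in labeled_docs:
            tokens = tokenize(text)
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        """Return the class with the highest log posterior for the document."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior P(c)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled training set {(d1, c1), ..., (dm, cm)}.
train = [
    ("an inspiring masterpiece", "positive"),
    ("truly a masterpiece", "positive"),
    ("boring and slow", "negative"),
    ("a boring mess", "negative"),
]
clf = NaiveBayes().fit(train)
print(clf.predict("what a masterpiece"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a word unseen in one class from zeroing out that class entirely.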
Page 13
Training Sets – Manual Coding
• Systematic approach for manual content analysis suggested by Rebecca Morris [1]
• Reliability: Cohen's kappa value:
  • po = 0.16 + 0.31 + 0.41 = 0.88
  • pc = (0.20 × 0.21) + (0.37 × 0.34) + (0.43 × 0.45) ≈ 0.361
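Cohen's kappa combines the observed agreement po and the chance agreement pc as κ = (po − pc) / (1 − pc). The sketch below completes the arithmetic from the figures listed above; the function name and argument layout are hypothetical, and the input values are simply the slide's agreement proportions and coder marginals.

```python
def cohens_kappa(agreement_diagonal, marginals_a, marginals_b):
    """Compute observed agreement, chance agreement, and Cohen's kappa.

    agreement_diagonal: proportion of items both coders put in each category
    marginals_a, marginals_b: each coder's overall proportion per category
    """
    p_o = sum(agreement_diagonal)
    p_c = sum(a * b for a, b in zip(marginals_a, marginals_b))
    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, p_c, kappa


# Values as listed on the slide.
p_o, p_c, kappa = cohens_kappa(
    [0.16, 0.31, 0.41],
    [0.20, 0.37, 0.43],
    [0.21, 0.34, 0.45],
)
print(f"po = {p_o:.2f}, pc = {p_c:.3f}, kappa = {kappa:.2f}")
```

With these inputs the chance agreement is about 0.36 and kappa comes out at roughly 0.81.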
1. R. Morris, "Computerized content analysis in management research: A demonstration of advantages & limitations," Journal of Management, vol. 20, no. 4, pp. 903–931, 1994.
Page 14
Model Training Tool
• https://textmining.cbs.dk/TextClassification/ClsssifyTextModels.aspx
Page 15
Domain-Specific Classifier #01: Marketing
“Heres an idea. If you like their food eat there. If you dont like their food eat somewhere else or make your own meal. I really dont understand what the big deal is.”
[Figure labels: User Consumer; Organisation; Social Influence]
Page 16
Domain-Specific Classifier #02: Operations Research
“Brazilian highway transport showcases a series of positive features such as flexibility, availability, and speed. However, when compared to other modes, it bears limitations such as low productivity, low energy efficiency, and low safety indices.”
Page 17
Domain-Specific Classifier #03: Public Health
“What this post is saying: Some obese people don't suffer from Type 2 Diabetes.. What this post isn't saying: Obesity doesn't cause Type 2 Diabetes.. You can be healthy and obese.”
Page 18
Classifier
• Using the Natural Language Toolkit and Python
• Custom Python script (~1000 lines) used for training & classification of the texts
• MUTATO 1.0 automates the whole process as a tool
Page 19
Performance Measures
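The classification measures named in the MUTATO architecture (accuracy, precision, recall, F-measure) can be computed from gold labels and predicted labels. The sketch below is a generic single-class version, assuming a balanced F-measure (F1); the function name, label values, and sample data are hypothetical and do not reproduce the tool's reported results.

```python
def performance_measures(gold, predicted, positive):
    """Accuracy over all labels, plus precision/recall/F1 for one target class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels from a coder (gold) and a classifier (predicted).
gold = ["pos", "pos", "pos", "neg"]
predicted = ["pos", "pos", "neg", "neg"]
print(performance_measures(gold, predicted, positive="pos"))
```

For a multi-label setting, the same function can be called once per category and the results averaged.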
Page 20
Tool Statistics
• Languages supported: English, Danish, Norwegian, Swedish, [Finnish]
• Classification done for
  • 20 BDA/BSDA student projects with a variety of datasets: H&M, Danske Bank, Volkswagen crisis, Skavlan talk show, TV2 Norway, etc.
  • ~10 Master's thesis projects: Patient-journey, Jabra-Classification, Skat data, SAS vs Norwegian Airlines, Transportation Logistics, Couchsurfing FB data
  • Research articles: 12
Page 21
Research Publications: Text Analytics
IEEE EDOC 2014, IEEE BigData 2016, IEEE BigData 2016
Page 22
Research Publications: Social Media Crisis
IEEE BigData Congress 2015, IEEE EDOC 2015, IEEE Access Journal
Page 23
Research Publications: Crowdfunding & Crowdsourcing
ACM Mindtrek 2016, HICSS 2016
Page 24
Research Publications: Public Health
ICTH 2016, IEEE HealthCom 2016
Page 25
Research Publications: Operations Research
LDIC 2016 (Best Paper Nomination), WCTR 2016
Page 26
Future Research
• Text summarization techniques for asynchronous communication
  • Danish, Norwegian, and Finnish languages
  • Discourse analysis for asynchronous communication (such as blogs, social media)
  • Based on hidden Markov models and graph optimization techniques
  • Using intra-sentential rhetorical parse trees and aspect-based discourse trees
Thank You
[email protected], [email protected]