Page 1
Stats 170A: Project in Data Science
Text Analysis and Classification

Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine

(Acknowledgements to Prof Mark Steyvers and Prof Sameer Singh, UCI, for various slides)
Page 2
Reading, Homework, Lectures

• Reference reading:
  – Python
    • http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

• Homework 7
  – Due by 2pm Wednesday next week

• Next Lectures
  – Today: text analysis and classification
  – Next week: discussion of projects
Page 3
Page 4
Room for improvement….
Page 5
Predicting the Sentiment of Tweets

"wow :o this was included in the playlist #awesome"                    positive
"I could not be happier with my life right now"                        positive
"HAPPY BIRTHDAY TO THE PANDERIFIC PANDA IK. @TheOrionSound"            positive
"I'm tired as hell I never get a off day during the week anymore"      negative
"The movie was really dull and stupid"                                 negative
"@washingtonpost @silvajanes How awful!!!!!!"                          negative

"Documents"                                                            Class Labels
Page 6
Predicting the Scores of Movie Reviews

"I liked this movie. Well done. Great acting and direction"            4
"Terrific movie, saw it 3 times. Loved the scenes in Mexico."          5
"Boring as hell. Who gets paid to create this stuff?"                  1
"Not one of the director's best efforts. Tom Cruise was terrible."     2

"Documents"                                                            Ratings
Page 7
Entity Recognition

"I liked this movie. Well done. Great acting and direction"            4
"Terrific movie, saw it 3 times. Loved the scenes in Mexico."          5
"Boring as hell. Who gets paid to create this stuff?"                  1
"Not one of the director's best efforts. Tom Cruise was terrible."     2

"Documents"                                                            Ratings

(e.g., entities such as "Mexico" and "Tom Cruise" in the reviews above)
Page 8
Examples of User Location Fields in Twitter

User location: Deutschland
User location: Mountain View, CA
User location: Florida, USA
User location: United States
User location: West Virginia, Appalachia
User location: Columbia Mo
User location: South Florida
User location: Germany/Berlin
User location: St Louis, MO
User location: St. Louis
User location: Central Virginia
User location: United States
User location: Sea Holme (of Norse legend) Oz
User location: California, USA
User location: USA
User location: Cambodia
User location: San Antonio, TX
User location: Between my ears
User location: Chicago, IL
Page 9

Page 10

Page 11
Basic Concepts in Text Representation
Page 12
Representing Text Documents Numerically

• How do we represent text (strings) for input to machine learning algorithms? (which generally require numeric representations)

• Define a vocabulary = set of words or terms

• Simple method: Bag of Words
  – Document represented as a vector of counts of words in the vocabulary

• More complex: Real-Valued Embeddings
  – Document (or word) represented as a real-valued vector in "embedding space"

• Even more complex: Sequential Representations
  – Sequential state-machine model for sequences of words
Page 13
Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
Page 14
Concepts and Terminology

• Document:
  – Collection of words
    • A book, a news article, a report, a Web page, an email, a tweet, etc.
  – May contain both text and metadata.
  – Examples of metadata: author name(s), date, where published, etc.

Note that the definition of a document is flexible: e.g., a book could be a single document, or each section of a book could be considered a "document".

• Corpus: a collection of documents
  – e.g., all news articles from the Los Angeles Times since 1990
  – e.g., all Wikipedia Web pages
  – e.g., all Yelp reviews for restaurants in Chicago
  – e.g., a random sample of Tweets from Dec 2017
Page 15
Concepts and Terminology

• Tokens:
  – Groupings of characters in the raw text
  – Individual words (or "word types") + possibly numbers, punctuation, etc.

• Words or Terms
  – Unique tokens (or combinations of tokens) that are meaningful
  – e.g., words such as "cat", "dog", terms such as "U.S." or "92697"
  – Can also include bigrams ("San Diego"), trigrams ("New York City"), etc.

• Vocabulary
  – The specific set of unique terms used by an algorithm or application
  – The English language has on the order of 1 million unique words
  – In a particular application we might use a vocabulary of only 10k to 50k terms
    • e.g., relevant/common words (unigrams)
    • Bigrams, trigrams, …, n-grams can also be part of the vocabulary
Page 16
Typical Pipeline for Building a Text Classifier

Document (string) → Tokenization → Stop Word Removal → Vocabulary Definition → Create Bag of Words → Build Text Classifier

Note that the steps in the pipeline, and how they are implemented (e.g., how tokenization is done), will vary from application to application depending on the problem.
Page 17
Example of a Text Analysis Pipeline (for Information Extraction)

From http://www.nltk.org/book/ch07.html
Page 18
Example

Document = 'Chapter 1: The Beginning. In the beginning, life was tough! ..........'

Tokens = {'chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', …}
(Punctuation and whitespace are usually ignored)
Page 19
Example

Document = 'Chapter 1: The Beginning. In the beginning, life was tough! ..........'

Tokens = {'chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', …}
(Punctuation and whitespace are usually ignored)

Vocabulary = {'chapter', '1', 'the', 'beginning', 'in', 'life', 'was', 'tough', …}
Page 20
Example

Document = 'Chapter 1: The Beginning. In the beginning, life was tough! ..........'

Tokens = {'chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', …}
(Punctuation and whitespace are usually ignored)

Vocabulary = {'chapter', '1', 'the', 'beginning', 'in', 'life', 'was', 'tough', …}

Bag of Words = {['chapter', 1], ['1', 1], ['the', 2], ['beginning', 2], …}
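This whole example can be reproduced with a few lines of standard-library Python (a minimal sketch; punctuation is stripped and tokens are lower-cased before counting):

import re
from collections import Counter

document = 'Chapter 1: The Beginning. In the beginning, life was tough!'

# keep only runs of letters/digits, after lower-casing
tokens = re.findall(r"[a-z0-9]+", document.lower())
# ['chapter', '1', 'the', 'beginning', 'in', 'the', 'beginning', 'life', 'was', 'tough']

vocabulary = set(tokens)          # the unique terms
bag_of_words = Counter(tokens)    # term -> count
print(bag_of_words)
# Counter({'the': 2, 'beginning': 2, 'chapter': 1, '1': 1, 'in': 1, 'life': 1, 'was': 1, 'tough': 1})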
Page 21
Tokenization

• Split up text into individual tokens (words/terms)

• Simplest approach is to ignore all punctuation and whitespace and use only unbroken strings of alphabetic characters as tokens

  Text:   If you had a magic potion I'd love to have it.
  Tokens: If  you  had  a  magic  potion  I'd  love  to  have  it
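For instance, with a regular expression (a minimal sketch of this simplest approach):

import re

text = "If you had a magic potion I'd love to have it."
tokens = re.findall(r"[A-Za-z]+", text)
print(tokens)
# ['If', 'you', 'had', 'a', 'magic', 'potion', 'I', 'd', 'love', 'to', 'have', 'it']
# Note that the naive rule splits "I'd" into 'I' and 'd' -- one of the
# tokenization issues discussed on the next slide.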
Page 22
Issues in Tokenization

• Finland's capital → Finland? Finlands? Finland's?
• what're, I'm, isn't → what are, I am, is not
• Hewlett-Packard → Hewlett Packard?
• state-of-the-art → state of the art?
• Lowercase → lower-case? lowercase? lower case?
• San Francisco → one token or two?
• m.p.h., PhD. → ??

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)
Page 23
Tokenization Software

• Instead of writing your own tokenizer with a complex set of rules, use existing software
  – e.g., the tokenizer functions in NLTK
  – e.g., the tokenizer from Stanford's natural language group

• Practical tip:
  – It's useful to keep different representations of the data to use later on, e.g.,
    • Data with the original sequence and formatting, the tokenized list, the bag of words, etc.
  – Sequential order of words is needed for detecting n-grams.
  – Punctuation can contain useful information:

      If you had a magic potion I'd love to have it . If that makes sense

    If we decide to extract n-grams later on, we know that "it" and "if" should not be combined, so it's useful to retain this information.
Page 24
Sentence Detection: Example using a Decision Tree

From https://web.stanford.edu/~jurafsky/slp3/ (Speech and Language Processing, 3rd ed., Jurafsky and Martin)
Page 25
Example: Tokenization of Tweets

There are special-purpose tokenizers for different types of text, e.g., for Tweets:

from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize

tknzr = TweetTokenizer()

# tweets is a DataFrame of tweets with a 'text' column
for i in range(10):
    print("NLTK Twitter Tokenizer: ", tknzr.tokenize(tweets.text[i]))
    print("NLTK Standard Tokenizer:", word_tokenize(tweets.text[i]))
Page 26
NLTK Twitter Tokenizer: ['RT', '@ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who', '…'] NLTK Standard Tokenizer: ['RT', '@', 'ProudResister', ':', 'Trump', 'to', 'Comey', ':', 'Let', 'Flynn', 'go', '.', 'Trump', 'to', 'Russians', ':', 'Firing', 'Nut', 'Job', 'Comey', 'took', 'great', 'pressure', 'off', 'me', '.', 'Trump', 'to', 'McCabe', ':', 'Who…']
NLTK Twitter Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https://t.co/olA35uCt89'] NLTK Standard Tokenizer: ['Mueller', 'to', 'interview', 'former', 'spokesman', 'of', 'Trump', 'legal', 'team', ':', 'source', 'https', ':', '//t.co/olA35uCt89']
NLTK Twitter Tokenizer: ['RT', '@IndivisibleTeam', ':', "Here's", 'what', 'you', 'need', 'to', 'know', 'about', '#ReleaseTheMemo', '�', '1', '⃣', 'It', '’', 's', 'not', 'really', 'a', 'memo', '.', "They're", 'talking', 'points', 'by', 'Devin', 'Nunes', '…'] NLTK Standard Tokenizer: ['RT', '@', 'IndivisibleTeam', ':', 'Here', "'s", 'what', 'you', 'need', 'to', 'know', 'about', '#', 'ReleaseTheMemo�', '1⃣', 'It’s', 'not', 'really', 'a', 'memo', '.', 'They', "'re", 'talking', 'points', 'by', 'Devin', 'Nunes…']
Sample Output from the 2 Tokenizers
Page 27
Example: Defining a Vocabulary

Raw text (a string in Python):

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

There are 14 word tokens in the string raw1 (if we ignore punctuation and spaces):
The, dog, chased, the, cat, and, the, mouse, Why, did, the, dog, do, this

The vocabulary (the unique tokens, normalized to lowercase) is:
the, dog, chased, cat, and, mouse, why, did, do, this
The vocabulary size is 10.

The counts for a bag-of-words representation are:
the (4), dog (2), chased (1), cat (1), and (1), mouse (1), why (1), did (1), do (1), this (1)

If we remove stop words we decrease our vocabulary size to 4:
dog (2), chased (1), cat (1), mouse (1)
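A minimal sketch reproducing this example, using NLTK's English stop word list (requires a one-time nltk.download('stopwords')):

import re
from collections import Counter
from nltk.corpus import stopwords

raw1 = "The dog chased the cat and the mouse. Why did the dog do this?"

tokens = re.findall(r"[a-z]+", raw1.lower())   # the 14 word tokens
counts = Counter(tokens)                       # vocabulary size 10
stops = set(stopwords.words('english'))

content_counts = {t: c for t, c in counts.items() if t not in stops}
print(content_counts)
# {'dog': 2, 'chased': 1, 'cat': 1, 'mouse': 1}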
Page 28
Defining the Vocabulary (after Tokenization)

• Vocabulary
  – Set of terms (words) used to construct the document-term matrix

• Basic approach: use single words (unigrams) as terms

• Remove very common terms (e.g., stop words)

• Remove very rare terms: e.g., remove all terms that occur in fewer than K documents in the corpus (e.g., K = 10)
  – Gets rid of misspellings, unusual names, etc.

• Can extend the term list with n-grams (see the sketch below)
  – Frequent word combinations (2-grams, 3-grams, …), e.g., "feel good", "New York City"
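These vocabulary choices map directly onto parameters of scikit-learn's CountVectorizer (a hedged sketch; the three-document corpus here is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "dogs and cats make good pets",
]

vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',   # remove very common terms
    ngram_range=(1, 2),     # use both unigrams and bigrams as terms
    min_df=1,               # on a real corpus, e.g. min_df=10 drops rare terms
)
X = vectorizer.fit_transform(docs)     # sparse document-term count matrix
print(sorted(vectorizer.vocabulary_))  # the resulting vocabulary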
Page 29
[Plot: frequency of word in English usage vs. rank of word by frequency, showing a very long "tail" of rare words]

Graph from www.prismnet.com/~dierdorf/wordfrequency.png
Page 30
Stopword Removal

• Remove words that are likely to be irrelevant to our problem
• Keep content words (typically verbs, adverbs, nouns, adjectives)

Example:
  If you had a magic potion I'd love to have it . If that makes sense
  → magic potion love makes sense   (after removing stop words)

But what about this?
  [Prince Hamlet] To be or not to be ...
  (here almost every word is a stop word)
Page 31
NLTK Stop Word List

Note: in many applications there may be additional domain-specific "stop words" that are very common and that we may want to remove since they contain little information, e.g., the term "restaurant" in reviews.
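The NLTK list itself is easy to inspect (after a one-time nltk.download('stopwords')), and can be extended with domain-specific terms:

from nltk.corpus import stopwords

stops = stopwords.words('english')
print(len(stops))   # roughly 180 words in recent NLTK releases
print(stops[:8])    # ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

# e.g., for restaurant reviews we might also drop the word "restaurant":
domain_stops = set(stops) | {'restaurant'}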
Page 32
N-grams

• Useful n-grams are groups of n words that commonly appear in sequence
  – "Computer Science"
  – "Los Angeles"
  – "New York City"

• Note that we can have both terms like "computer" and "computer science", e.g.,
  "I am studying computer science at UCI and I bought a new computer last week."

• Adding an n-gram as a term in the vocabulary may be useful in a model
  – e.g., for a restaurant review classification problem, "wait time" may be more informative than "time"
Page 33
Automatically Finding N-grams

• e.g., bigrams: keep track of all unique pairs that occur sequentially (excluding stop words):

  "I visited Microsoft Research for a computer science job interview last week."
  "*  visited Microsoft Research  *  *  computer science job interview last week."

• Rank by frequency of occurrence (or the number of docs a bigram appears in)
  – Can also rank by other metrics

• Keep the top K in the vocabulary
  – How large should K be? Depends on the application, amount of data, etc.
  – Might need to search over different values of K (e.g., using cross-validation)

• Same idea for tri-grams, 4-grams, etc.

• See also the collocations howto in NLTK: http://www.nltk.org/howto/collocations.html
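The NLTK collocation tools referenced above make this ranking easy; a minimal sketch, where the one-sentence "corpus" is just a placeholder for real data (word_tokenize requires a one-time nltk.download('punkt')):

import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

raw_corpus = ("I visited Microsoft Research for a computer science "
              "job interview last week.")   # stand-in for a real corpus

tokens = nltk.word_tokenize(raw_corpus.lower())
stops = set(stopwords.words('english'))

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_word_filter(lambda w: w in stops or not w.isalpha())  # drop stop words/punctuation
# on a real corpus, also drop rare pairs, e.g. finder.apply_freq_filter(5)

measures = BigramAssocMeasures()
print(finder.nbest(measures.raw_freq, 10))   # top bigrams by frequency
# measures.pmi is another common ranking metric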
Page 34
Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2
Page 35
Example of Bag-of-Words Matrix

      database  SQL  index  calculus  derivative  function  Label
d1          24   21      9         0           0         3      1
d2          32   10      5         0           3         0      1
d3          12   16      5         0           0         0      1
d4           6    7      2         0           0         0      1
d5          43   31     20         0           3         0      1
d6           2    0      0        18           7        16      2
d7           0    0      1        32          12         0      2
d8           3    0      0        22           4         2      2
d9           1    0      0        34          27        25      2
d10          6    0      0        17           4        23      2

Use labeled training data (supervised learning) to learn a classifier (Homework Assignment 7).
Page 36
Overview of Text Classification
Page 37
Text Classification

• Text classification has many applications
  – Spam email detection
  – Classifying news articles, e.g., Google News
  – Classifying Web pages into categories

• Data Representation
  – "Bag of words/terms" most commonly used: either counts or binary
  – Can also use other weightings and additional features (e.g., metadata)

• Classification Methods
  – Naïve Bayes: widely used baseline
    • Fast and reasonably accurate
  – Logistic Regression
    • Widely used in industry, accurate, excellent baseline
  – Neural networks and deep learning
    • Can be very accurate
    • Can require very large amounts of labeled training data
    • More complex than other methods
Page 38
Example of Document-by-Term Matrix

      predict  finance  stocks  goal  score  team  Class Label
d1          3        1       4     0      0     1            1
d2          2        2       1     0      1     0            1
d3          1        1       2     0      0     0            1
d4          1        2       0     0      0     0            1
d5          3        1       1     0      1     0            1
d6          1        0       0     1      3     1            2
d7          0        0       1     4      1     0            2
d8          2        0       0     1      2     1            2
d9          1        0       0     2      1     2            2
d10         1        0       0     1      3     1            2
Page 39
Possible Weights for a Linear Classifier

         predict  finance  stocks  goal  score  team  Class Label
Weight       0.1      2.5     1.0  -2.0   -0.2  -1.5
d1             3        1       4     0      0     1            1
d2             2        2       1     0      1     0            1
d3             1        1       2     0      0     0            1
d4             1        2       0     0      0     0            1
d5             3        1       1     0      1     0            1
d6             1        0       0     1      3     1            2
d7             0        0       1     4      1     0            2
d8             2        0       0     1      2     1            2
d9             1        0       0     2      1     2            2
d10            1        0       0     1      3     1            2

Here a document's score is the weighted sum of its term counts; positive scores suggest class 1 and negative scores class 2.
Page 40
Real Example from Yelp Data

Yelp Dataset
Number of Reviews                          706,693
Number of Reviews w/o Neutral Rating       595,468
Number of Tokens                        85,392,376
Vocabulary Size w/o Stopwords              176,114

Array Dimensions                  (595468, 176114)
Number of cells in the Array       104,870,251,352
Non-zero entries                        28,357,001
Density                                    0.0027%
Page 41
Example of a Pipeline for Document Classification

Training Documents (corpus) → Tokenization → Lists of Tokens
  → (stopword and rare-word removal) → Vocabulary
  → (frequency counts) → Bag of Words
  → Machine Learning Algorithm → Document Classifier
Page 42
Example of a Pipeline for Document Classification

Training:
  Training Documents (corpus) → Tokenization → Lists of Tokens
  → (stopword and rare-word removal) → Vocabulary
  → (frequency counts) → Bag of Words
  → Machine Learning Algorithm → Document Classifier

Prediction:
  New Document → Tokenization → Lists of Tokens → Bag of Words
  → Document Classifier → Label Prediction
Page 43
Key Steps in Document Analysis Pipelines (for Bag of Words)

• Tokenization
  – Various options (e.g., with punctuation, non-alphanumeric symbols, etc.)

• Vocabulary definition
  – N-grams, stopword removal, rare-word removal, stemming

• Feature definition
  – Binary (term present or not?)
  – Counts
  – Weighted counts, e.g., TF-IDF (see later in the slides)

• Classifier selection
  – Naïve Bayes, logistic regression, SVMs, neural networks, etc.
Page 44
TF-IDF Weighting of Features

In practice the inputs can be weighted; it can be helpful to use "TF-IDF weights" instead of raw counts.

TF(t, d) = term frequency = the number of times term t occurs in doc d

IDF(t) = inverse document frequency = log(N / number of docs containing term t)
(where N = total number of docs in the corpus)

TF-IDF(t, d) = TF(t, d) * IDF(t)

The IDF term has the effect of upweighting terms that occur in few docs.
Page 45
TF-IDF Example

N = 1000 in a corpus of news articles

Term 1: t = "city", appears in 500 documents
IDF(t) = log(1000/500) = log(2) = 1

Term 2: t = "freeway", appears in 10 documents
IDF(t) = log(1000/10) = log(100) = 6.64

(logs here are base 2; the choice of base is not important)

So occurrences of "freeway" will get upweighted by a factor of 6.64 compared to occurrences of "city".
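These numbers are easy to check in code; a minimal sketch using base-2 logs, as on this slide:

import math

N = 1000   # total docs in the corpus

def idf(n_docs_with_term):
    # base-2 log, as in the example above; the base only rescales the weights
    return math.log2(N / n_docs_with_term)

print(idf(500))   # "city":    log2(2)   = 1.0
print(idf(10))    # "freeway": log2(100) = 6.64

def tf_idf(count, n_docs_with_term):
    # TF-IDF weight for a term occurring `count` times in a document
    return count * idf(n_docs_with_term)

Note that scikit-learn's TfidfVectorizer computes a smoothed variant of this formula (natural log with add-one smoothing), so its exact weights differ slightly, but the effect is the same.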
Page 46
Classification Methods for Text

• Logistic regression is widely used for bag-of-words representations
  – The input dimensionality (vocabulary size) is high (e.g., 5k to 500k), so regularization is useful

• Recurrent neural networks are widely used for sequential models (see later)
  – More complex, but can pick up sequential information
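A hedged end-to-end sketch with scikit-learn: TF-IDF features feeding a regularized logistic regression. The four labeled "documents" here are toy data invented for illustration.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = [
    "loved the acting, great movie, well done",
    "dull and stupid, terrible script",
    "terrific movie, saw it three times",
    "boring, not one of the director's best efforts",
]
train_labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),           # on real data: stop_words='english', min_df=5, ...
    ('logreg', LogisticRegression(C=1.0)),  # C controls the regularization strength
])
clf.fit(train_docs, train_labels)
print(clf.predict(["what a great, terrific movie"]))   # expected: [1]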
Page 47
From: http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html
Page 48
From: http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html
Page 49
From: http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html
Page 50
From: https://github.com/TeamHG-Memex/eli5
Page 51
Sentiment Analysis
Page 52
Very useful tutorial, many practical tips and ideas

Online at http://sentiment.christopherpotts.net/
Page 53
Example: Predicting if Review Text is Positive or Negative

• Two basic approaches
  – Use lexicons of positive and negative words
  – Use labeled data to learn a classification model
  (can also combine both approaches)

• Challenges, e.g., how to handle negation?
  – Positive reviews:
    • "This movie is not depressing the way I thought it would be. I don't think I have anything negative to say about it."
    • "I wasn't disappointed at all with the acting – in fact the acting was excellent."
  – Negative reviews:
    • "Sitting through this movie was not something I enjoyed doing"
    • "This wasn't a very good script. The director is usually excellent (some really great movies over the years that I really enjoyed), but this leaves me cold"
Page 54
Negation Marking (during Tokenization)

• Idea: append a _NEG suffix to every word appearing between a negation and a clause-level punctuation mark

From http://sentiment.christopherpotts.net/lingstruc.html
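A minimal Python sketch of this idea on a list of tokens (a simplified version of the approach in Potts's tutorial linked above; the negation pattern and punctuation set here are abbreviated):

import re

NEGATION = re.compile(r"^(?:not|no|never|n't|\w+n't)$", re.IGNORECASE)
CLAUSE_PUNCT = re.compile(r"^[.:;!?]$")

def mark_negation(tokens):
    marked, in_scope = [], False
    for tok in tokens:
        if CLAUSE_PUNCT.match(tok):
            in_scope = False          # clause-level punctuation ends the scope
            marked.append(tok)
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
            if NEGATION.match(tok):   # negation opens the scope
                in_scope = True
    return marked

print(mark_negation("I don't think I have anything negative to say .".split()))
# ['I', "don't", 'think_NEG', 'I_NEG', 'have_NEG', 'anything_NEG',
#  'negative_NEG', 'to_NEG', 'say_NEG', '.']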
Page 55
Examples of Negation Marking

From http://sentiment.christopherpotts.net/lingstruc.html
Page 56
Improvements in Classification Accuracy

From http://sentiment.christopherpotts.net/lingstruc.html
Page 57
Sentiment Lexicons (Dictionaries)

• Counts, percentages, weights of words in various categories
  – e.g., the degree to which a word tends to be positive or negative

• Example Lexicons
  – General Inquirer (http://www.wjh.harvard.edu/~inquirer)
    • Words categorized according to Positive/Negative, Strong vs Weak, Active vs Passive, etc.
  – SentiWordNet (http://sentiwordnet.isti.cnr.it/)
    • Synsets in WordNet 3.0 annotated for degrees of positivity, negativity, and neutrality/objectiveness
  – Linguistic Inquiry and Word Count (http://www.liwc.net/)
  – NRC Word-Emotion Association Lexicon (next slide)
Page 58

Page 59
Yelp Challenge Data Set
Page 60
Jupyter Notebook Example with Yelp Dataset
Page 61
Vector Representations for Text
Page 62
Vector Representations allow Generalization

• Example:
  Cat:    [1.5, 1.5, -0.2, 0.0, 0.0, 0.0]
  Kitten: [1.4, 1.5, -0.2, 0.0, 0.0, 0.0]
  Pet:    [2.0, 1.6, -0.3, 0.0, 0.0, 0.0]

• If the word "cat" occurs in a document then we know that other words like "kitten" and "pet" are similar

• Why is this useful in learning?
  – Consider a classifier based on weights (e.g., logistic regression, neural network)
  – Traditional approach: each word has its own separate weight
  – If we represent words by their vectors, and learn weights on the vectors, then words that are close together will produce similar outputs
  – This can help with generalization
Page 63

Page 64

Page 65
Another Example of Word Embedding
Page 66
Publicly-Available Pre-Trained Word Embeddings

• Word2Vec: https://code.google.com/archive/p/word2vec/
  – Google News dataset: 3M vocab (words and phrases), from 100B tokens
  – Entity vectors: 1.4M vectors for entities, trained on 100B words from news articles, with entity names from Freebase

• GloVe: http://nlp.stanford.edu/projects/glove/
  – Wikipedia: 400k vocab, 50d to 300d vectors, based on 6B tokens
  – Common Crawl: 2.2M vocab, 300d vectors, from 840B tokens
  – Twitter: 1.2M vocab, 25d to 200d vectors, based on 2B tweets

• Note: you may want to consider using these in your projects
  – Easier and faster than training your own embedding model
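If you use the Word2Vec vectors above, one way to load them is with the gensim library (a sketch; the file name is a placeholder for wherever you saved the multi-gigabyte download):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(vectors['cat'].shape)                  # a 300-dimensional vector
print(vectors.most_similar('cat', topn=5))   # nearest words by cosine similarity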
Page 67
Cosine Distance between two Vectors

[Figure: three vectors x, y, z plotted against Component 1 and Component 2]

In this example, y and z are more similar under cosine similarity than y and x, because the angle between y and z is much smaller.

Cosine distance = 1 - cosine similarity. Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0°) = 1 and cos(90°) = 0.
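A minimal NumPy sketch of these quantities, using made-up 2-d vectors in the spirit of the figure:

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
z = np.array([0.9, 1.1])

print(cosine_similarity(y, z))      # ~0.995: small angle, very similar
print(cosine_similarity(y, x))      # ~0.707: 45-degree angle, less similar
print(1 - cosine_similarity(y, z))  # cosine distance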
Page 68
Examples of Similarity between Words

Examples from https://code.google.com/archive/p/word2vec/

[Tables: cosine similarity of words to "France" and to "San Francisco"]

Cosine similarity measures the cosine of the angle between 2 vectors, where cos(0°) = 1 and cos(90°) = 0.
Page 69
Most Similar Vectors to 'cat' from word2vec 300d embeddings

Rank  Word     Cosine Similarity
1     cat      1.00
2     cats     0.81
3     dog      0.76
4     kitten   0.75
5     feline   0.73
6     beagle   0.72
7     puppy    0.71
8     pup      0.69
9     pet      0.69
10    felines  0.68
Page 70
Most Similar Vectors to 'cat' from word2vec 300d embeddings

Rank  Word              Cosine Similarity
1     cat               1.00
2     cats              0.81
3     dog               0.76
4     kitten            0.75
5     feline            0.73
6     beagle            0.72
7     puppy             0.71
8     pup               0.69
9     pet               0.69
10    felines           0.68
11    chihuahua         0.67
12    pooch             0.67
13    kitties           0.67
14    dachshund         0.67
15    poodle            0.66
16    stray_cat         0.66
17    Shih_Tzu          0.66
18    tabby             0.66
19    basset_hound      0.65
20    golden_retriever  0.65
Page 71
Most Similar Vectors to 'oxycontin' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     oxycontin           1.00
2     Oxycontin           0.79
3     Oxycodone           0.71
4     OxyContin           0.69
5     morphine_pills      0.67
6     hydrocodone         0.67
7     Lortab              0.67
8     oxycodone           0.65
9     OxyContin_pills     0.65
10    Hydrocodone         0.65
11    Alprazolam          0.64
12    Oxycodone_pills     0.63
13    Xanax               0.62
14    Lorcet              0.62
15    Percocet_pills      0.61
16    Valium              0.61
17    Diazepam            0.60
18    methadone_pills     0.60
19    hydrocodone_pills   0.60
20    prescription_pills  0.59
Page 72
Most Similar Vectors to 'iran' from word2vec 300d embeddings

Rank  Word         Cosine Similarity
1     iran         1.00
2     pakistan     0.66
3     israel       0.64
4     lebanon      0.62
5     russia       0.62
6     iraq         0.61
7     egypt        0.60
8     Irans        0.59
9     afghanistan  0.59
10    Iran         0.59
11    arabs        0.58
12    israeli      0.57
13    obama        0.56
14    arab         0.56
15    iraqi        0.56
16    japan        0.56
17    sri_lanka    0.56
18    america      0.55
19    korea        0.55
20    cuba         0.55
Page 73
Most Similar Vectors to 'Iran' from word2vec 300d embeddings

Rank  Word                Cosine Similarity
1     Iran                1.00
2     Tehran              0.83
3     Iranian             0.82
4     Islamic_republic    0.81
5     Islamic_Republic    0.81
6     Iranians            0.78
7     Teheran             0.76
8     Syria               0.75
9     Ahmadinejad         0.69
10    Irans               0.67
11    Larijani            0.66
12    North_Korea         0.64
13    Mottaki             0.63
14    clerical_regime     0.63
15    nuclear_ambitions   0.63
16    Khamenei            0.63
17    Ayatollah_Khamenei  0.63
18    Velayati            0.63
19    Ahmedinejad         0.63
20    Rafsanjani          0.62
Page 74
Recurrent Neural Networks for Sequential Data
Page 75

Page 76
Network Model for Predicting the Next Word in a Sentence

Sentence: The dog saw the cat on the wall.

[Figure: a network whose input is a one-hot binary encoding of the current word over the vocabulary {the, dog, saw, cat, on, wall, .} (e.g., 1000000, 0100000), whose hidden units h1, h2 have real-valued activations, and whose target output is a one-hot binary encoding of the next word]
Page 77
Standard Recurrent Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Different to a "feedforward" network: the hidden unit at position t now also has input from the hidden vector at time t-1.
Page 78
State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/
Page 79
State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– f_W in each position computes the RNN state
– This state is updated recursively from the previous state and the current input
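In equations, the cs231n slides write this update as h_t = f_W(h_{t-1}, x_t), e.g. h_t = tanh(W_hh h_{t-1} + W_xh x_t) for a vanilla RNN. A minimal NumPy sketch (dimensions and inputs here are arbitrary, for illustration only):

import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 8, 5
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))   # state -> state weights
W_xh = 0.1 * rng.normal(size=(hidden_dim, input_dim))    # input -> state weights

def rnn_step(h_prev, x_t):
    # one recurrence step: new state from previous state and current input
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

h = np.zeros(hidden_dim)                      # initial state
for x_t in rng.normal(size=(4, input_dim)):   # a sequence of 4 input vectors
    h = rnn_step(h, x_t)                      # the state carries context forward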
Page 80
State Computation in a Neural Network

Figures from http://cs231n.stanford.edu/slides/2017/

Key points:
– The x→f and f→h arrows are weight matrices
– The same weight matrices are (usually) used across the sequence
Page 81
Learning the Weights in a Recurrent Network

Figures from http://cs231n.stanford.edu/slides/2017/

Learning:
– the loss function compares predictions and targets
– compute the gradient (based on the error) and update the weights
Page 82
Example: Prediction of Sequences of Characters

Figures from http://cs231n.stanford.edu/slides/2017/
Page 83
Simulating Sequences from an RNN

Figures from http://cs231n.stanford.edu/slides/2017/
Page 84
Simulating Sequences from an RNN

Figures from http://cs231n.stanford.edu/slides/2017/
Page 85
Simulating Sequences from an RNN

Figures from http://cs231n.stanford.edu/slides/2017/
Page 86
Output from an RNN Model Trained on Shakespeare
KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour.
Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
DUKE VINCENTIO: Well, your wit is in the care of side and that.
Examples from "The Unreasonable Effectiveness of Recurrent Neural Networks", Andrej Karpathy, blog, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Page 87
Output from an RNN Model Trained on Cooking Recipes

From https://gist.github.com/nylki/1efbaa36635956d35bcc
Page 88
Different Recurrent Network Models

[Figure: RNN input/output configurations, illustrated by image classification, image captioning, sentiment analysis, machine translation, and synced sequence (video classification)]
Page 89
Part-of-Speech Tagging
Page 90
Part of Speech (POS) Tagging

• Common POS categories (or tags) in English:
  – Noun, verb, article, preposition, pronoun, adverb, conjunction, interjection

• However, there are many more specialized categories
  – e.g., proper nouns: 'Toronto', 'Smith', …
  – e.g., comparative adjectives: 'bigger', 'smaller', …
  – e.g., symbols: '3.12', '$', …

• Assigning POS categories to words in text is known as tagging
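For example, NLTK's off-the-shelf tagger assigns tags directly (a sketch; one-time downloads of 'punkt', 'averaged_perceptron_tagger', and 'universal_tagset' are required):

import nltk

tokens = nltk.word_tokenize("The negotiator was able to bridge the gap.")
print(nltk.pos_tag(tokens, tagset='universal'))
# Typically: [('The', 'DET'), ('negotiator', 'NOUN'), ('was', 'VERB'),
#  ('able', 'ADJ'), ('to', 'PRT'), ('bridge', 'VERB'), ('the', 'DET'),
#  ('gap', 'NOUN'), ('.', '.')]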
Page 91
Universal Tagset (as used in NLTK)

12 universal tags:
VERB - verbs (all tenses and modes)
NOUN - nouns (common and proper)
PRON - pronouns
ADJ  - adjectives
ADV  - adverbs
ADP  - adpositions (prepositions and postpositions)
CONJ - conjunctions
DET  - determiners
NUM  - cardinal numbers
PRT  - particles or other function words
X    - other: foreign words, typos, abbreviations
.    - punctuation

See "A Universal Part-of-Speech Tagset" by Slav Petrov, Dipanjan Das and Ryan McDonald for more details: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/
Page 92
Universal Tagset (used in NLTK)

Tag    Meaning              English Examples
ADJ    adjective            new, good, high, special, big, local
ADP    adposition           on, of, at, with, by, into, under
ADV    adverb               really, already, still, early, now
CONJ   conjunction          and, or, but, if, while, although
DET    determiner, article  the, a, some, most, every, no, which
NOUN   noun                 year, home, costs, time, Africa
NUM    numeral              twenty-four, fourth, 1991, 14:24
PRT    particle             at, on, out, over per, that, up, with
PRON   pronoun              he, their, her, its, my, I, us
VERB   verb                 is, say, told, given, playing, would
.      punctuation marks    . , ; !
X      other                ersatz, esprit, dunno, gr8, univeristy

(from Section 2.3 in Chapter 5 of the NLTK Book)
Page 93
POS Tagging Algorithms

• Tagging algorithms: "POS taggers"
  – Tagging is often done automatically with algorithms
  – These algorithms often use sequential (Markov) models
    • The tag for a particular token can depend on the words/tags before and after it
  – These models are trained using machine learning
    • using various word features and dictionaries as input
    • trained on manually labeled documents
  – Tagging performance is best on material like what the tagger was originally trained on (often news documents)

• Tags can be helpful "downstream" for various applications
  – e.g., for document classification we might want to use only nouns, adjectives, and verbs and ignore everything else
  – e.g., for information extraction we might focus only on nouns
Page 94
Challenges in POS Tagging

• Tagging words in text with their correct POS tags is not simply assigning words to tags using a lookup table

• Semantic context
  – "The negotiator was able to bridge the gap between the 2 sides"
  – Here 'bridge' is used as a verb even though we ordinarily think of it as a noun
  – The other words in the sentence and the grammatical structure allow us to interpret 'bridge' here as a verb

• Ambiguity, e.g.,
  – "The president was entertaining last night"
    • Both the adjective and verb tags for "entertaining" work here, i.e., there is ambiguity

• Tokenization issues
  – The algorithm must be able to deal with tokens such as "I'ld" or "pre-specified"
Page 95
Software for Part of Speech Tagging

• Many software packages and online tools available

• NLTK POS tagger

• The Stanford natural language group provides excellent POS taggers in several languages (English, German, Chinese, French, etc.)
  – http://nlp.stanford.edu/software/tagger.shtml
  – Uses the Penn Treebank tagset

• Online demo
  – http://demo.ark.cs.cmu.edu/parse
Page 96
Example: Removing Stop Words with a POS Tagger

• Filter out any term that does not belong to a (user-defined) target set of POS classes

SPEAKER  WORD    POS  TAG
PATIENT  if      IN   Preposition
PATIENT  you     PRP  Personal Pronoun
PATIENT  had     VBD  Verb, Past Tense
PATIENT  a       DT   Determiner
PATIENT  magic   JJ   Adjective
PATIENT  potion  NN   Noun
PATIENT  i       PRP  Personal Pronoun
PATIENT  'd      MD   Modal
PATIENT  love    VB   Verb, base form
PATIENT  to      TO   to
PATIENT  have    VB   Verb, base form
PATIENT  it      PRP  Personal Pronoun

Example rule: extract adjectives and nouns only
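A sketch of this rule in code: keep only terms tagged as adjectives or nouns (using the universal tagset, with the same NLTK downloads as in the earlier POS example):

import nltk

tokens = nltk.word_tokenize("If you had a magic potion I'd love to have it.")
tagged = nltk.pos_tag(tokens, tagset='universal')
kept = [word for word, tag in tagged if tag in {'ADJ', 'NOUN'}]
print(kept)   # typically ['magic', 'potion']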