SEASR Lab: Text Mining
High Performance Computing in the Humanities, Arts, and Social Science Workshop
UIUC/NCSA July 28, 2008
Loretta Auvil
National Center for Supercomputing Applications, University of Illinois at Urbana Champaign
SEASR
Meandre Workbench
Text Mining: Clustering Definition
• Given: Set of documents and a similarity measure among documents
• Find: Clusters such that
  – Documents in one cluster are more similar to one another
  – Documents in separate clusters are less similar to one another
• Goal:
  – Finding a correct set of documents
• Similarity Measures:
  – Euclidean distance if attributes are continuous
  – Other problem-specific measures
    • e.g., how many words are common in these documents
• Evaluation: What Is Good Clustering?
  – Produce high quality clusters with
    • high intra-class similarity
    • low inter-class similarity
  – Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
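The two kinds of similarity measure named above can be sketched in a few lines of Python. This is only an illustration: the function names `euclidean_distance` and `shared_word_count` are hypothetical, and the whitespace tokenization is an assumption, not how the workshop components necessarily work.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_word_count(doc_a, doc_b):
    """Number of distinct lowercase words that appear in both documents.

    Assumption: words are separated by whitespace; punctuation is not handled.
    """
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))               # 5.0
print(shared_word_count("the whale swims", "The whale sings"))  # 2
```

Note that the first function is a distance (smaller means more similar) while the second is a similarity (larger means more similar), so a clustering routine has to know which convention it is given.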
Text Clustering
• Load a single document
• Segment that document approximately every 250 words (property can be adjusted)
• Part of Speech Tagging
• Term selection by Part of Speech
• Cluster each segment based on similarity metric
• Create visualization
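The segmentation step can be sketched as follows. This is a minimal illustration that splits on whitespace into fixed-size chunks; the function name `segment_words` is hypothetical, and the actual component may choose segment boundaries differently (the slide says "approximately").

```python
def segment_words(text, segment_size=250):
    """Split text into consecutive lists of at most segment_size words.

    Assumption: words are whitespace-separated and boundaries fall exactly
    every segment_size words; the last segment may be shorter.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be positive")
    words = text.split()
    return [words[i:i + segment_size] for i in range(0, len(words), segment_size)]


# A synthetic 600-word document yields two full segments and one partial one.
sample = " ".join(f"w{i}" for i in range(600))
segments = segment_words(sample)
print([len(s) for s in segments])  # [250, 250, 100]
```

Each resulting segment is then treated as its own small document for tagging, term selection, and clustering.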
Text Clustering Visualization
Dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the distance between the two objects being connected
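To make the "height of each U" concrete, here is a small pure-Python sketch of hierarchical agglomerative clustering that records the distance at which each pair of clusters is merged. It assumes single linkage (distance between the closest members), which is an illustrative choice; the slides do not say which linkage the HAC component uses, and `single_linkage` is a hypothetical name.

```python
def single_linkage(points, distance):
    """Agglomerative clustering with single linkage.

    Returns a list of merges as (members_a, members_b, height), where the
    members are tuples of point indices and height is the merge distance,
    i.e. the height of the corresponding U in a dendrogram.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(distance(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Three points on a line: 0 and 1 merge first (height 1.0), then 5 joins (height 4.0).
points = [0.0, 1.0, 5.0]
for left, right, height in single_linkage(points, lambda a, b: abs(a - b)):
    print(left, right, height)
```

In the lab flow the "points" would be the term vectors of the document segments rather than single numbers.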
Lab Session
• Open browser for Meandre Workbench to http://demo.seasr.org:1712
• Login
  – userid: admin
  – Password: admin
  – Server: demo.seasr.org
  – Port: 1714
SEASR
Meandre Workbench
[Screenshot of the workbench with labeled areas: Repository Panel, Workspace, Details Panel, Output]
Load an Existing flow
• Click on Flows in the Repository Panel to open the saved flows
• Double click on TextClustering2 from this list
• Click on Run Flow to execute this flow on the Meandre server
Dendrogram Results
• Clicking on a “cluster” shows it in blue and displays the list of words
• Display shows the average frequency of each word within the cluster and its average frequency over all clusters
• On Windows, you can double click to drill into a cluster
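The two averages shown in the display can be illustrated with a short sketch. It assumes "average frequency" means the mean number of occurrences of a word per segment, first over the segments of one cluster and then over all segments; the slides do not define the statistic precisely, and `average_frequency` is a hypothetical name.

```python
from collections import Counter


def average_frequency(segments, word):
    """Mean number of occurrences of word per segment (0.0 for no segments)."""
    if not segments:
        return 0.0
    return sum(Counter(seg)[word] for seg in segments) / len(segments)


# Toy data: one selected cluster of two segments, plus two other segments.
cluster = [["whale", "sea", "whale"], ["whale", "ship"]]
others = [["ship", "sea"], ["sea", "sea"]]

print(average_frequency(cluster, "whale"))           # 1.5
print(average_frequency(cluster + others, "whale"))  # 0.75
```

A word whose within-cluster average is well above its overall average is one that helps characterize that cluster.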
Changing Properties
Click on a component, then double click in the Property Panel of the Details Panel
• Push String
  – string: paste a URL for a text document (ASCII text file)
• Text Segmentation
  – segment_size: setting for number of terms per document (segment)
• HAC
  – Distance metric: setting for metric
• Filter POS
  – tag_list: POS to include in the list of terms
    • Nouns: NN, NNP, NNPS, NNS, NP, NPS
    • Verbs: VB, VBD, VBG, VBN, VBP, VBZ
    • Adjectives: JJ, JJR, JJS, JJSS
    • Adverbs: RB, RBR, RBS
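The effect of the tag_list property can be sketched as a simple filter over tagged tokens. This assumes the tagger output is available as (word, tag) pairs, which is an assumption made for illustration; `filter_by_pos` and `NOUN_TAGS` are hypothetical names, and the noun tags are copied from the list above.

```python
# Noun tags as listed on the slide.
NOUN_TAGS = {"NN", "NNP", "NNPS", "NNS", "NP", "NPS"}


def filter_by_pos(tagged_tokens, tag_list):
    """Keep the words whose part-of-speech tag appears in tag_list."""
    allowed = set(tag_list)
    return [word for word, tag in tagged_tokens if tag in allowed]


# Hand-tagged toy sentence; a real flow would get tags from the POS tagger.
tagged = [("The", "DT"), ("whale", "NN"), ("swims", "VBZ"), ("quickly", "RB")]

print(filter_by_pos(tagged, NOUN_TAGS))      # ['whale']
print(filter_by_pos(tagged, ["VBZ", "RB"]))  # ['swims', 'quickly']
```

Narrowing the tag list to nouns, for example, makes the clustering depend on what the segments are about rather than on function words.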
Run Again
• Change some properties and rerun…
Documentation
• http://seasr.org/meandre/documentation/