SEASR Lab: Text Mining. High Performance Computing in the Humanities, Arts, and Social Science Workshop, UIUC/NCSA, July 28, 2008. Loretta Auvil, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

ICHASS Workshop Lab

Nov 28, 2014


Loretta Auvil

 
Transcript
Page 1: ICHASS Workshop Lab

SEASR Lab: Text Mining

High Performance Computing in the Humanities, Arts, and Social Science Workshop

UIUC/NCSA, July 28, 2008

Loretta Auvil

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Page 2: ICHASS Workshop Lab

SEASR

Meandre Workbench

Page 3: ICHASS Workshop Lab

Text Mining: Clustering Definition

•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
   –  Documents in one cluster are more similar to one another
   –  Documents in separate clusters are less similar to one another
•  Goal:
   –  Finding a correct set of documents
•  Similarity Measures:
   –  Euclidean distance if attributes are continuous
   –  Other problem-specific measures
      •  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
   –  Produce high quality clusters with
      •  high intra-class similarity
      •  low inter-class similarity
   –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
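The two similarity measures named above can be illustrated with a minimal Python sketch. It assumes each document is represented by raw word counts after whitespace tokenization. The function names (`term_counts`, `euclidean_distance`, `shared_words`) and the sample sentences are hypothetical and are not part of SEASR or Meandre.

```python
import math
from collections import Counter


def term_counts(text):
    """Count lowercase, whitespace-separated words (a simplifying assumption)."""
    return Counter(text.lower().split())


def euclidean_distance(a, b):
    """Euclidean distance between two word-count vectors over their joint vocabulary."""
    vocab = set(a) | set(b)
    # A Counter returns 0 for words it has not seen, so missing terms need no special case.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in vocab))


def shared_words(a, b):
    """Problem-specific measure: number of distinct words the two documents have in common."""
    return len(set(a) & set(b))


doc1 = term_counts("the whale swam past the ship")
doc2 = term_counts("the ship chased the whale")
doc3 = term_counts("a garden of roses")

print(euclidean_distance(doc1, doc2), shared_words(doc1, doc2))
print(euclidean_distance(doc1, doc3), shared_words(doc1, doc3))
```

Note that a smaller Euclidean distance means more similar documents, while a larger shared-word count means more similar documents, so the two measures point in opposite directions.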

Page 4: ICHASS Workshop Lab

Text Clustering

•  Load a single document
•  Segment that document approximately every 250 words (property can be adjusted)
•  Part of Speech Tagging
•  Term selection by Part of Speech
•  Cluster each segment based on similarity metric
•  Create visualization
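The segmentation step can be sketched in Python as follows. This is only an illustration: it cuts exactly every `segment_size` words, whereas the flow segments "approximately" every 250 words, and how the Meandre component chooses its boundaries is not stated in the slides. The function name `segment_words` is hypothetical.

```python
def segment_words(text, segment_size=250):
    """Split a document into consecutive segments of at most segment_size words."""
    words = text.split()
    return [words[i:i + segment_size] for i in range(0, len(words), segment_size)]


# A made-up 600-word document: w0 w1 ... w599
document = " ".join(f"w{i}" for i in range(600))
segments = segment_words(document)
print([len(s) for s in segments])
```

Each segment is then treated as its own small "document" for tagging, term selection, and clustering.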

Page 5: ICHASS Workshop Lab

Text Clustering Visualization

A dendrogram consists of many U-shaped lines connecting objects in a hierarchical tree. The height of each U represents the distance between the two objects being connected.
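To make the meaning of the U heights concrete, here is a minimal agglomerative clustering sketch in Python that records the distance at which each pair of clusters is merged. It assumes Euclidean distance and single linkage (the closest pair of members defines the distance between clusters). The slides do not say which linkage the HAC component uses, so that choice, the function names, and the sample points are all assumptions.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_linkage(points):
    """Merge the two closest clusters until one remains.

    Returns a list of (members_a, members_b, height) tuples, one per merge;
    each height corresponds to the height of one U in a dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0)]
merges = single_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

In this toy data the two nearby pairs merge first at low heights, and the final U joining the two groups is much taller, which is the visual cue a dendrogram gives for well-separated clusters.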

Page 6: ICHASS Workshop Lab

Lab Session

•  Open browser for Meandre Workbench to http://demo.seasr.org:1712
•  Login
   –  userid: admin
   –  Password: admin
   –  Server: demo.seasr.org
   –  Port: 1714

Page 7: ICHASS Workshop Lab

SEASR

Meandre Workbench

Repository Panel

Workspace

Details Panel

Output

Page 8: ICHASS Workshop Lab

Load an Existing Flow

•  Click on Flows in the Repository Panel to open the saved flows
•  Double-click on TextClustering2 from this list
•  Click on Run Flow to execute this flow on the Meandre server

Page 9: ICHASS Workshop Lab

Dendrogram Results

•  Clicking on a “cluster” shows it in blue and displays the list of words
•  Display shows the average frequency of each word within the cluster and its average frequency over all clusters
•  On Windows, you can double-click to drill into a cluster
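The slides do not define how the displayed averages are computed. One plausible reading, sketched below in Python, is the mean number of occurrences of a word per segment, taken first over the segments in the selected cluster and then over the segments of all clusters. The function name `average_frequency` and the sample data are hypothetical.

```python
from collections import Counter


def average_frequency(segments, word):
    """Mean number of occurrences of `word` per segment (0.0 for an empty list)."""
    if not segments:
        return 0.0
    return sum(Counter(segment)[word] for segment in segments) / len(segments)


# Made-up clusters, each holding tokenized segments.
clusters = {
    "A": [["whale", "sea", "whale"], ["whale", "ship"]],
    "B": [["garden", "rose"], ["rose", "whale"]],
}
all_segments = [segment for members in clusters.values() for segment in members]

within_a = average_frequency(clusters["A"], "whale")
overall = average_frequency(all_segments, "whale")
print(within_a, overall)
```

Under this reading, a word whose within-cluster average is well above its overall average is characteristic of that cluster.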

Page 10: ICHASS Workshop Lab

Changing Properties

Click on a component, then double-click in the Property Panel of the Details Panel

•  PushString
   –  string: paste a URL for a text document (ASCII text file)
•  TextSegmentation
   –  segment_size: setting for number of terms per document (segment)
•  HAC
   –  Distance metric: setting for metric
•  FilterPOS
   –  tag_list: POS to include in the list of terms
      •  Nouns: NN, NNP, NNPS, NNS, NP, NPS
      •  Verbs: VB, VBD, VBG, VBN, VBP, VBZ
      •  Adjectives: JJ, JJR, JJS, JJSS
      •  Adverbs: RB, RBR, RBS
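The effect of a `tag_list` setting can be sketched in Python as below. The sketch assumes the tagger has already produced `(word, tag)` pairs. The function `filter_by_pos` and the sample sentence are hypothetical and do not reflect the actual FilterPOS implementation.

```python
# Noun tags copied from the list above.
NOUN_TAGS = {"NN", "NNP", "NNPS", "NNS", "NP", "NPS"}


def filter_by_pos(tagged_tokens, tag_list):
    """Keep only the words whose part-of-speech tag appears in tag_list."""
    allowed = set(tag_list)
    return [word for word, tag in tagged_tokens if tag in allowed]


# Hand-tagged example sentence (tags assigned by hand, not by a real tagger).
tagged = [
    ("The", "DT"), ("whale", "NN"), ("swam", "VBD"),
    ("quickly", "RB"), ("past", "IN"), ("ships", "NNS"),
]

nouns = filter_by_pos(tagged, NOUN_TAGS)
print(nouns)
```

Changing the tag list, for example to nouns plus verbs, changes which terms each segment contributes to clustering and therefore the clusters themselves.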

Page 11: ICHASS Workshop Lab

Run Again

•  Change some properties and rerun…

Page 12: ICHASS Workshop Lab

Documentation

•  http://seasr.org/meandre/documentation/