University of Illinois Tools for Unstructured Text NICAR 2012 Loretta Auvil, UIUC Chase Davis, CIR February 2012

Mar 31, 2015

University of Illinois

Tools for Unstructured Text
NICAR 2012

Loretta Auvil, UIUC
Chase Davis, CIR

February 2012

Hottest Analytics

Overview

Sites that List Tools

• DiRT, Digital Research Tools
  • http://digitalresearchtools.pbworks.com

• KDnuggets
  • http://www.kdnuggets.com/software/text.html

• text-processing.com

• Discussion Groups
  • Text Analytics on LinkedIn
    • http://www.linkedin.com/groups/Text-Analytics-115439
  • Visual Analytics on LinkedIn
    • http://www.linkedin.com/groups/Visual-Analytics-80552

Bad News

• There is no open and freely available tool that is going to solve all your problems!!!

Good News

• There is a variety of tools that can be beneficial and must be used in combination to accomplish the goal!!!

Natural Language Processing (NLP)

• Tokenization

• Part of speech tagging

• Stemming

• Stop word removal

• Other transformations
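
A minimal sketch of three of the steps above (tokenization, stop word removal, stemming) in plain Python. The stop word list and the suffix-stripping rule are toy assumptions made for illustration, not what NLTK or any other listed tool does. Part of speech tagging is left out because it needs a trained model.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "were"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run tokenization, stop word removal, and stemming in sequence."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The reporters were reviewing the leaked documents.")
print(tokens)  # ['reporter', 'review', 'leak', 'document']
```

For real work, a library stemmer and stop word list will handle far more cases than this sketch.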

NLP Tools

• NLTK
  • http://www.nltk.org/
  • http://text-processing.com

• OpenNLP
  • http://incubator.apache.org/opennlp/

• Stanford CoreNLP
  • http://nlp.stanford.edu/software/corenlp.shtml

• Mallet
  • http://mallet.cs.umass.edu/

• GATE
  • http://gate.ac.uk/

• LingPipe
  • http://alias-i.com/lingpipe/

Entity Extraction

• Finding entities, like People, Locations, Time, etc.

• Some tools let you add your own entities (with seed terms)

• Tools

• OpenNLP

• Stanford CoreNLP

• OpenCalais

• GATE
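
The seed-term idea can be sketched as a simple dictionary lookup. This is only an illustration under stated assumptions: the seed lists, labels, and function names below are hypothetical, and the listed tools rely on trained statistical models rather than plain string matching.

```python
import re

# Hypothetical seed terms; in practice these would come from your own reporting.
SEED_TERMS = {
    "PERSON": ["Loretta Auvil", "Chase Davis"],
    "ORGANIZATION": ["University of Illinois", "CIR"],
    "LOCATION": ["Illinois", "St. Louis"],
}

def build_pattern(seed_terms):
    """Compile one alternation regex, longest terms first, so that
    'University of Illinois' wins over the shorter 'Illinois'."""
    lookup = {
        term.lower(): label
        for label, terms in seed_terms.items()
        for term in terms
    }
    ordered = sorted(lookup, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    return pattern, lookup

def extract_entities(text, seed_terms=SEED_TERMS):
    """Return (matched text, label, character offset) for every seed term found."""
    pattern, lookup = build_pattern(seed_terms)
    return [
        (m.group(0), lookup[m.group(0).lower()], m.start())
        for m in pattern.finditer(text)
    ]

text = ("Chase Davis of CIR spoke in St. Louis with Loretta Auvil "
        "of the University of Illinois.")
entities = extract_entities(text)
for mention, label, offset in entities:
    print(offset, label, mention)
```

The character offsets make it possible to link each entity back to its place in the source document.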

Journalism Application

• Structuring unstructured data

• Social networks of entities

• Clustering

• Plotting data on a timeline

• Plotting locations on a map

Information Extraction

• Automatically identifies and extracts binary relationships from English sentences

• TextRunner

• NACTEM, MEDIE
  • http://www.nactem.ac.uk/medie/

• ReVerb
  • http://reverb.cs.washington.edu/

Information Extraction: ReVerb

Question and Answer

• Parses the question to determine what type of information needs to be returned

• Leverages approaches like information extraction to retrieve the results

Journalism Application

• Engagement! Help users find facts relevant to their situations.

Document Classification

• Starts with a training set

• Predicts which class a document belongs to

• Leverages pure data mining approaches like Naïve Bayes, Decision Trees, Neural Networks

• Tools
  • NLTK
  • Mallet
  • Weka
  • Rapid Miner
  • GATE
  • Meandre
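
To make the training-set idea concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The four training documents and their labels are invented for illustration, and the function names are assumptions; none of this is the API of the tools listed above.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (list_of_tokens, class_label) pairs."""
    class_counts = Counter()            # documents per class
    word_counts = defaultdict(Counter)  # word occurrences per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, model):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior for the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log(
                    (word_counts[label][t] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set.
training_set = [
    ("senate passes budget bill".split(), "politics"),
    ("governor vetoes tax bill".split(), "politics"),
    ("team wins championship game".split(), "sports"),
    ("coach praises team defense".split(), "sports"),
]
model = train(training_set)
print(predict("senate debates tax bill".split(), model))  # politics
print(predict("team loses game".split(), model))          # sports
```

A real training set needs many more labeled documents per class than this toy example.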

Journalism Application

• IBM's ManyBills project
  • Identifies the topic of each section in a Congressional bill for the purposes of identifying outliers.
  • For example, if a Congressman proposes a bill about the environment, but it has a section deep down about banking regulation, ManyBills would identify that as an outlier and highlight it.

Document Similarity/Clustering

• TF-IDF (Term frequency * inverse document frequency)

• Overview project (AP)

• Tools
  • GATE
  • Rapid Miner
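
As a rough sketch of the TF-IDF weighting named above, the plain-Python example below weights each term by term frequency times inverse document frequency and compares documents with cosine similarity. The three documents are invented, and the unsmoothed log(N / df) form is one common variant chosen here as an assumption; the listed tools may use a different formula.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy documents.
docs = [
    "the bill limits payday lending fees".split(),
    "the bill caps payday lending rates".split(),
    "the city council approved a new park".split(),
]
vecs = tf_idf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(sim_01 > sim_02)  # True: the two lending bills are more alike
```

Because "the" occurs in every document, its inverse document frequency is zero and it contributes nothing to the similarity.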

Journalism Application

• Identifying copycat legislation from year to year

• Clustering documents to find trends

Topic Modeling

• Exploratory approach to find patterns by finding words that frequently occur together

• A document can have multiple topics

• Words can exist in multiple topics

• Tools
  • Mallet uses LDA (latent Dirichlet allocation)
  • Other implementations as well…
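
For intuition only, here is a compact sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is not Mallet's implementation. The hyperparameters (alpha, beta), the iteration count, and the toy documents are all assumptions, and results on such a tiny corpus vary with the random seed.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # count of topic k in doc d
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                              # total words in topic k
    assignments = []
    # Give every word occurrence a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Invented toy corpus.
docs = [
    "budget tax senate vote tax".split(),
    "senate vote budget bill".split(),
    "team game coach season game".split(),
    "coach team season win".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top_words)
```

The per-document counts in doc_topic show how a single document can mix several topics, and the same word can carry counts in more than one topic.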

Topic Exploration

• Topical Guide
  • http://tg.byu.edu/

• Tmve (Topic Model Visualization Engine)
  • http://code.google.com/p/tmve/

Topical Guide

Tmve

Journalism Application

• Reporting tool for making sense of a corpus

• Isolating topics allows the user to focus only on the documents in a corpus that are relevant.

• There exists a clear potential for more data visualization.

Automatic Summarization

• Identifies sentences from among the documents

• Identifies common information conveyed across all the documents and then reformulates new sentences expressing that information

• Aims to combine the main themes with completeness, readability, and conciseness

• Lots of algorithms, but not really software tools to download to run on your collection

• Meandre implements a HITS algorithm that identifies sentences but does not reformat them
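
One way to picture HITS-based sentence selection is sketched below. This is an assumption about how such a ranking could work, not Meandre's actual component: sentences are treated as hubs that point at the terms they contain, terms are treated as authorities, and the highest-scoring sentences are returned unchanged. The example sentences are invented.

```python
import math
import re

def hits_sentence_scores(sentences, iterations=50):
    """Score sentences with HITS on a bipartite sentence-term graph."""
    term_sets = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    terms = set().union(*term_sets)
    hub = [1.0] * len(sentences)
    for _ in range(iterations):
        # Authority of a term: sum of hub scores of the sentences containing it.
        auth = {
            t: sum(h for h, ts in zip(hub, term_sets) if t in ts)
            for t in terms
        }
        # Hub of a sentence: sum of authority scores of its terms.
        hub = [sum(auth[t] for t in ts) for ts in term_sets]
        # Normalize so the scores do not grow without bound.
        norm = math.sqrt(sum(h * h for h in hub)) or 1.0
        hub = [h / norm for h in hub]
    return hub

def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = hits_sentence_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

# Invented example sentences.
sentences = [
    "The council approved the budget.",
    "The budget raises property taxes.",
    "Residents protested outside.",
    "The council said the budget protects schools.",
]
print(summarize(sentences, k=1))
```

Removing stop words before scoring would keep common words like "the" from inflating the scores.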

Journalism Application

• Newsblaster
  • Summarizing all the news on the web
  • Every night, the system crawls a series of Web sites, downloads articles, groups them together into "clusters" about the same topic, and summarizes each cluster.
  • http://newsblaster.cs.columbia.edu/

• Ultimate Research Assistant
  • http://ultimate-research-assistant.com

Sentiment Analysis

• NLTK
  • http://www.text-processing.com/demo/sentiment

• APIs
  • AlchemyAPI
  • Open Dover
  • Lexalytics
  • Saplo

• Meandre (concept tracking)

• Sentiment Analysis Symposium
  • http://sentimentsymposium.com/agenda.html
  • May 8, 2012 in New York

Journalism Application

• Tracking Twitter sentiment about political candidates

• Comparing the tone of political statements over time or between candidates

Analysis Frameworks

• Meandre
  • http://seasr.org/meandre

• DocumentCloud
  • www.documentcloud.org/

• Rapid Miner
  • http://rapid-i.com/

• Weka
  • http://www.cs.waikato.ac.nz/ml/weka/index.html

Meandre Workbench

• Web-based UI

• Components and flows are retrieved from server

• Additional locations of components and flows can be added to server

• Create flow using a graphical drag and drop interface

• Change property values

• Execute the flow

[Screenshot panels: Locations, Components, Flows]

Meandre Services from Firefox Plugin

Tag Cloud Analysis

Readability Analysis

Automatic Summarization

Network Analysis

Location Entity to Google Map

Date Entity to Simile Timeline

Example: Zotero, SEASR, Protovis, Google Maps, Simile

Topic Modeling

Uses Mallet topic modeling to cluster nouns from almost 4000 documents from the 19th century, with 10 segments per document

The example clusters the Bible and shows 8 topics with at most 200 keywords per topic

Concept Mapping

Sentiment Analysis: six core emotions (Love, Joy, Surprise, Anger, Sadness, Fear)

Correlation Analysis

• Corrected OCR errors with spellchecking