DH Tools: Introduction to Text Analysis. Cameron Buckner, Visiting Assistant Professor, Department of Philosophy

DH Tools Workshop #1: Text Analysis

Nov 01, 2014


cjbuckner

A text extraction workshop delivered by Cameron Buckner on Friday, October 18th, 2012 as part of the University of Houston Digital Humanities Initiative.
Transcript
Page 1: DH Tools Workshop #1:  Text Analysis

DH TOOLS: Introduction to Text Analysis

Cameron Buckner

Visiting Assistant Professor

Department of Philosophy


Page 2: DH Tools Workshop #1:  Text Analysis

Our Initiative
• Promote, facilitate, interact
• Reading group
• Speaker series
• Infrastructure advocacy
• Tools workshops
• Grant-writing support

http://www.uh.edu/class/digitalhumanities/

Page 3: DH Tools Workshop #1:  Text Analysis

Roadmap

Goal today: Analyze texts using cutting-edge methods from computational psycholinguistics with an off-the-shelf tool, word2word

1. What can you do with text analysis?

2. A little bit of theory: Semantic spaces

3. BEAGLE: The holographic lexicon

4. MDS: Visualizing multidimensional networks

5. Examples

6. Hands-on play

Page 4: DH Tools Workshop #1:  Text Analysis

What is DH?
• Computation and interpretation
• The use of computational tools for the production, exploration, analysis, and dissemination of humanistic knowledge
• Thread common between new and old: pattern recognition
• Includes:
  • Digitization and archiving, markup
  • Analysis & visualization
  • Search & dissemination
  • Pedagogy

Page 5: DH Tools Workshop #1:  Text Analysis

Methods of Text Analysis I
• Statistical analysis, information extraction, machine learning
• Syntactic: word frequencies (Google n-grams), vocabulary usage, stylometry (authorship and genre), PageRank

http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
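As a rough illustration of the word-frequency idea behind charts like the one linked above, the sketch below counts words in two texts and compares relative frequencies. The two "speeches" are made-up strings and the tokenizer is a deliberately crude assumption; this is not how the linked graphic or word2word was built.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, pull out word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def relative_frequency(counts, word):
    """Share of all tokens that are `word`; comparable across text lengths."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Two tiny made-up "speeches" (hypothetical data, for illustration only).
speech_a = "Jobs and families. Jobs for working families."
speech_b = "Freedom and business. Freedom for small business."

counts_a = word_counts(speech_a)
counts_b = word_counts(speech_b)

for word in ("jobs", "freedom", "and"):
    print(word,
          round(relative_frequency(counts_a, word), 3),
          round(relative_frequency(counts_b, word), 3))
```

Dividing by the total token count is the design choice that matters here: raw counts would favor whichever speaker simply talked longer.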

Page 6: DH Tools Workshop #1:  Text Analysis

Methods of Text Analysis II
• Semantic: tf-idf, latent semantic analysis, latent Dirichlet allocation, entropy-based measures, ontologies
• Aim to model relevance, semantic similarity, taxonomic relationships, object properties and relations
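Of the semantic measures listed, tf-idf is the simplest to show in a few lines. The sketch below uses one common textbook weighting (term frequency times the log of inverse document frequency); other variants exist, and the three documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus.
docs = [
    "the mind is a machine",
    "the mind is a mystery",
    "the machine is a tool",
]
weights = tf_idf(docs)
print(sorted(weights[1].items(), key=lambda kv: -kv[1]))
```

Words that appear in every document ("the", "is", "a") get weight zero, which is why tf-idf is often used as a cheap relevance signal.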

Page 7: DH Tools Workshop #1:  Text Analysis

Reminders
• Be creative and have fun, but if you want to publish…
• Be principled:
  • Junk in, junk out
  • Always know assumptions required by a method
  • Analyses should hold up under trivial transformations of data representation
• Be prepared for pragmatic design decisions
• Go in with hypotheses and structured questions
• Confirm with careful humanistic interpretation

Page 8: DH Tools Workshop #1:  Text Analysis

The Mental Lexicon
• A “mental dictionary”
• Contains information about:
  • Word meaning, grammatical roles, taxonomic relations, typical properties
• Behavioral indicators: recognition speed, synonymy and relevance judgments, priming, frequency effects, categorization

Page 9: DH Tools Workshop #1:  Text Analysis

BEAGLE
• A model that learns (unsupervised) a holographic mental lexicon automatically from text
• History: Two approaches to semantic analysis
  • Co-occurrence based measures (“bag of words”, LSA, tf-idf)
    • Good at determining relevance, bad at determining roles and relations
  • Order-based measures (n-gram models, generative grammars, hidden Markov models)
    • Good at identifying grammatical and structural relations, bad at identifying relevance and meaning
• Challenge: Can the two be combined?

Page 10: DH Tools Workshop #1:  Text Analysis

Context + Role
• Assumption: People acquire an idiosyncratic mental lexicon from patterns of co-occurrence and syntactic relationships they encounter in natural language.
• “You shall know a word by both the company it keeps and how it keeps it.”
• Goal: If we could build a representation of a text’s context/role distributions, we could predict the structure of a mental lexicon that produced a corpus and/or that would be produced by it
  • Texts as “mental fingerprints”

Page 11: DH Tools Workshop #1:  Text Analysis

How Holograms Work

Page 12: DH Tools Workshop #1:  Text Analysis

Basic Vector Approach

1. Start with a multi-dimensional vector space

2. Each term meaning is initially represented by a random, constant environment vector and an empty memory vector

3. Associations between terms can be represented by adding or averaging their environment vectors into their memory vectors

4. Each time terms co-occur, their memory vectors become closer in multi-dimensional similarity space
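The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not word2word's implementation: the dimensionality, the Gaussian initialization, the function name `learn_context_vectors`, and the toy sentences are all assumptions made for the example.

```python
import numpy as np

def learn_context_vectors(sentences, dim=1024, seed=0):
    """Build environment and memory vectors from tokenized sentences."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    # Step 2: a fixed random environment vector per term (never updated)...
    env = {w: rng.normal(0.0, 1.0 / np.sqrt(dim), dim) for w in vocab}
    # ...and an initially empty memory vector per term.
    mem = {w: np.zeros(dim) for w in vocab}
    # Step 3: add the environment vectors of a word's sentence-mates
    # into that word's memory vector.
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, other in enumerate(sentence):
                if i != j:
                    mem[word] += env[other]
    return env, mem

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus, already tokenized.
sentences = [
    ["dog", "chased", "cat"],
    ["dog", "bit", "cat"],
    ["senator", "gave", "speech"],
    ["senator", "wrote", "speech"],
]
env, mem = learn_context_vectors(sentences)

# Step 4: words that share contexts end up closer than words that do not.
print(cosine(mem["dog"], mem["cat"]), cosine(mem["dog"], mem["senator"]))
```

Random high-dimensional vectors are nearly orthogonal, so any similarity that shows up between memory vectors comes from shared context rather than from the initialization.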

Page 13: DH Tools Workshop #1:  Text Analysis

Representing Order Info
• Convolution: compressing outer-product matrix of two term vectors so that the product contains recoverable information about both
• Example: z = x * y
  • Association vector z contains information about both x and y
  • Can (approximately) reconstruct source vector y by probing z (deconvolution) with x (and vice versa)
• Combined BEAGLE memory vector: Context memory comes from vector addition, and order information comes from n-gram binding using convolution
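The binding and probing described above can be tried directly with circular convolution and circular correlation, computed here through the FFT. This is a sketch under assumptions: the vector size, the Gaussian vectors, and the helper names `cconv` and `ccorr` are choices made for the example, and the reconstruction is noisy by design.

```python
import numpy as np

def cconv(x, y):
    """Circular convolution: binds x and y into one vector of the same size."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(y), n=len(x))

def ccorr(x, z):
    """Circular correlation: probes z with x to approximately recover y."""
    return np.fft.irfft(np.conj(np.fft.rfft(x)) * np.fft.rfft(z), n=len(x))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dim = 2048
rng = np.random.default_rng(0)
x, y, unrelated = (rng.normal(0.0, 1.0 / np.sqrt(dim), dim) for _ in range(3))

z = cconv(x, y)       # association vector z = x * y
y_hat = ccorr(x, z)   # noisy reconstruction of y from z and x
x_hat = ccorr(y, z)   # and vice versa

print("y_hat vs y:        ", cosine(y_hat, y))
print("y_hat vs unrelated:", cosine(y_hat, unrelated))
```

The reconstruction is only approximate, so in practice the probe result is compared against known vectors and the closest match is taken. Note that plain circular convolution is commutative, which is why an order-sensitive variant is needed for encoding word order.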

Page 14: DH Tools Workshop #1:  Text Analysis
Page 15: DH Tools Workshop #1:  Text Analysis
Page 16: DH Tools Workshop #1:  Text Analysis

Combined Memory Vector

• m = memory vector
• e = initial random environment vector
• p = position in sentence
• lambda = constant chunking factor (size of n-gram window)
• bind i,j = a non-commutative convolution of constant order vector with other environment vectors in n-gram
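The slide's equation did not survive in this transcript, so the following is only a simplified sketch of how the listed symbols might fit together, not a reconstruction of it. It assumes a bigram-only window, a random placeholder playing the role of the constant order vector, and fixed random permutations as one way to make the convolution non-commutative; the names `bind`, `learn`, `PERM_LEFT`, and `PERM_RIGHT` are hypothetical.

```python
import numpy as np

DIM = 1024
rng = np.random.default_rng(0)

# Assumption: permuting each argument differently before convolving
# makes the binding order-sensitive.
PERM_LEFT = rng.permutation(DIM)
PERM_RIGHT = rng.permutation(DIM)

def bind(a, b):
    """Directional circular convolution: bind(a, b) differs from bind(b, a)."""
    fa = np.fft.rfft(a[PERM_LEFT])
    fb = np.fft.rfft(b[PERM_RIGHT])
    return np.fft.irfft(fa * fb, n=DIM)

def random_vector():
    return rng.normal(0.0, 1.0 / np.sqrt(DIM), DIM)

def learn(sentences):
    vocab = sorted({w for s in sentences for w in s})
    env = {w: random_vector() for w in vocab}   # e: environment vectors
    placeholder = random_vector()               # constant order vector
    mem = {w: np.zeros(DIM) for w in vocab}     # m: memory vectors
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Context: sum the environment vectors of the other words.
            context = sum((env[w] for j, w in enumerate(sentence) if j != i),
                          np.zeros(DIM))
            # Order: bind the placeholder with each immediate neighbour.
            order = np.zeros(DIM)
            if i > 0:
                order += bind(env[sentence[i - 1]], placeholder)
            if i < len(sentence) - 1:
                order += bind(placeholder, env[sentence[i + 1]])
            mem[word] += context + order
    return env, mem

env, mem = learn([["dog", "bites", "man"], ["man", "bites", "dog"]])
```

A wider window would add bindings for trigrams and longer chunks around each position; that part is omitted here to keep the sketch short.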

Page 17: DH Tools Workshop #1:  Text Analysis
Page 18: DH Tools Workshop #1:  Text Analysis
Page 19: DH Tools Workshop #1:  Text Analysis

Resonance retrieval…

Page 20: DH Tools Workshop #1:  Text Analysis

So, BEAGLE method

1. Choose number of dimensions for vector space, size of n-gram window for order info

2. Clean up source documents using standard NLP (stop words, stemmers, etc.)

3. Learn context and order vectors from corpus, combine

4. Select words of interest

5. Visualize multi-dimensional space using favorite method (e.g. MDS)
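Steps 4 and 5 need a way to compare the learned vectors for the selected words. A common choice, assumed here rather than taken from word2word, is cosine distance; the three-dimensional vectors below are made up purely for illustration.

```python
import numpy as np

def cosine_distance_matrix(vectors, words):
    """Pairwise cosine distances (1 - cosine similarity) among chosen words."""
    m = np.array([vectors[w] for w in words], dtype=float)
    m /= np.linalg.norm(m, axis=1, keepdims=True)  # unit-length rows
    return 1.0 - m @ m.T

# Hypothetical "memory vectors"; real ones would have many more dimensions.
vectors = {
    "mind":   [1.0, 0.9, 0.0],
    "brain":  [0.9, 1.0, 0.1],
    "senate": [0.0, 0.1, 1.0],
}
words = ["mind", "brain", "senate"]
dist = cosine_distance_matrix(vectors, words)
print(np.round(dist, 3))
```

The resulting square matrix is the kind of input a visualization method such as MDS expects.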

Page 21: DH Tools Workshop #1:  Text Analysis

Limitations of BEAGLE
• Only considers 1-sentence windows
• Lexical ambiguity
• Valence (e.g. synonyms, antonyms)

Page 22: DH Tools Workshop #1:  Text Analysis

MDS
• A way to view a multi-dimensional similarity space
• Collapses multi-dimensional space in way that tries to mutually preserve distances between vectors
• Collapsing dimensions often reveals most significant [higher-order] dimensions
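One concrete version of this idea is classical (Torgerson) MDS, sketched below with NumPy. It is only one of several MDS variants and is not necessarily the one word2word uses; the random points stand in for word vectors.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a square distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to get an inner-product matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Hypothetical data: 6 points in 5 dimensions, collapsed to 2 for plotting.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 5))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist, n_components=2)
print(coords)
```

The axes that survive are those along which the points vary most, which is why the collapsed picture often shows the most significant dimensions of the original space. The 2-D distances are approximations, so nearby points in the plot should be checked against the full similarity matrix.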

Page 23: DH Tools Workshop #1:  Text Analysis

Uses
• How do two academic reference works compare in their coverage of a discipline?
• Biases? Overlap?

InPhO-Semantics

Credit: Robert Rose

Page 24: DH Tools Workshop #1:  Text Analysis

Black = SEP, Red = IEP

Credit: Jun Otsuka

Page 25: DH Tools Workshop #1:  Text Analysis

Political rhetoric
• What can we learn from the “semantic space” derived from a party or candidate’s rhetoric?
  • Central issues?
  • Key comparisons?
  • Ideological focus/big tent?
  • Location on ideological spectrum?
• Example: compare speeches from Republican and Democratic political conventions

Page 26: DH Tools Workshop #1:  Text Analysis

Heat Map: Terms most diagnostic of a speech’s being delivered by a Democrat

“Hotter” indicates more diagnostic in comparison. Hottest terms = aarp, experience, affordable, abuelo, billionaires, afghanistan, beijing, biofuels, aliens

Page 27: DH Tools Workshop #1:  Text Analysis

Character Analysis
• Moretti: “protagonist is the character that minimized the sum of the distances to all other vertices”
• (But Moretti did it by hand!)
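Moretti's criterion is easy to automate once a character network exists. The sketch below computes it with breadth-first search on an unweighted graph; the network is a hypothetical, heavily simplified one invented for this example, and the function assumes the graph is undirected and connected.

```python
from collections import deque

def distance_sums(graph):
    """For each character, sum the shortest-path distances to every other one."""
    sums = {}
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        sums[start] = sum(dist.values())
    return sums

# Hypothetical interaction network: an edge means two characters interact.
graph = {
    "Hamlet":   ["Horatio", "Claudius", "Gertrude", "Ophelia"],
    "Horatio":  ["Hamlet"],
    "Claudius": ["Hamlet", "Gertrude", "Polonius"],
    "Gertrude": ["Hamlet", "Claudius"],
    "Ophelia":  ["Hamlet", "Polonius"],
    "Polonius": ["Claudius", "Ophelia"],
}
sums = distance_sums(graph)
protagonist = min(sums, key=sums.get)
print(protagonist, sums)
```

With similarity scores instead of a yes/no network, the same idea applies to weighted distances, but the shortest-path step would need a weighted algorithm such as Dijkstra's.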

Page 28: DH Tools Workshop #1:  Text Analysis

Character similarity analysis from A Dance with Dragons

Page 29: DH Tools Workshop #1:  Text Analysis

Acknowledgements

Brent Kievit-Kylar: word2word

Mike Jones: BEAGLE

InPhO Team