Text is fun: Statistical exploration of large corpora
Siva Reddy
Lexical Computing Ltd, UK
http://sketchengine.co.uk
Centre for Exact Humanities (CEH), IIIT Hyderabad, India
May 14 2012
When do you say two words are similar?
Distributional Hypothesis (Harris, 1954)
The words that occur in similar contexts tend to have similar meaning
e.g. laptop, computer
Backbone for Vector Space Model of Semantics.
Firth (Firth, 1957)
You shall know a person from his friends - Chinese Proverb
You shall know a word from its context - Firth’s Principle
Bag of words hypothesis
Two documents tend to be similar if they have a similar distribution of similar words
Vector Space Models (VSMs) of Semantics
Interpret semantics using VSM
Backbone: Distributional Hypothesis
Text entity (we are interested in) as a vector (point) in a multidimensional space.
Context of the entity as dimensions
Existing methods represent knowledge in VSMs mainly in three types (Turney and Pantel, 2010):
term-document
term-context
pair-pattern
Term-Document: (Salton et al., 1975)
d1: Human machine interface for Lab ABC computer applications
Image courtesy: (Landauer et al., 1998)
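To make the term-document idea concrete, here is a minimal sketch of building a term-document count matrix. Only d1 comes from the slide; d2 and d3, the `tokenize` helper, and the lowercase whitespace tokenization are assumptions made for illustration, not part of the original talk.

```python
from collections import Counter

# d1 is the document shown on the slide; d2 and d3 are invented
# purely for illustration.
docs = {
    "d1": "Human machine interface for Lab ABC computer applications",
    "d2": "A computer interface for human users",
    "d3": "Graph theory applications",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# One bag of words (term -> count) per document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# The vocabulary is the union of all terms seen in any document.
vocab = sorted(set().union(*counts.values()))

# Rows are terms, columns are documents (in the order d1, d2, d3).
matrix = {term: [counts[name][term] for name in docs] for term in vocab}

for term in vocab:
    print(f"{term:<14}", matrix[term])
```

Each column of this matrix is a document vector, and each row shows how one term is distributed over the documents.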
Term-Document: (Salton et al., 1975)
Document similarity can be found using cosine similarity:
$$\mathrm{sim}(D_1, D_2) = \frac{D_1 \cdot D_2}{\lVert D_1 \rVert \, \lVert D_2 \rVert}$$
Image courtesy: (Salton et al., 1975)
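The cosine similarity between two document vectors can be sketched in a few lines of Python. The vectors `D1`, `D2`, and `D3` below are made-up term counts, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length count vectors."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm1 * norm2)


# Hypothetical term counts over a four-term vocabulary.
D1 = [1, 1, 0, 2]
D2 = [2, 2, 0, 4]  # same direction as D1, twice the length
D3 = [0, 0, 3, 0]  # shares no terms with D1

print(cosine_similarity(D1, D2))
print(cosine_similarity(D1, D3))
```

Because the measure depends only on the angle between the vectors, D1 and D2 come out as maximally similar even though D2 has twice the counts, while D1 and D3, which share no terms, score zero.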
Term-Context: Word Space Model
Meaning of a word as a vector (Schütze, 1998)
Meaning of a word is represented as a cooccurrence vector built from a corpus
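As a rough illustration of a co-occurrence vector, the sketch below counts the words that appear within a fixed window around each target word. The toy corpus, the window size of 2, and the function name `cooccurrence_vectors` are all assumptions for this example; real word space models are built from far larger corpora and usually apply weighting.

```python
from collections import Counter, defaultdict

# A toy corpus invented for illustration.
corpus = [
    "i bought a new laptop yesterday",
    "i bought a new computer yesterday",
    "she ate a roast potato",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


vectors = cooccurrence_vectors(corpus)
print(dict(vectors["laptop"]))
print(dict(vectors["computer"]))
print(dict(vectors["potato"]))
```

In this toy corpus, laptop and computer occur in identical contexts and so receive identical vectors, which is the distributional hypothesis in miniature.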
Beyond Words: Compositional Semantics
Given meanings of
couch
roast
potato
Can we interpret the meanings of
couch potato
roast potato
Couch Potato
Roast Potato
Bibliography I
Baroni, M., Kilgarriff, A., Pomikálek, J., and Rychlý, P. (2006). WebBootCaT: Instant domain-specific corpora to support human translators. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT), Norway.
Firth, J. R. (1957). A Synopsis of Linguistic Theory, 1930-1955. Studies in Linguistic Analysis, pages 1–32.
Harris, Z. S. (1954). Distributional structure. Word, 10:146–162.
Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta.
Landauer, T. K., Foltz, P. W., and Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25:259–284.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Commun. ACM, 18:613–620.
Bibliography II
Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1):97–123.
Turney, P. D. and Pantel, P. (2010). From frequency to meaning: vector space models of semantics. J. Artif. Int. Res., 37:141–188.