Topic Analysis in ARCOMEM Yahoo Research Barcelona
Arcomem training Topic Analysis Models advanced

Jun 18, 2015

Technology

arcomem

This presentation on Topic Analysis Models is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Transcript
Page 1: Arcomem training Topic Analysis Models advanced

Topic Analysis in ARCOMEM

Yahoo Research Barcelona

Page 2: Arcomem training Topic Analysis Models advanced

What is Probabilistic Topic Modelling?

Exploring and retrieving meaningful information from large collections of textual documents is a challenging task

Probabilistic topic models are a suite of algorithms (a framework) that aim to discover and annotate large archives of documents with thematic information.

They do not require any prior annotations or labeling of the documents.

Topics emerge from the statistical analysis of the original texts

Page 3: Arcomem training Topic Analysis Models advanced

Probabilistic Topic Model

Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over a fixed vocabulary.

A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated.

The idea is to study the co-occurrence of words, assuming that words that tend to co-occur frequently express, or belong to, the same semantic concept.

Example: A document (d) can be represented by the following mixture of topics:

Biology    Physics    Mathematics
0.6        0.3        0.1

In the topic “Biology”, words such as “DNA, genetic, evolution” have high probability.
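
The generative view can be illustrated with a minimal Python sketch. It reuses the 0.6 / 0.3 / 0.1 mixture from the example; the word lists and their probabilities inside each topic are hypothetical values chosen only for illustration, as are the names `sample` and `generate_document`.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution
# over a tiny vocabulary (the probabilities are made up for illustration).
TOPICS = {
    "Biology": {"dna": 0.4, "genetic": 0.3, "evolution": 0.3},
    "Physics": {"energy": 0.5, "quantum": 0.3, "particle": 0.2},
    "Mathematics": {"theorem": 0.5, "proof": 0.3, "geometric": 0.2},
}

# Topic mixture of document d, taken from the example.
DOC_MIXTURE = {"Biology": 0.6, "Physics": 0.3, "Mathematics": 0.1}


def sample(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    keys = list(distribution)
    weights = [distribution[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def generate_document(mixture, topics, n_words, rng):
    """Generate n_words: pick a topic from the mixture, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = sample(mixture, rng)
        words.append(sample(topics[topic], rng))
    return words


rng = random.Random(0)
print(generate_document(DOC_MIXTURE, TOPICS, 10, rng))
```

With this mixture, roughly 60% of the generated words should come from the "Biology" topic. Topic modelling runs this procedure in reverse: given only the words, it estimates the mixture and the topics.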

Page 4: Arcomem training Topic Analysis Models advanced

Intuition behind topic modelling

Documents exhibit multiple topics

Each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms

[Figure: example topic labels: Evolution Biology; Genetics; Statistical Analysis]

Page 5: Arcomem training Topic Analysis Models advanced

Generative process

We only observe the documents

Our goal is to infer the underlying topic structure

What are the topics?

How are the documents divided according to those topics?

Topic 1: ?

Topic 2: ?

time series, nonlinear, mathematics, geometric, dynamics

ecologist, population, species, natural, nature, human

Page 6: Arcomem training Topic Analysis Models advanced

Text Modeling

A word is the basic unit of discrete data, defined to be an item from a vocabulary indexed by {1, ..., V}.

A document is a sequence of N words denoted by w = (w_1, w_2, ..., w_N), where w_n is the n-th word in the sequence.

A corpus is a collection of M documents denoted by D = {w_1, w_2, ..., w_M}.

Bag-of-words assumption: the only information relevant to the model is the number of times each word occurs. We don’t consider word order!
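
A small Python sketch of the bag-of-words representation is given here. The two-document corpus and the helper name `bag_of_words` are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace, which is an assumption rather than the preprocessing used in ARCOMEM.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return Counter(text.lower().split())


# Hypothetical two-document corpus.
corpus = [
    "genetic evolution of species",
    "evolution of nonlinear dynamics",
]

# Fixed vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({w for doc in corpus for w in doc.lower().split()})
counts = [bag_of_words(doc) for doc in corpus]

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = [[c[w] for w in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Because only the counts are kept, "genetic evolution of species" and "species of evolution genetic" produce the same row.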

Page 7: Arcomem training Topic Analysis Models advanced

Latent Dirichlet Allocation
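
The slides do not state which inference algorithm is used for LDA. As a hedged illustration, the following Python sketch fits LDA with collapsed Gibbs sampling, one common choice. The function name `fit_lda`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for this example.

```python
import random


def fit_lda(docs, n_topics, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]        # tokens of doc d assigned to topic k
    topic_word = [[0] * V for _ in range(n_topics)]   # times word v is assigned to topic k
    topic_total = [0] * n_topics                      # tokens assigned to topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in row]
        for k, row in enumerate(topic_word)
    ]
    return vocab, theta, phi


# Hypothetical toy corpus built from the word lists shown earlier in the slides.
docs = [
    "time series nonlinear dynamics mathematics".split(),
    "geometric mathematics nonlinear dynamics".split(),
    "ecologist population species nature".split(),
    "species population natural nature human".split(),
]
vocab, theta, phi = fit_lda(docs, n_topics=2)
for k, row in enumerate(phi):
    top = sorted(zip(row, vocab), reverse=True)[:4]
    print(f"Topic {k}:", [w for _, w in top])
```

Here `theta` holds one topic mixture per document and `phi` one word distribution per topic. The topics come out unlabeled and their numbering is arbitrary, so interpreting them is left to the reader, as in the "Topic 1: ?" slide.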

Page 8: Arcomem training Topic Analysis Models advanced

The challenge is to identify, for each campaign, significant and important topics that are relevant to the two use cases: broadcasting and parliament libraries.

Topic analysis provides semantically useful categories that allow end-users to search and browse content archives.

Page 9: Arcomem training Topic Analysis Models advanced

Try out on SARA: Trending topics

Page 10: Arcomem training Topic Analysis Models advanced

Try out on SARA: Statistical Topic Models