Clustering Documents using the 3-Gram Graph Representation Model
John Violos, Konstantinos Tserpes, Athanasios Papaoikonomou, Magdalini Kardara, Theodora Varvarigou
National Technical University of Athens
SUPER: Social sensors for secUrity Assessments and Proactive EmeRgencies management
PCI 2014, 18th Panhellenic Conference in Informatics
3 / 10 / 2014
SUPER
Hurricane Sandy, 2012:
• 20 million tweets
• 10 pictures per second on Instagram
Virginia, U.S., 2011, 5.8 on the Richter scale:
• 40,000 tweets in the first minute
Topic Communities
Detect topic communities in social networks.
• Texts of users
• Social graph
• Actions (likes, follows)
Text Clustering
Users write texts about:
• Interests
• Habits
• Events in their lives
Clustering texts into topics => clustering their writers into topic communities.
LDA (Latent Dirichlet Allocation)
What is its weakness? It is a bag-of-words model.
Sequence of Words
The sequence of words is valuable information.
Furthermore, derivatives of words are similar words.
We need a representation model that:
• Keeps the information of the word sequence.
• Captures the similarity between derivatives of words.
A good solution is N-gram graphs!
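The two requirements above can be illustrated with a small Python sketch. It is not taken from the slides: the helper names, the example strings, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration. It shows that a bag of words cannot tell two differently ordered sentences apart, while character 3-grams give derivatives of the same word a high overlap.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order."""
    return Counter(text.lower().split())

def char_ngrams(text, n=3):
    """Return the set of contiguous character n-grams of a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard overlap of two sets (assumed measure, not from the slides)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Word order is lost: both sentences yield the same bag of words.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Derivatives of a word share many character 3-grams.
related = jaccard(char_ngrams("cluster"), char_ngrams("clustering"))
# Unrelated words share few or none.
unrelated = jaccard(char_ngrams("cluster"), char_ngrams("graph"))
```

Here `same_bag` is true even though the two sentences mean different things, and `related` is much larger than `unrelated`.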
Overview
Basic Steps
Input: corpus of texts, number of clusters k.
1. Build an N-gram graph that represents the corpus.
2. Build an N-gram graph that represents each text.
3. Partition the corpus graph into k subgraphs.
4. Compare each text with all partitions.
5. Allocate each text to the cluster with the highest comparison result.
Output: k clusters with the texts they include.
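A rough Python sketch of steps 2, 4 and 5 is given here. The slides do not specify how edges are formed, how the corpus graph is partitioned, or how a text is compared with a partition, so everything below is an assumption: edges link each character 3-gram to the 3-grams starting within a hypothetical `window` of positions after it, the comparison is a simple edge-overlap ratio, and the partitioning of step 3 is not reproduced. Two hand-made edge sets built from hypothetical seed strings stand in for the k subgraphs.

```python
from collections import Counter

def ngram_graph(text, n=3, window=3):
    """Build an n-gram graph as a Counter of weighted directed edges.

    Nodes are character n-grams. Each n-gram is linked to the n-grams that
    start within `window` positions after it (an assumed neighbourhood rule).
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges[(g, h)] += 1
    return edges

def edge_overlap(text_graph, partition_edges):
    """Fraction of the text graph's distinct edges found in a partition."""
    if not text_graph:
        return 0.0
    shared = sum(1 for e in text_graph if e in partition_edges)
    return shared / len(text_graph)

def allocate(texts, partitions, n=3, window=3):
    """Steps 2, 4, 5: give each text the index of its best-matching partition."""
    labels = []
    for text in texts:
        g = ngram_graph(text, n, window)                      # step 2
        scores = [edge_overlap(g, p) for p in partitions]     # step 4
        best = max(range(len(partitions)), key=scores.__getitem__)
        labels.append(best)                                   # step 5
    return labels

# Placeholder for step 3: hypothetical partitions, NOT a real graph partition.
partitions = [
    set(ngram_graph("earthquake tremor magnitude")),
    set(ngram_graph("hurricane storm flooding")),
]
labels = allocate(["strong earthquake tonight", "storm flooding downtown"],
                  partitions)
```

With these toy inputs the first text lands in partition 0 and the second in partition 1. In a real run the partitions would come from splitting the corpus graph into k subgraphs, as step 3 describes.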
What are N-Grams?
An N-gram is a contiguous sequence of N items from a given sequence of text.
The items can be phonemes, syllables, letters, or words.
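As a minimal illustration of this definition, the Python sketch below extracts N-grams from any sequence of items. The function name `ngrams` and the sample inputs are hypothetical; the same function covers letters (a string) and words (a list of tokens).

```python
def ngrams(items, n):
    """Return every contiguous run of n items, in order of appearance."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Letter 3-grams of a single word.
letters = ngrams("graph", 3)

# Word 2-grams of a short phrase.
words = ngrams("the quick brown fox".split(), 2)
```

A sequence shorter than N yields an empty list, since no contiguous run of N items exists.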