LINGO Sandra Gama Search Results Clustering
Page 1: LINGO Sandra Gama. Internet  endless document collection.

LINGO

Sandra Gama

Search Results Clustering

Page 2:

Internet: endless document collection

Page 3:
Page 4:

Search Engines

Page 5:

NO question answering

Page 6:

FAST access to Web content

Page 7:

SENSITIVE to query quality

Page 8:

we NEED meaningful RESULTS

Page 9:

CLUSTERING!

Page 10:

GROUPING by Similarity

Page 11:

Semantic structure

Page 12:

Groups

Page 13:

Description

Page 14:

Luxury Car

Feline, panther family

Page 15:

Description QUALITY

Page 16:

How to cluster?

Page 17:

LINGO: a new approach

Page 18:

user query
  → Pre-processing → filtered docs
  → Phrase extraction → frequent phrases
  → Cluster-label induction → cluster labels
  → Cluster-content allocation → clustered documents

Page 19:

STAGE 1/4: PREPROCESSING

Page 20:

STAGE 1/4: PREPROCESSING

1. Text segmentation

2. Stemming

3. Ignore stop words
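
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, `naive_stem` is a crude suffix stripper standing in for a real stemmer (for example Porter's), and stop words are flagged rather than deleted so that a later step can still see where they were.

```python
import re

# Tiny illustrative stop-word list (hypothetical); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def segment(text):
    """1. Text segmentation: split into sentences, then into lowercase word tokens."""
    sentences = re.split(r"[.!?;:]+", text)
    return [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences if s.strip()]

def naive_stem(word):
    """2. Stemming: crude suffix stripping, NOT a real stemming algorithm."""
    for suffix in ("ations", "ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """3. Stop words: return (stem, is_stop_word) pairs, one list per sentence."""
    return [[(naive_stem(w), w in STOP_WORDS) for w in sentence]
            for sentence in segment(text)]

print(preprocess("Software for the sparse singular value decomposition."))
```

Flagging stop words instead of dropping them is a design choice of this sketch, not something the slides state.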

Page 21:

STAGE 2/4: PHRASE EXTRACTION

Page 22:
Page 23:
Page 24:

Goal

Page 25:

1/4 More than N occurrences

Page 26:

2/4 No more than 1 sentence

Page 27:

3/4 Complete phrase

Page 28:

4/4 Stop words
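
The four criteria can be sketched as a filter over word n-grams. This is a brute-force illustration, not the suffix-array method the slides go on to describe. Two readings are assumptions: "Stop words" is taken to mean that a phrase may neither begin nor end with a stop word, and "Complete phrase" is taken to mean that a phrase is kept only if it is not always extended by the same neighbouring word. The names `frequent_phrases`, `n_occurrences` and `max_len` are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list (hypothetical).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def frequent_phrases(sentences, n_occurrences=1, max_len=4):
    """sentences: list of token lists. Returns {phrase (tuple of words): count}."""
    counts = Counter()
    for tokens in sentences:
        # 2/4: n-grams are taken inside one sentence, so none crosses a boundary.
        # One extra length is counted so the completeness test can see extensions.
        for n in range(2, max_len + 2):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    def is_complete(phrase):
        # 3/4: incomplete if some one-word-longer n-gram containing the phrase
        # occurs equally often (every occurrence is extended by the same word).
        return not any(
            len(longer) == len(phrase) + 1
            and count == counts[phrase]
            and (longer[:-1] == phrase or longer[1:] == phrase)
            for longer, count in counts.items()
        )

    return {
        phrase: count
        for phrase, count in counts.items()
        if len(phrase) <= max_len
        and count > n_occurrences            # 1/4: more than N occurrences
        and phrase[0] not in STOP_WORDS      # 4/4: no stop word at either end
        and phrase[-1] not in STOP_WORDS
        and is_complete(phrase)
    }

titles = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
sentences = [t.lower().replace("-", " ").split() for t in titles]
print(frequent_phrases(sentences))
# {('singular', 'value'): 3, ('information', 'retrieval'): 2}
```

On the seven document titles used later in the deck, the sketch returns exactly the two phrases listed there as P1 and P2.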

Page 29:

How it works

Page 30:
Page 31:

1 2 3 4 5 6 7 8 9 10 11

a b r a c a d a b r a

How many non-empty suffixes?

abracadabra

bracadabra

racadabra

acadabra

cadabra

adabra

dabra

abra

bra

ra

a

11 suffixes

Page 32:

1 2 3 4 5 6 7 8 9 10 11 12

a b r a c a d a b r a $

Suffix        Index
abracadabra   1
bracadabra    2
racadabra     3
acadabra      4
cadabra       5
adabra        6
dabra         7
abra          8
bra           9
ra            10
a             11

Sorted Suffix Index
a             11
abra          8
abracadabra   1
acadabra      4
adabra        6
bra           9
bracadabra    2
cadabra       5
dabra         7
ra            10
racadabra     3

Page 33:

Sorted Suffix Index
a             11
abra          8
abracadabra   1
acadabra      4
adabra        6
bra           9
bracadabra    2
cadabra       5
dabra         7
ra            10
racadabra     3

Suffix array: 11 8 1 4 6 9 2 5 7 10 3
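
The suffix array above can be reproduced in a few lines. This is a naive sketch that sorts the suffixes directly, which is fine for a slide-sized example but not how an efficient implementation would build the array. The helper `longest_repeated_substring` is a hypothetical addition showing why the sorted order is useful: repeated substrings end up as common prefixes of neighbouring suffixes. A clustering system would presumably apply the same idea to sequences of words rather than characters.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def longest_repeated_substring(text):
    """Longest substring occurring at least twice, found via neighbouring suffixes."""
    sa = suffix_array(text)
    best = ""
    for a, b in zip(sa, sa[1:]):
        s, t = text[a - 1:], text[b - 1:]
        n = 0
        while n < min(len(s), len(t)) and s[n] == t[n]:
            n += 1                       # length of the common prefix
        if n > len(best):
            best = s[:n]
    return best

print(suffix_array("abracadabra"))                # [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(longest_repeated_substring("abracadabra"))  # abra
```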

Page 34:

STAGE 3/4: CLUSTER-LABEL INDUCTION

Page 35:
Page 36:

Singular Value Decomposition

Page 37:

A: term × document matrix

Find matrices U, Σ, V such that $A = U \Sigma V^{T}$

Page 38:

D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization

T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

P1: Singular value
P2: Information retrieval

Page 39:

Term × document matrix for documents D1 to D7 and terms T1 to T5:

                  D1    D2    D3    D4    D5    D6    D7
T1: Information   0.00  0.00  0.56  0.56  0.00  0.00  1.00
T2: Singular      0.49  0.71  0.00  0.00  0.00  0.71  0.00
T3: Value         0.49  0.71  0.00  0.00  0.00  0.71  0.00
T4: Computations  0.72  0.00  0.00  0.00  1.00  0.00  0.00
T5: Retrieval     0.00  0.00  0.83  0.83  0.00  0.00  0.00
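
The slide does not say how the matrix entries were computed. As an assumption, the sketch below uses tf-idf weights with idf = log(N / document frequency) and scales every document column to unit length; with those choices it reproduces the values shown above to two decimals. The function name `term_document_matrix` is hypothetical.

```python
import math
import re

DOCS = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
TERMS = ["information", "singular", "value", "computations", "retrieval"]

def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents; assumes every term occurs somewhere."""
    tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n = len(docs)
    # idf = log(N / number of documents containing the term)
    idf = {t: math.log(n / sum(t in doc for doc in tokenised)) for t in terms}
    columns = []
    for doc in tokenised:
        column = [doc.count(t) * idf[t] for t in terms]      # tf * idf
        norm = math.sqrt(sum(x * x for x in column)) or 1.0  # unit-length column
        columns.append([x / norm for x in column])
    return [list(row) for row in zip(*columns)]              # transpose

A = term_document_matrix(DOCS, TERMS)
for term, row in zip(TERMS, A):
    print(f"{term:<13}" + " ".join(f"{x:.2f}" for x in row))
```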

Page 40:

Abstract concept matrix (SVD)

U =
 0.00  0.75  0.00 -0.66  0.00
 0.65  0.00 -0.28  0.00 -0.71
 0.65  0.00 -0.28  0.00  0.71
 0.39  0.00  0.92  0.00  0.00
 0.00  0.66  0.00  0.75  0.00
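
The columns of U are the left singular vectors of the term × document matrix. As a dependency-free sketch, the code below recovers the first two of them by power iteration with deflation on A·Aᵀ. This is a swapped-in textbook technique chosen so the example runs without libraries; in practice a library SVD routine would normally be used. The matrix values are the rounded ones from the slides, so results agree with U only to about two decimals. The function name is hypothetical.

```python
import math

A = [  # rows T1..T5, columns D1..D7 (rounded values from the slides)
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]

def top_left_singular_vectors(a, k, iterations=500):
    """First k left singular vectors and singular values of a, by power iteration."""
    m = len(a)
    # b = a * a^T; its eigenvectors are the left singular vectors of a
    b = [[sum(x * y for x, y in zip(a[i], a[j])) for j in range(m)] for i in range(m)]
    vectors, values = [], []
    for _ in range(k):
        v = [1.0 / math.sqrt(m)] * m
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(m)) for i in range(m))
        vectors.append(v)
        values.append(math.sqrt(eigenvalue))   # singular value = sqrt(eigenvalue)
        # deflation: remove the component just found before looking for the next one
        b = [[b[i][j] - eigenvalue * v[i] * v[j] for j in range(m)] for i in range(m)]
    return vectors, values

vectors, singular_values = top_left_singular_vectors(A, 2)
for v in vectors:
    print(" ".join(f"{abs(x):.2f}" for x in v))
```

The two printed vectors should be close to the first two columns of U; absolute values are printed only to avoid signed zeros in the output.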

Page 41:

P =
                  P1    P2    T1    T2    T3    T4    T5
T1: Information   0.00  0.56  1.00  0.00  0.00  0.00  0.00
T2: Singular      0.71  0.00  0.00  1.00  0.00  0.00  0.00
T3: Value         0.71  0.00  0.00  0.00  1.00  0.00  0.00
T4: Computations  0.00  0.00  0.00  0.00  0.00  1.00  0.00
T5: Retrieval     0.00  0.83  0.00  0.00  0.00  0.00  1.00

Columns: P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval

Page 42:

M matrix: $M = U_k^{T} P$

                    P1    P2    T1    T2    T3    T4    T5
Abstract concept 1  0.92  0.00  0.00  0.65  0.65  0.39  0.00
Abstract concept 2  0.00  0.97  0.75  0.00  0.00  0.00  0.66

Rows: abstract concepts
Columns (phrases/single words): P1: Singular value, P2: Information retrieval, T1: Information, T2: Singular, T3: Value, T4: Computations, T5: Retrieval
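
The product can be checked by hand or with the short sketch below, which multiplies the first k = 2 columns of U (transposed) by P, using the rounded values from the slides, and then picks the highest-scoring phrase or word for each abstract concept. Treating that maximum as the concept's label is how this sketch reads the slide; the function name `concept_label_matrix` is hypothetical.

```python
LABELS = ["P1: Singular value", "P2: Information retrieval", "T1: Information",
          "T2: Singular", "T3: Value", "T4: Computations", "T5: Retrieval"]

U_K = [  # first k = 2 columns of U; rows T1..T5
    [0.00, 0.75],
    [0.65, 0.00],
    [0.65, 0.00],
    [0.39, 0.00],
    [0.00, 0.66],
]

P = [  # rows T1..T5; columns P1, P2, T1..T5
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]

def concept_label_matrix(u_k, p):
    """M = U_k^T P: one row per abstract concept, one column per candidate label."""
    k, columns, terms = len(u_k[0]), len(p[0]), len(p)
    return [[sum(u_k[t][c] * p[t][j] for t in range(terms)) for j in range(columns)]
            for c in range(k)]

M = concept_label_matrix(U_K, P)
for row in M:
    best = max(range(len(row)), key=row.__getitem__)
    print(LABELS[best], round(row[best], 2))
# P1: Singular value 0.92
# P2: Information retrieval 0.97
```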

Page 43:

Last step

Page 44:

Prune overlapping label descriptions

$Z^{T} Z$
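
One way to read this slide: if Z holds one unit-length column per candidate label, then the entries of ZᵀZ are the pairwise cosine similarities between labels, and labels that are too similar to a better-scoring one can be dropped. The sketch below follows that reading with a simple greedy rule. The threshold of 0.5, the function name `prune_labels`, and the four-candidate example (label vectors and scores borrowed from the worked example in the slides) are assumptions made for illustration.

```python
import math

def prune_labels(candidates, threshold=0.5):
    """candidates: {label: (score, vector in term space)}.
    Keeps a label only if it is not too similar to a better-scoring kept label."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    names = sorted(candidates, key=lambda name: -candidates[name][0])  # best first
    z = {name: unit(candidates[name][1]) for name in names}            # columns of Z
    kept = []
    for name in names:
        # each dot product below is one entry of Z^T Z
        if all(sum(a * b for a, b in zip(z[name], z[other])) <= threshold
               for other in kept):
            kept.append(name)
    return kept

candidates = {  # hypothetical candidate set; vectors are over T1..T5
    "Information retrieval": (0.97, [0.56, 0.00, 0.00, 0.00, 0.83]),
    "Singular value":        (0.92, [0.00, 0.71, 0.71, 0.00, 0.00]),
    "Information":           (0.75, [1.00, 0.00, 0.00, 0.00, 0.00]),
    "Singular":              (0.65, [0.00, 1.00, 0.00, 0.00, 0.00]),
}
print(prune_labels(candidates))  # ['Information retrieval', 'Singular value']
```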

Page 45:

STAGE 4/4: CLUSTER-CONTENT ALLOCATION

Page 46:

Similarity

Page 47:

Cluster Score
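
The slides give only the two keywords, so the following is a sketch under assumptions: each cluster label is used as a query vector, a document joins a cluster when its similarity to the label exceeds a threshold, unassigned documents go to an "Other" group, and the cluster score is the label score multiplied by the number of member documents. The threshold of 0.5 and the function name `allocate` are hypothetical; the matrix, label vectors and label scores are the rounded values from the worked example in the slides.

```python
DOC_IDS = ["D1", "D2", "D3", "D4", "D5", "D6", "D7"]

A = [  # term x document matrix; rows T1..T5, columns D1..D7
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]

LABELS = {  # label: (label score, label vector over T1..T5)
    "Singular value":        (0.92, [0.00, 0.71, 0.71, 0.00, 0.00]),
    "Information retrieval": (0.97, [0.56, 0.00, 0.00, 0.00, 0.83]),
}

def allocate(a, labels, doc_ids, threshold=0.5):
    """Assign documents to labels by similarity and score each cluster."""
    clusters = {}
    for name, (label_score, q) in labels.items():
        # similarity of the label, used as a query, to every document column
        sims = [sum(q[t] * a[t][d] for t in range(len(q))) for d in range(len(doc_ids))]
        members = [doc_ids[d] for d, s in enumerate(sims) if s > threshold]
        clusters[name] = {"docs": members, "score": label_score * len(members)}
    assigned = {d for c in clusters.values() for d in c["docs"]}
    clusters["Other"] = {"docs": [d for d in doc_ids if d not in assigned], "score": 0.0}
    return clusters

for name, cluster in allocate(A, LABELS, DOC_IDS).items():
    print(name, cluster["docs"], round(cluster["score"], 2))
```

With these inputs D1, D2 and D6 fall under "Singular value", D3, D4 and D7 under "Information retrieval", and D5 is left in "Other".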

Page 48:

Evaluation and Results

Page 49:

Test Data

10 categories

4 subjects

Page 50:

Subject           # docs  Contents
Movies            77      Information about the Blade Runner movie
Movies            92      Information about the Lord of the Rings movie
Health Care       77      Orthopedic equipment and manufacturers
Photography       15      Infrared-photography references
Computer Science  27      Articles about data warehouses (integrator DBs)
Computer Science  42      MySQL database
Computer Science  15      Native XML databases
Computer Science  38      PostgreSQL database
Computer Science  39      Java programming language tutorials and guides
Computer Science  37      VI text editor

Page 51:

Identifier Merged Categories

G1 LRings, MySQL

G3 LRings, MySQL, Ortho, Infra

G5 MySQL, XMLDB, Dware, Postgr, JavaTut, Vi

G6 MySQL, XMLDB, Dware, Postgr, Ortho

Page 52:

Identifier Merged Categories

G1 Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links

G3 MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie

G5 Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database

G6 MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web

Page 53:

Analytical evaluation: Cluster Contamination

Page 54:

LINGO vs. Suffix Tree Clustering

Page 55:
Page 56:
Page 57:

CONCLUSIONS

Page 58:

Future work

Page 59:

Pointer

Page 60:

Communication!

Page 61:

LINGO

Thank you.

Search Results Clustering