LINGO: Search Results Clustering (Sandra Gama)

Feb 25, 2016

Transcript
Page 1: LINGO

LINGO

Sandra Gama

Search Results Clustering

Page 2: LINGO

Internet: an endless document collection

Page 3: LINGO
Page 4: LINGO

Search Engines

Page 5: LINGO

NO question answering

Page 6: LINGO

FAST access to Web content

Page 7: LINGO

SENSITIVE to query quality

Page 8: LINGO

we NEED meaningful RESULTS

Page 9: LINGO

CLUSTERING!

Page 10: LINGO

GROUPING by Similarity

Page 11: LINGO

Semantic structure

Page 12: LINGO

Groups

Page 13: LINGO

Description

Page 14: LINGO

Luxury Car

Feline, panther family

Page 15: LINGO

Description QUALITY

Page 16: LINGO

How to cluster?

Page 17: LINGO

LINGO: a new approach

Page 18: LINGO

Pipeline: user query -> Pre-processing -> filtered docs -> Phrase extraction -> frequent phrases -> Cluster-Label Induction -> cluster labels -> Cluster-content allocation -> clustered documents

Page 19: LINGO

STAGE 1/4: PREPROCESSING


Page 20: LINGO

STAGE 1/4: PREPROCESSING

1. Text segmentation

2. Stemming

3. Ignore stop words
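
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up list, `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's, and stop words are marked rather than deleted so that later stages can decide how to ignore them.

```python
import re

# Illustrative subset (assumption); a real system would load a full stop-word list.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def crude_stem(word):
    """Strip a few English suffixes (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ations", "ation", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(snippet):
    """Return, per sentence, a list of (stem, is_stop_word) pairs."""
    # 1. Text segmentation: split into sentences, then into lowercase tokens.
    sentences = [s for s in re.split(r"[.!?]+", snippet) if s.strip()]
    result = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
        # 2. Stemming and 3. stop-word marking, so later stages can skip them.
        result.append([(crude_stem(t), t in STOP_WORDS) for t in tokens])
    return result


processed = preprocess(
    "Software for the sparse singular value decomposition. Matrix computations!"
)
print(processed[1])  # [('matrix', False), ('comput', False)]
```

Keeping sentence boundaries in the output matters for the next stage, where phrases are not allowed to span more than one sentence.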

Page 21: LINGO

STAGE 2/4: PHRASE EXTRACTION


Page 22: LINGO
Page 23: LINGO
Page 24: LINGO

Goal

Page 25: LINGO

1/4 More than N occurrences

Page 26: LINGO

2/4 No more than 1 sentence

Page 27: LINGO

3/4 Complete phrase

Page 28: LINGO

4/4 Stop words
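
A minimal sketch of the four criteria, assuming one reading of the slide titles: "Stop words" is taken to mean that a phrase may neither begin nor end with a stop word, and completeness is approximated by rejecting a phrase whenever a one-word extension of it occurs exactly as often. The function name, the `threshold` and `max_len` parameters, and the stop-word list are all hypothetical. Plain n-gram counting is used here for brevity in place of the suffix-array approach shown on the following slides.

```python
from collections import Counter

# Illustrative subset (assumption).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def frequent_phrases(sentences, threshold=1, max_len=3):
    """sentences: list of token lists, one list per sentence."""
    counts = Counter()
    for tokens in sentences:
        # Criterion 2: n-grams are taken within one sentence only.
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    def is_complete(phrase):
        # Criterion 3 (approximation): no one-word extension, to the left or
        # right, occurs as often as the phrase itself. Phrases of length
        # max_len are never extended, so they always pass.
        return not any(
            len(other) == len(phrase) + 1
            and count == counts[phrase]
            and (other[1:] == phrase or other[:-1] == phrase)
            for other, count in counts.items()
        )

    return sorted(
        " ".join(phrase)
        for phrase, count in counts.items()
        if count > threshold                      # criterion 1
        and phrase[0] not in STOP_WORDS           # criterion 4
        and phrase[-1] not in STOP_WORDS
        and is_complete(phrase)
    )


titles = [
    "large-scale singular value computations",
    "software for the sparse singular value decomposition",
    "introduction to modern information retrieval",
    "linear algebra for intelligent information retrieval",
]
phrases = frequent_phrases([t.split() for t in titles])
print(phrases)  # ['information retrieval', 'singular value']
```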

Page 29: LINGO

How it works

Page 30: LINGO
Page 31: LINGO

How many non-empty suffixes?

Position:   1  2  3  4  5  6  7  8  9 10 11
Character:  a  b  r  a  c  a  d  a  b  r  a

abracadabra

bracadabra

racadabra

acadabra

cadabra

adabra

dabra

abra

bra

ra

a

11 suffixes

Page 32: LINGO

Position:   1  2  3  4  5  6  7  8  9 10 11 12
Character:  a  b  r  a  c  a  d  a  b  r  a  $

Suffix        Index
abracadabra   1
bracadabra    2
racadabra     3
acadabra      4
cadabra       5
adabra        6
dabra         7
abra          8
bra           9
ra            10
a             11

Sorted Suffix  Index
a              11
abra           8
abracadabra    1
acadabra       4
adabra         6
bra            9
bracadabra     2
cadabra        5
dabra          7
ra             10
racadabra      3

Page 33: LINGO

Sorted Suffix  Index
a              11
abra           8
abracadabra    1
acadabra       4
adabra         6
bra            9
bracadabra     2
cadabra        5
dabra          7
ra             10
racadabra      3

Suffix array: 11 8 1 4 6 9 2 5 7 10 3
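
The suffix array for "abracadabra" can be reproduced with a naive sketch that simply sorts the start positions by the suffix beginning at each one. This is not the efficient construction a real implementation would use, and the function name is an assumption for illustration.

```python
def suffix_array(text):
    """Return the 1-based start positions of all suffixes, in sorted suffix order."""
    # Naive construction: compare whole suffixes while sorting.
    # Worst case O(n^2 log n), which is fine for a short demo string.
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


print(suffix_array("abracadabra"))
# [11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
```

Because equal prefixes end up next to each other in the sorted order, repeated substrings (and, at word level, repeated phrases) can be found by scanning neighbouring entries.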

Page 34: LINGO

STAGE 3/4: CLUSTER-LABEL INDUCTION


Page 35: LINGO
Page 36: LINGO

Singular Value Decomposition

Page 37: LINGO

A: term × document matrix

Find matrices $U$, $\Sigma$, $V$ such that $A = U \Sigma V^T$

Page 38: LINGO

D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization

T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

P1: Singular value
P2: Information retrieval

Page 39: LINGO

D1: Large-scale singular value computations
D2: Software for the sparse singular value decomposition
D3: Introduction to modern information retrieval
D4: Linear algebra for intelligent information retrieval
D5: Matrix computations
D6: Singular value cryptogram analysis
D7: Automatic information organization

                   D1    D2    D3    D4    D5    D6    D7
T1: Information    0.00  0.00  0.56  0.56  0.00  0.00  1.00
T2: Singular       0.49  0.71  0.00  0.00  0.00  0.71  0.00
T3: Value          0.49  0.71  0.00  0.00  0.00  0.71  0.00
T4: Computations   0.72  0.00  0.00  0.00  1.00  0.00  0.00
T5: Retrieval      0.00  0.00  0.83  0.83  0.00  0.00  0.00
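
The slide does not say how the entries are weighted. The numbers are consistent with tf-idf weights followed by normalising each document column to unit length, so the sketch below assumes exactly that. Function and variable names are illustrative, and terms are matched as whole lowercase words without stemming.

```python
import math

DOCS = [
    "Large-scale singular value computations",
    "Software for the sparse singular value decomposition",
    "Introduction to modern information retrieval",
    "Linear algebra for intelligent information retrieval",
    "Matrix computations",
    "Singular value cryptogram analysis",
    "Automatic information organization",
]
TERMS = ["information", "singular", "value", "computations", "retrieval"]


def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents (assumed tf-idf, unit-length columns)."""
    tokenized = [doc.lower().split() for doc in docs]
    # idf = log(number of documents / number of documents containing the term).
    # Every term is assumed to occur in at least one document.
    idf = [
        math.log(len(docs) / sum(term in tokens for tokens in tokenized))
        for term in terms
    ]
    columns = []
    for tokens in tokenized:
        column = [tokens.count(term) * weight for term, weight in zip(terms, idf)]
        norm = math.sqrt(sum(x * x for x in column))
        columns.append([x / norm if norm else 0.0 for x in column])
    # Transpose: one row per term, one column per document.
    return [list(row) for row in zip(*columns)]


matrix = term_document_matrix(DOCS, TERMS)
for row in matrix:
    print(" ".join(f"{x:.2f}" for x in row))
```

Rounded to two decimals, the result matches the matrix on the slide, which supports the weighting assumption but does not prove it is what the author used.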

Page 40: LINGO

Abstract concept matrix (SVD)

U =
  0.00   0.75   0.00  -0.66   0.00
  0.65   0.00  -0.28   0.00  -0.71
  0.65   0.00  -0.28   0.00   0.71
  0.39   0.00   0.92   0.00   0.00
  0.00   0.66   0.00   0.75   0.00
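
To show where the leading columns of U come from without depending on a linear-algebra library, the sketch below uses power iteration with deflation on $A A^T$, whose eigenvectors are the left singular vectors of $A$. This is a swapped-in teaching technique, not the method the slides describe; in practice a library SVD routine would be used. The matrix is the rounded term × document matrix from the deck, and all names are illustrative.

```python
import math

# Rounded term x document matrix from the slides (rows T1..T5, columns D1..D7).
A = [
    [0.00, 0.00, 0.56, 0.56, 0.00, 0.00, 1.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.49, 0.71, 0.00, 0.00, 0.00, 0.71, 0.00],
    [0.72, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.83, 0.83, 0.00, 0.00, 0.00],
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_left_singular_vectors(a, k, iterations=500):
    """Return the k leading left singular vectors of `a` and their singular values."""
    gram = [[dot(r1, r2) for r2 in a] for r1 in a]  # A A^T
    vectors, singular_values = [], []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(len(gram))]  # arbitrary non-zero start
        for _step in range(iterations):
            w = [dot(row, v) for row in gram]
            # Deflation: remove the directions that were already found.
            for u in vectors:
                projection = dot(w, u)
                w = [x - projection * y for x, y in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in gram])
        singular_values.append(math.sqrt(eigenvalue))
        vectors.append(v)
    return vectors, singular_values


vectors, singular_values = top_left_singular_vectors(A, 2)
for vector in vectors:
    print([round(x, 2) for x in vector])
# [0.0, 0.65, 0.65, 0.39, 0.0]
# [0.75, 0.0, 0.0, 0.0, 0.66]
```

The two vectors agree with the first two columns of U above. Singular vectors are only defined up to sign, so a library routine may return either of them negated.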

Page 41: LINGO

P =
       P1    P2    T1    T2    T3    T4    T5
T1     0.00  0.56  1.00  0.00  0.00  0.00  0.00
T2     0.71  0.00  0.00  1.00  0.00  0.00  0.00
T3     0.71  0.00  0.00  0.00  1.00  0.00  0.00
T4     0.00  0.00  0.00  0.00  0.00  1.00  0.00
T5     0.00  0.83  0.00  0.00  0.00  0.00  1.00

Columns (phrases and single words):
P1: Singular value
P2: Information retrieval
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

Rows (terms):
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval

Page 42: LINGO

$M = U_k^T P$

Rows: abstract concepts. Columns: phrases/single words.

                     P1    P2    T1    T2    T3    T4    T5
Abstract concept 1   0.92  0.00  0.00  0.65  0.65  0.39  0.00
Abstract concept 2   0.00  0.97  0.75  0.00  0.00  0.00  0.66

P1: Singular value
P2: Information retrieval
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
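
The product $M = U_k^T P$ can be checked by hand or with the short sketch below, which uses the rounded values of U (first k = 2 columns) and P from the slides. Picking the highest-scoring column in each row as that concept's label is an assumption about how M is used; the slide itself only shows the matrix. All names are illustrative.

```python
CANDIDATES = [
    "Singular value", "Information retrieval",
    "Information", "Singular", "Value", "Computations", "Retrieval",
]

# First two columns of U (rows T1..T5), rounded as on the slides.
U_K = [
    [0.00, 0.75],
    [0.65, 0.00],
    [0.65, 0.00],
    [0.39, 0.00],
    [0.00, 0.66],
]

# Phrase matrix P (rows T1..T5, columns in CANDIDATES order).
P = [
    [0.00, 0.56, 1.00, 0.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],
    [0.71, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00],
    [0.00, 0.83, 0.00, 0.00, 0.00, 0.00, 1.00],
]


def concept_phrase_scores(u_k, p):
    """Compute M = U_k^T P as a list of rows, one row per abstract concept."""
    n_terms, n_concepts, n_candidates = len(p), len(u_k[0]), len(p[0])
    return [
        [sum(u_k[t][c] * p[t][j] for t in range(n_terms)) for j in range(n_candidates)]
        for c in range(n_concepts)
    ]


M = concept_phrase_scores(U_K, P)
for row in M:
    print([round(x, 2) for x in row])
# [0.92, 0.0, 0.0, 0.65, 0.65, 0.39, 0.0]
# [0.0, 0.97, 0.75, 0.0, 0.0, 0.0, 0.66]

# Assumed selection rule: the best-matching candidate labels each concept.
labels = [CANDIDATES[max(range(len(row)), key=row.__getitem__)] for row in M]
print(labels)  # ['Singular value', 'Information retrieval']
```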

Page 43: LINGO

Last step

Page 44: LINGO

Prune overlapping label descriptions

$Z^T Z$

Page 45: LINGO

STAGE 4/4: CLUSTER-CONTENT ALLOCATION


Page 46: LINGO

Similarity

Page 47: LINGO

Cluster Score

Page 48: LINGO

Evaluation and Results

Page 49: LINGO

Test Data

10 categories

4 subjects

Page 50: LINGO

Subject            # docs   Contents
Movies             77       Information about the Blade Runner movie
Movies             92       Information about the Lord of the Rings movie
Health Care        77       Orthopedic equipment and manufacturers
Photography        15       Infrared-photography references
Computer Science   27       Articles about data warehouses (integrator DBs)
Computer Science   42       MySQL database
Computer Science   15       Native XML databases
Computer Science   38       PostgreSQL database
Computer Science   39       Java programming language tutorials and guides
Computer Science   37       VI text editor

Page 51: LINGO

Identifier   Merged Categories
G1           LRings, MySQL
G3           LRings, MySQL, Ortho, Infra
G5           MySQL, XMLDB, Dware, Postgr, JavaTut, Vi
G6           MySQL, XMLDB, Dware, Postgr, Ortho

Page 52: LINGO

Identifier   Merged Categories
G1           Fan fiction/fan art, image galleries, MySQL, wallpapers, LOTR humour, links
G3           MySQL, news, information on infrared, image galleries, foot orthotics, Lord of the Rings, movie
G5           Java tutorial, Vim page, federated data warehouse, native XML database, Web, Postgresql database
G6           MySQL database, federated data warehouse, foot orthotics, orthopedic products, access Postgresql, Web

Page 53: LINGO

Analytical evaluation: Cluster Contamination

Page 54: LINGO

LINGO vs. Suffix Tree Clustering

Page 55: LINGO
Page 56: LINGO
Page 57: LINGO

CONCLUSIONS

Page 58: LINGO

Future work

Page 59: LINGO

Pointer

Page 60: LINGO

Communication!

Page 61: LINGO

LINGO

Thank you.

Search Results Clustering