Document Maps
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains.
[Figure: the documents "My dog likes this food" and "When walking, I take some food" drawn as vectors on the term axes dog, food and walk]
Document model in search engines
The relevance of a document to a query, or to another document, is measured as the cosine of the angle between the two vectors.
[Figure: the query "walk" drawn as a vector on the term axes dog, food and walk]
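The cosine measure can be illustrated with the two example documents and the query "walk". This is a minimal sketch, not the authors' implementation: the helper names `to_vector` and `cosine` are hypothetical, term weights are plain counts, and prefix matching is a crude stand-in for stemming so that "walking" counts towards "walk".

```python
import math

def to_vector(text, vocabulary):
    """Count, for each vocabulary term, the words that start with it.

    Prefix matching is an assumption made here in place of real stemming.
    """
    words = text.lower().replace(",", " ").split()
    return [sum(1 for w in words if w.startswith(term)) for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vocabulary = ["dog", "food", "walk"]
doc1 = to_vector("My dog likes this food", vocabulary)
doc2 = to_vector("When walking, I take some food", vocabulary)
query = to_vector("walk", vocabulary)

sim1 = cosine(doc1, query)  # no shared term with the query
sim2 = cosine(doc2, query)  # shares the term "walk"
```

Under these assumptions the second document is ranked above the first for the query "walk", because only it has a non-zero component on the walk axis.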
Reference vector representation
Vectors are sparse by nature
During the learning process they become even sparser
Represented as balanced red-black trees
Tolerance threshold imposed
Terms (dimensions) below the threshold are removed
Significant complexity reduction without negative quality impact
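The thresholding idea can be sketched as follows. Note the swap: the slides store reference vectors in balanced red-black trees, while this sketch uses a plain Python dict as the sparse container. The function names `prune` and `sparse_dot` and the weights are illustrative assumptions.

```python
def prune(vector, threshold):
    """Drop terms (dimensions) whose weight falls below the tolerance threshold."""
    return {term: w for term, w in vector.items() if abs(w) >= threshold}

def sparse_dot(u, v):
    """Dot product that only visits the terms of the smaller vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[term] for term, w in u.items() if term in v)

# Hypothetical reference vector of one map cell.
reference = {"dog": 0.80, "food": 0.55, "walk": 0.02, "take": 0.004}
pruned = prune(reference, threshold=0.01)

document = {"food": 2.0, "cat": 1.0}
score = sparse_dot(pruned, document)
```

Fewer stored terms mean fewer operations per comparison, which is where the complexity reduction comes from.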
Topic-sensitive initialization
Inter-topic similarities are important both for map learning and for visualization/cluster extraction
Simple approach:
– Use LSI to select K main broad topics
– Select K map cells (evenly spread over the map) as the fixpoints for individual topics
– Initialize selected fixpoints with broad topics
– Initialize remaining cells with "in-between values"
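The last two steps of the approach above can be sketched like this. The slides do not say how the "in-between values" are computed, so inverse-distance weighting of the topic vectors is an assumption made here; the LSI step is skipped and the topic vectors are taken as given. The name `init_map` is hypothetical.

```python
import math

def init_map(rows, cols, fixpoints):
    """Initialize a rows x cols map from topic fixpoints.

    fixpoints maps a cell (row, col) to its topic vector. Other cells get an
    inverse-distance weighted average of all topic vectors (an assumption).
    """
    dim = len(next(iter(fixpoints.values())))
    grid = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                grid[(r, c)] = list(fixpoints[(r, c)])
                continue
            weights = {cell: 1.0 / math.dist((r, c), cell) for cell in fixpoints}
            total = sum(weights.values())
            grid[(r, c)] = [
                sum(weights[cell] * fixpoints[cell][i] for cell in fixpoints) / total
                for i in range(dim)
            ]
    return grid

# Two hypothetical broad topics placed at opposite ends of a 1 x 3 map.
grid = init_map(1, 3, {(0, 0): [1.0, 0.0], (0, 2): [0.0, 1.0]})
```

The middle cell sits at equal distance from both fixpoints, so it starts as an even blend of the two topics.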
Clustering document vectors
[Figure: the document space mapped onto a 2D map (labels m, x, r); a thick arrow marks a strong change of position]
Important difference from general clustering: not only do clusters contain similar documents, but neighboring clusters are also similar to each other
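The reason neighboring clusters end up similar is that a self-organizing map moves not only the winning cell towards a document, but also the cells around it. Below is a minimal sketch of one such update step; the Gaussian neighborhood function and the parameter names `alpha` and `sigma` are assumptions, not taken from the slides.

```python
import math

def som_update(grid, doc, winner, alpha, sigma):
    """Pull every cell's reference vector towards doc, weighted by map distance to the winner."""
    for cell, ref in grid.items():
        d2 = (cell[0] - winner[0]) ** 2 + (cell[1] - winner[1]) ** 2
        h = math.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighborhood (assumed)
        for i in range(len(ref)):
            ref[i] += alpha * h * (doc[i] - ref[i])

# Three cells of a hypothetical map: the winner, its neighbor, and a distant cell.
som = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0], (0, 5): [0.0, 0.0]}
som_update(som, doc=[1.0, 1.0], winner=(0, 0), alpha=0.5, sigma=1.0)
```

The winner moves most, its neighbor somewhat less, and the distant cell barely at all.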
Joint winner search
Global winner search: accurate but slow
Local winner search: faster, but can be inaccurate during rapid changes
Start with a single phase of global search
Document movements become smoother during the learning process: usually local search is enough
Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
Top-down approach is possible but requires fixpoints
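A sketch of how the two search modes might be combined. The slides do not specify how sudden moves are detected, so here the caller simply passes a `force_global` flag; that flag, the function names, and the hill-climbing form of the local search are all assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two dense vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def global_winner(grid, doc):
    """Scan every cell: accurate but slow."""
    return min(grid, key=lambda cell: sq_dist(grid[cell], doc))

def local_winner(grid, doc, start):
    """Hill-climb over the 8-neighborhood, starting from the previous winner."""
    current = start
    while True:
        r, c = current
        neighbors = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (r + dr, c + dc) in grid]
        best = min(neighbors, key=lambda cell: sq_dist(grid[cell], doc))
        # Move only on strict improvement, so the loop always terminates.
        if sq_dist(grid[best], doc) >= sq_dist(grid[current], doc):
            return current
        current = best

def joint_winner(grid, doc, previous=None, force_global=False):
    """Global search on the first pass or on request, local search otherwise."""
    if previous is None or force_global:
        return global_winner(grid, doc)
    return local_winner(grid, doc, previous)

# Hypothetical 1 x 5 map with one-dimensional reference vectors.
cells = {(0, c): [float(c)] for c in range(5)}
first = joint_winner(cells, [3.2])                    # global phase
later = joint_winner(cells, [3.2], previous=(0, 0))   # local phase
```

On a smoothly ordered map like this one, the local search reaches the same cell as the global scan while comparing far fewer cells per step.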
Clustering document groups
Numerous methods exist, but none of them is directly applicable:
– Extremely fuzzy structure of topical groups in SOM cells
– Necessity of taking into account similarity measures both in the original document space and in the map space
– Outlier-handling problem during cluster formation
– No a priori estimation of the number of topical groups
Fuzzy C-MEANS applied on the lattice of map cells
Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering
Clustered documents are labeled by weighted centroids of cell reference vectors, scaled with between-group entropy
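For orientation, the membership rule of standard fuzzy C-means is sketched below. It is the textbook formula only; the lattice-based variant and the MST combination described on the slide are not reproduced. The function name `memberships` and the fuzzifier default `m=2.0` are assumptions.

```python
import math

def memberships(point, centers, m=2.0):
    """Degree to which point belongs to each cluster center (the values sum to 1)."""
    dists = [math.dist(point, center) for center in centers]
    if 0.0 in dists:
        # The point coincides with a center: full membership there.
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

centers = [[0.0, 0.0], [2.0, 0.0]]
u_mid = memberships([1.0, 0.0], centers)   # halfway between the centers
u_near = memberships([0.5, 0.0], centers)  # closer to the first center
```

A cell halfway between two groups belongs equally to both, which is the kind of soft assignment that suits the fuzzy structure of topical groups.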
Experiments with map convergence
We examined the convergence of the maps to a stable state depending on:
– type of alpha function (search radius reduction)
– type of winner search method
– type of initialization method
Convergence – alpha functions (linear versus reciprocal)
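The slides do not give the exact formulas of the two alpha functions. As an illustration only, commonly used linear and reciprocal decay schedules look like this; the parameters `alpha0` and `c` and both function names are assumptions.

```python
def linear_alpha(t, total, alpha0=1.0):
    """Decrease linearly from alpha0 at t = 0 to 0 at t = total."""
    return alpha0 * (1.0 - t / total)

def reciprocal_alpha(t, total, alpha0=1.0, c=9.0):
    """Decrease as 1 / (1 + c * t / total): fast at first, slow later."""
    return alpha0 / (1.0 + c * t / total)

# Both schedules sampled at the start, middle and end of a 100-step run.
steps = [0, 50, 100]
linear = [linear_alpha(t, 100) for t in steps]
reciprocal = [reciprocal_alpha(t, 100) for t in steps]
```

With these forms the reciprocal schedule shrinks quickly early on and keeps a small non-zero value at the end, while the linear one reaches zero.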
Convergence – winner search (joint versus local)
Experiments with execution time
The impact of the following factors on the speed of map creation was investigated:
– Map size (total number of cells)
– Optimization methods:
Map quality assessment:
– Compare with 'ideal' map (e.g. without optimizations)
– Identical initialization and learning parameters
– Compute sum of squared distances of location of each document on both maps
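The quality measure in the last item can be sketched as follows, assuming each map is given as a dictionary from a document identifier to its (row, column) cell. The name `map_distance` and the sample data are hypothetical.

```python
def map_distance(ideal, optimized):
    """Sum of squared distances between each document's cell on the two maps."""
    total = 0
    for doc_id, (r1, c1) in ideal.items():
        r2, c2 = optimized[doc_id]
        total += (r1 - r2) ** 2 + (c1 - c2) ** 2
    return total

# Hypothetical placements of three documents on the 'ideal' and the optimized map.
ideal_map = {"a": (0, 0), "b": (2, 3), "c": (4, 4)}
optimized_map = {"a": (0, 1), "b": (2, 3), "c": (2, 4)}
quality = map_distance(ideal_map, optimized_map)
```

A value of zero means every document landed in the same cell on both maps; larger values indicate that the optimizations changed the layout.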
Execution time - map size
Execution time - optimizations
Future research
Maps for joint term-citation model, taking into account between-group link flow direction
Fully distributed map creation
Adaptive document retrieval and clustering:
– Bayesian network based relevance measure
– Survival models for document update rate estimation
– Dead link propagation methods for page freshness estimation
We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
Future research
Bayesian networks will be applied in particular to:
– measure relevance and classify documents
– accelerate document clustering processes
– construct a thesaurus supporting query