Mining document maps. Mieczyslaw Klopotek, Slawomir Wierzchon, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw
Dec 27, 2015
Mining document maps
Mieczyslaw Klopotek Slawomir Wierzchon Michal Draminski Krzysztof Ciesielski
Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences
Warsaw
SAWM 2004 Mining Document Maps
Agenda
Motivation
Our approach
Architecture
User interface
Visualization
Map creation
Clustering
Experimental results
Future directions
Motivation
The Web and intranets are becoming increasingly content-rich: simple ranked lists, or even hierarchies of results, no longer seem adequate
A good way of presenting massive document sets in an understandable form will be crucial in the near future
The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)
Our approach
The presentation method is based on the WebSOM map idea and is enriched with novel methods of document analysis, clustering and visualization
A special architecture has been elaborated to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms
BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis
BEATCA architecture
The preparation of documents is done by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation
The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
The 'best' map (with respect to some similarity measure) is used by the query processor in response to the user's query
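The dictionary optimization step can be illustrated with a minimal sketch. The slides do not give the entropy definition or the cut-off values, so the normalized entropy over documents, the whitespace tokenizer, and the thresholds `max_doc_fraction`, `min_entropy` and `max_entropy` below are all assumptions made for illustration only.

```python
import math
from collections import Counter


def term_entropy(counts):
    """Normalized entropy of a term's occurrence distribution over documents."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def optimize_dictionary(docs, max_doc_fraction=0.8,
                        min_entropy=0.1, max_entropy=0.95):
    """Keep terms that are neither too frequent nor of extreme entropy.

    All three thresholds are hypothetical values, not taken from BEATCA.
    """
    tf = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = set().union(*tf)
    kept = set()
    for term in vocabulary:
        counts = [c[term] for c in tf]  # Counter returns 0 for missing terms
        doc_fraction = sum(1 for c in counts if c > 0) / len(docs)
        h = term_entropy(counts)
        if doc_fraction <= max_doc_fraction and min_entropy <= h <= max_entropy:
            kept.add(term)
    return kept
```

In this sketch a term that occurs in every document ("the") is dropped as too frequent, and a term that occurs in a single document is dropped because its entropy is zero.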
BEATCA architecture
[Figure: Processing flow diagram of BEATCA. Boxes: Internet; Spider (downloading); HT-Base; Indexing + Optimizing; VEC-Base; Mapping; MAP-Base; Clustering of docs; DocGR-Base; Clustering of cells; CellGR-Base; Search Engine; DB Registry]
Example: summaries of documents
Example: S&W frequent phrases
sheep fiends
dairy goats
black sheep
special thanks
sheep and goats
university medical center
public health
medical informatics
information departments
pharmacy related
drugs information
health care
User interface
Search results are presented on a document map
Compact (fuzzy) topical areas are extracted
Query-related summaries are generated on-line
Maps can have one of the following topologies:
traditional flat map (square or hexagonal cells)
rotating 3D map (torus, sphere, cylinder)
hyperbolic map (Poincaré or Klein projections)
growing map (Growing Neural Gas)
User interface
Map visualizations in 3D
Hyperbolic map visualizations
triangular tessellation
hexagonal tessellation
Kohonen learning overview
Unsupervised learning neural network model
Neuron represented by reference vector in document space
Each vector element (term dimension) is the term's TFxIDF weight
Iterative regression of reference vectors onto the document vector space [formula not preserved]
Similarity is computed as the cosine of the angle between the corresponding vectors
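A minimal sketch of one learning step, assuming the standard Kohonen update m_i(t+1) = m_i(t) + alpha(t) h_ci(t) [x(t) - m_i(t)] with the winner c chosen by cosine similarity. The Gaussian neighbourhood function and the names `som_step`, `alpha` and `sigma` are assumptions for illustration; the slides do not state which neighbourhood or learning-rate schedule BEATCA uses.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def som_step(refs, grid, x, alpha, sigma):
    """One Kohonen update; refs[i] is the reference vector of cell grid[i].

    refs is modified in place; the index of the winning cell is returned.
    """
    winner = max(range(len(refs)), key=lambda i: cosine(refs[i], x))
    wr, wc = grid[winner]
    for i, (r, c) in enumerate(grid):
        d2 = (r - wr) ** 2 + (c - wc) ** 2
        h = math.exp(-d2 / (2.0 * sigma ** 2))  # assumed Gaussian neighbourhood
        refs[i] = [m + alpha * h * (xi - m) for m, xi in zip(refs[i], x)]
    return winner
```

Here `x` would be the TFxIDF vector of one document; repeating the step over all documents with decreasing `alpha` and `sigma` gives the usual SOM training loop.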
How are the maps created
A modified WebSOM method is used:
compact reference vector representation
broad-topic initialization method
joint winner search method
multi-level (hierarchical) maps
three-phase document clustering:
• initial grouping via PLSA/PHITS
• WebSOM on document groups
• fuzzy cell cluster extraction and labelling
Reference vector representation
Vectors are sparse by nature
During learning process they become even sparser
Represented as balanced red-black trees
Tolerance threshold imposed
Terms (dimensions) below threshold are removed
Significant complexity reduction without negative quality impact
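The thresholding idea can be sketched as follows. A plain Python `dict` stands in for the balanced red-black tree named on the slide (the standard library has no red-black tree), and the function name `sparse_update` and the default `tolerance` value are assumptions.

```python
def sparse_update(ref, doc, rate, tolerance=1e-3):
    """Move sparse vector ref towards doc by rate, then drop near-zero terms.

    ref and doc map term -> weight; absent terms have weight 0.
    """
    updated = {}
    for term in ref.keys() | doc.keys():
        old = ref.get(term, 0.0)
        value = old + rate * (doc.get(term, 0.0) - old)
        if abs(value) >= tolerance:  # terms below the threshold are removed
            updated[term] = value
    return updated
```

Because every update only touches terms present in either vector, the cost of a step depends on the number of non-zero terms rather than on the dictionary size.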
Topic-sensitive initialization
Inter-topic similarities important both for map learning and visualization/cluster extraction
Simple approach:
Use LSI to select K main broad topics
Select K map cells (evenly spread over the map) as the fixpoints for individual topics
Initialize selected fixpoints with broad topics
Initialize remaining cells with the following rule: [formula not preserved]
Joint winner search
Global winner search: accurate but slow
Local winner search: faster but can be inaccurate during rapid changes
Start with single phase of global search
Document movements become smoother during the learning process: usually local search is enough
Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
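One way to combine the two searches is sketched below. The slides do not say how a "sudden move" is detected, so the fallback criterion used here (best local similarity below `min_similarity`) is an assumption, as are the names `joint_winner`, `radius` and `min_similarity`.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def joint_winner(refs, grid, x, previous, radius=1, min_similarity=0.5):
    """Return (winner index, search mode) for document vector x.

    previous is the cell that won for this document in the last iteration,
    or None in the initial phase, which always uses global search.
    """
    if previous is not None:
        pr, pc = grid[previous]
        local = [i for i, (r, c) in enumerate(grid)
                 if abs(r - pr) <= radius and abs(c - pc) <= radius]
        best = max(local, key=lambda i: cosine(refs[i], x))
        if cosine(refs[best], x) >= min_similarity:
            return best, "local"
    # Assumed fallback: a poor local match is treated as a sudden move.
    best = max(range(len(refs)), key=lambda i: cosine(refs[i], x))
    return best, "global"
```

Local search inspects only the cells around the previous winner, so its cost does not grow with the map size; the global pass is paid only in the first phase and for the occasional fallback.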
Hierarchical maps
Bottom-up approach
Feasible (with joint winner search method)
Start with most detailed map
Compute weighted centroids of map areas: [formula not preserved]
Use them as seeds for coarser map
Top-down approach is possible but requires fixpoints
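The bottom-up step can be sketched as below. The weighting formula is not preserved in the slides, so this sketch assumes that each fine cell is weighted by the number of documents assigned to it and that a map area is a square block of cells; `coarsen`, `doc_counts` and `block` are hypothetical names.

```python
def coarsen(fine, doc_counts, block=2):
    """Build seeds for a coarser map from a detailed one.

    fine[r][c] is the reference vector of a fine cell and doc_counts[r][c]
    the number of documents in it (the assumed weight).
    """
    rows, cols = len(fine), len(fine[0])
    dim = len(fine[0][0])
    coarse = []
    for r0 in range(0, rows, block):
        coarse_row = []
        for c0 in range(0, cols, block):
            total = 0.0
            acc = [0.0] * dim
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    w = doc_counts[r][c]
                    total += w
                    for k in range(dim):
                        acc[k] += w * fine[r][c][k]
            # An area without documents keeps a zero seed vector.
            coarse_row.append([a / total for a in acc] if total else acc)
        coarse.append(coarse_row)
    return coarse
```

The returned vectors would then serve as the initial reference vectors of the coarser map.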
Clustering document groups
Numerous methods exist, but none of them is directly applicable:
Extremely fuzzy structure of topical groups in SOM cells
Necessity of taking into account similarity measures both in the original document space and in the map space
Outlier-handling problem during cluster formation
No a priori estimate of the number of topical groups
Fuzzy C-Means on the lattice of map cells is applied
Graph theoretical approach (density- and distance- based MST) combined with fuzzy clustering
Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy
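For reference, one iteration of plain fuzzy C-means is sketched below. It uses Euclidean distance on generic points and the standard membership formula u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); the lattice-aware variant, the MST step and the entropy-scaled labelling described above are not reproduced, and the function names are illustrative.

```python
import math


def memberships(points, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (rows sum to 1)."""
    result = []
    power = 2.0 / (m - 1.0)
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on one or more centers belongs to them fully.
            hits = dists.count(0.0)
            row = [1.0 / hits if d == 0.0 else 0.0 for d in dists]
        else:
            row = [1.0 / sum((d / other) ** power for other in dists)
                   for d in dists]
        result.append(row)
    return result


def update_centers(points, u, m=2.0):
    """Recompute cluster centers as membership-weighted means of the points."""
    dim = len(points[0])
    centers = []
    for j in range(len(u[0])):
        weights = [row[j] ** m for row in u]
        total = sum(weights)
        centers.append([sum(w * p[k] for w, p in zip(weights, points)) / total
                        for k in range(dim)])
    return centers
```

Alternating `memberships` and `update_centers` until the centers stop moving gives the usual algorithm; in the setting of these slides the points would be cell reference vectors rather than raw documents.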
Example: biomedical documents
Label candidates (5 newsgroups)
Term rank | Cluster #1 (sci.math) | Cluster #2 (sci.med / sci.math) | Cluster #3 (talk.religion.misc) | Cluster #4 (soc.culture.israel) | Cluster #5 (comp.windows.x) | Cluster #6 (talk.politics.misc)
1 | Die | Cipher | Men | Israel | Boot | Funding
2 | Probable | Block | Women | Palestinian | Windows | Study
3 | Theory | Stream | Raped | Gun | Files | Taxes
4 | Registers | Key | Children | Aziz | Menus | Stock
5 | Mathematics | Algorithms | Child | Iraqis | Lib | Health
6 | Equation | Combinations | Sex | Koppel | Icon | Market
7 | Cos | Distinction | Soc | Israeli | Label | Social
8 | Sequence | Encryption | Father | Jews | Folder | Mercer
9 | Tex | Epimethius | Paternity | Resolution | Msvcrtd | Governing
10 | Space | Randomness | Feminist | Oliver | Shortcut | Vaccinations
11 | Gravitational | Smartcard | Trolling | Utah | Netzero | Measurement
12 | Wave | Entropy | White | Firearms | Tab | Bushes
13 | Latex | Yahoo | England | Settlements | Kernel | Computer
14 | Files | Model | Support | Palestine | Installed | Companies
15 | Unsigned | Lottery | Black | Permitted | Backup | Diabetes
Experiments with execution time
The impact of the following factors on the speed of map creation was investigated:
Map size (total number of cells)
Optimization methods:
• dictionary optimization
• reference vector representation
Map quality assessment:
Compare with an 'ideal' map (e.g. one created without optimizations)
Use identical initialization and learning parameters
Compute the sum of squared distances between the locations of each document on the two maps
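The quality measure can be written down directly. This sketch assumes a flat map on which each document's location is the (row, column) of its winning cell; `map_distance` is an illustrative name, and maps with wrap-around topologies (e.g. torus) would need a different cell distance.

```python
def map_distance(ideal, other):
    """Sum of squared distances between document locations on two maps.

    ideal and other map a document id to its (row, col) cell on each map;
    both must contain the same documents.
    """
    return sum((ideal[d][0] - other[d][0]) ** 2 +
               (ideal[d][1] - other[d][1]) ** 2
               for d in ideal)
```

A value of 0 means every document landed in the same cell on both maps; larger values indicate that the optimized map departs further from the 'ideal' one.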
Execution time - map size
Execution time - optimizations
Experiments with map convergence
We examined the convergence of the maps to a stable state depending on:
type of alpha function (search radius reduction)
type of winner search method
type of initialization method
Convergence – alpha functions
Convergence – winner search
Future research
Maps for joint term-citation model, taking into account between-group link flow direction
Fully distributed map creation
Adaptive document retrieval and clustering:
Bayesian-network-based relevance measure
Survival models for document update rate estimation
Dead link propagation methods for page freshness estimation
We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects
Future research
Bayesian networks will be applied in particular to:
• measure relevance and classify documents
• accelerate document clustering processes
• construct a thesaurus supporting query enrichment
• keyword extraction
• between-topic dependencies estimation
Immuno-genetic systems will be used for:
• adaptive document clustering by referring to the mechanism of so-called metadynamics
• extraction of compact characteristics of document groups by exploiting the mechanism of construction of universal and specialized antibodies
• visualization and resolution adjustment of document maps
Thank you!
Any questions?