Page 1
What's in a word ?
Term-based approaches across
bioinformatics, scientometrics and knowledge management
Patrick Glenisson
Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium
Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium
Page 3
Introduction: K.U. Leuven
Faculty of Applied Sciences
Department of Electrical Engineering
Bio-informatics research
clinical bioinformatics
gene regulation bioinformatics
Research on algorithms and software development for:
Text mining
Gibbs sampling
Graphical models
Classification & clustering
Page 4
Introduction: K.U. Leuven
Faculty of Applied Sciences
Department of Electrical Engineering
Bio-informatics research
Text mining research
Combine statistical approaches with domain-specific requirements
Knowledge discovery through literature analysis in various domains:
Bio-informatics
Sciento- & Technometrics
Knowledge management
Page 5
Overview
• Bio-informatics:
  – gene profiling
  – multi-view learning
• Scientific trend mapping
  – clustering and bibliometric indicators
• Innovation & Spillovers
  – Tracing of persons in science & technology spaces
25’
5-10’
Page 6
Overview
[Figure: overall approach, relating text mining goals (Information Retrieval, Information Extraction) to text mining methodology (Shallow Statistics, Shallow Parsing, Full NLP parsing). Further labels: Generic, Problem-specific, Domain-specific; Document analysis & Extraction of tokens]
Page 7
Case 1: Literature & biological data
Page 10
‘Post-genome’ biology focus shift:
- from single gene to gene groups
- complex interactions within cellular environment
microarrays measure the simultaneous activity:
[Figure: gene expression measurement matrix with genes (G1, G2, G3, ..) and conditions (C1, C2, C3, ..), flanked by sample annotations and gene annotations]
Page 11
Clustering Interpretation
[Figure: expression data, gene by conditions]
Page 12
[Figure: expression data, gene by conditions]
gene expression Databases
annotations and relations encoded as free text
PRIOR INFORMATION
Integrated analysis
Page 13
Hence, 2 views:
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
Page 14
A ‘historical’ quote:
‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)
GeneRIF
12133521 VEGF is associated with the development and prognosis of colorectal cancer.
12168088 PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.
11866538 Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex
GO
• cell proliferation
• heparin binding
• growth factor activity
Page 15
• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.
• Structured vocabularies are on the rise
  • GO
  • MeSH
  • eVOC
• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO for gene names
  • GOA
  • …
Increased awareness
Page 16
(GOF) Vector space model
• Document processing
  – Remove punctuation & grammatical structure (‘Bag of words’)
  – Define a vocabulary
    • Identify multi-word terms (e.g., tumor suppressor) (phrases)
    • Eliminate low-content words (e.g., and, gene, ...) (stopwords)
    • Map words with same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
  – Define weighting scheme and/or transformations (tf-idf, SVD, ..)
• index
[Figure: a gene represented as a vector in the vocabulary space spanned by terms T1, T2, T3]
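The document-processing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the talk: the stopword list is hypothetical, phrase detection, synonym mapping and stemming are left out, and the weighting is a plain tf-idf with L2 normalisation. The three example texts are the GeneRIF sentences quoted earlier in the slides.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"and", "the", "of", "is", "in", "by", "with", "gene"}

def tokenize(text):
    """Lowercase, replace punctuation by spaces, drop stopwords (bag of words)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def tfidf_index(docs):
    """Return (vocabulary, vectors); each vector is a dict term -> weight."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in token_lists for term in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return sorted(df), vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [
    "VEGF is associated with the development and prognosis of colorectal cancer.",
    "PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.",
    "Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex",
]
vocabulary, vectors = tfidf_index(docs)
```

In this toy example the first two texts share the terms "vegf" and "cancer", so their cosine similarity is positive, while the first and third share no indexed term. A synonym map (VEGF and vascular endothelial growth factor) would be needed to link them.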
Page 17
Validity of gene index
Genes that are functionally related should be close in text space:
Text-based coherence score, modeled with respect to a background distribution obtained through random and permuted gene groups
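One way to make such a coherence score concrete is sketched below. The slides do not give the formula, so everything here is an assumption: coherence is taken as the mean pairwise cosine similarity of the genes' term profiles, and the background distribution is built from random gene groups of the same size. The gene identifiers and term weights are invented toy data.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse term profiles (dict term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def coherence(group, profiles):
    """Mean pairwise similarity within a group of at least two genes."""
    pairs = list(combinations(group, 2))
    return sum(cosine(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

def empirical_p(group, profiles, n_random=1000, seed=0):
    """Compare the observed coherence with random groups of equal size."""
    rng = random.Random(seed)
    observed = coherence(group, profiles)
    genes = sorted(profiles)
    hits = sum(
        coherence(rng.sample(genes, len(group)), profiles) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_random + 1)

# Hypothetical toy profiles: g1..g3 share vocabulary, g4..g6 do not.
profiles = {
    "g1": {"angiogenesis": 1.0, "growth": 0.5},
    "g2": {"angiogenesis": 0.8, "growth": 0.6},
    "g3": {"angiogenesis": 0.9},
    "g4": {"apoptosis": 1.0},
    "g5": {"transcription": 1.0},
    "g6": {"transport": 1.0},
}
score, p_value = empirical_p(["g1", "g2", "g3"], profiles, n_random=200)
```

A functionally related group should score well above most random groups, which shows up as a small empirical p-value.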
Page 18
Validity of gene index
Genes that are functionally related should be close in text space:
Page 19
Validity of gene index
Genes that are functionally related should be close in text space:
Page 20
Data-centered statistical scores
Coherence vs separation of clusters
Stability of a cluster solution when leaving out data
Define ‘optimal’?
Optimal number of clusters?
[Figure: clusters C1, C2, C3]
Text-based scoring
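As an illustration of a data-centered score that trades off coherence against separation, the sketch below computes the average silhouette width. This is one common choice and is an assumption here; the slides do not say which statistic was used. The points and labels are invented.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels):
    """Average silhouette width; labels[i] is the cluster of points[i]."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster contributes a silhouette of 0
        # a: mean distance to the own cluster (coherence)
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
```

Computing such a score for several numbers of clusters and picking the maximum is one data-centered answer to the "optimal number of clusters" question; text-based scoring offers an independent second opinion.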
Page 21
Data-centered statistical scores
Knowledge-based scores
Enrichment of GO annotations in clusters
Literature-based scoring
Define ‘optimal’?
Optimal number of clusters?
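Enrichment of a GO annotation in a cluster is commonly assessed with a hypergeometric tail probability. The sketch below assumes that test; the slides do not specify the statistic, and the parameter names are hypothetical.

```python
from math import comb

def enrichment_p(n_population, n_annotated, n_cluster, n_hits):
    """P(X >= n_hits) when n_cluster genes are drawn without replacement
    from n_population genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    total = comb(n_population, n_cluster)
    tail = sum(
        comb(n_annotated, k) * comb(n_population - n_annotated, n_cluster - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total
```

For example, `enrichment_p(6000, 60, 100, 10)` asks how surprising it is to find 10 annotated genes in a cluster of 100 when only 60 of 6000 genes carry the term (numbers invented for illustration). When many GO terms are tested per cluster, a multiple-testing correction would normally follow.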
Page 22
Collaborative gene filtering
Page 23
TXTGate
• a platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.
• incorporates term-based indices ..
• .. and uses them as a starting point
  – to explore the text through the eyes of different domain vocabularies
  – to link out to other resources by query building, or
  – to sub-cluster genes based on text.
Page 24
Term-centric
Gene-centric
Domain vocabularies as ‘views’
Page 25
Query building to external DB
Page 26
• Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies
• … that allow some level of interoperability with external annotation databases
• Sub-clustering gene groups useful to detect biological sub-patterns
• Reasonably robust to corrupted groups
• Gene index normalizes for unbalanced references
Features of the approach
Page 27
• Text analysis for interpretation (supportive role)
• Text analytics for ‘inference’ (active role)
Page 28
Meta-clustering text & data
• As multiple information sources are available when analyzing gene expression data, we pose the question:
“How can we analyze data in an integrated fashion to extract more information than from the expression data alone?”
..
Page 29
Mathematical integration
Page 30
• In each information space
  – Appropriate preprocessing
  – Choice of distance measures
Integration of text & data
Page 31
• Combine data:
• confidence attributed to either of the two data types
• in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
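The combination formula itself did not survive on this slide. A weighted sum of the two distance matrices is one reading that fits both bullets, and the sketch below assumes it: `weight_text` (a hypothetical name) expresses the confidence placed in the text representation and also acts as the scaling constant between the two norms.

```python
def combine_distances(d_expr, d_text, weight_text=0.5):
    """Weighted sum of two square distance matrices given as nested lists.

    Assumption: weight_text goes to the text distances and
    1 - weight_text to the expression distances.
    """
    if not 0.0 <= weight_text <= 1.0:
        raise ValueError("weight_text must lie in [0, 1]")
    if len(d_expr) != len(d_text):
        raise ValueError("matrices must cover the same genes")
    w = weight_text
    return [
        [(1.0 - w) * e + w * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]

# Hypothetical 2-gene example.
d_expr = [[0.0, 2.0], [2.0, 0.0]]
d_text = [[0.0, 4.0], [4.0, 0.0]]
d_combined = combine_distances(d_expr, d_text, weight_text=0.25)
```

Note that a fixed weight only means what it says if both distance types live on comparable scales.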
Page 32
• However, the distributions of distances invoke a bias (scaling problem)
• Therefore, use a technique from statistical meta-analysis (so-called omnibus procedure)
[Figure: histograms of expression distances and text distances]
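A minimal sketch of such an omnibus combination follows. It assumes Fisher's method as the omnibus statistic and empirical cumulative probabilities as the per-source "p-values"; the slides only name the family of procedures, so both choices are assumptions. For two sources, the chi-square tail with 4 degrees of freedom has the closed form p1·p2·(1 − ln(p1·p2)).

```python
import math
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function giving the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def fisher_combine(p1, p2):
    """Fisher's omnibus p-value for two probabilities in (0, 1]."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def omnibus_scores(expr_dists, text_dists):
    """Combine paired distances (one pair per gene pair) on a common scale."""
    cdf_e = empirical_cdf(expr_dists)
    cdf_t = empirical_cdf(text_dists)
    return [
        fisher_combine(cdf_e(e), cdf_t(t))
        for e, t in zip(expr_dists, text_dists)
    ]

# Hypothetical distances on very different scales.
expr = [0.1, 0.5, 0.9]
text = [10.0, 50.0, 90.0]
combined = omnibus_scores(expr, text)
```

Because only the rank of each distance within its own distribution enters, the result is unchanged when one source is rescaled, which is the point of the procedure. Small combined values indicate gene pairs that are close in both spaces.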
Page 33
[Figure: M-score of integrated clustering compared with M-score of expression data only, for various cutoffs k of the cluster tree]
Optimal k?
Page 35
A peek inside
[Figure: Expression Profile and Text Profile]
Strong re-enforcement
Page 36
Case 2: Sciento- & technometrics
Page 37
Mapping of Science
• Journal ‘Scientometrics’
• Full-text articles
• Document cluster analysis
• Co-word mapping
• Temporal dimension: clusters over time
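Co-word mapping starts from counts of how often two terms occur in the same document. The sketch below shows only that counting step, on invented term lists; how the talk turned such counts into a map is not stated in the slides.

```python
from collections import Counter
from itertools import combinations

def coword_counts(docs_terms):
    """Count in how many documents each unordered term pair co-occurs."""
    pair_counts = Counter()
    for terms in docs_terms:
        # Sorting gives every pair a canonical (alphabetical) order.
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical indexed documents.
docs_terms = [
    ["citation", "journal", "impact"],
    ["citation", "impact"],
    ["patent", "citation"],
]
counts = coword_counts(docs_terms)
```

Running the same count per publication year would give one co-word matrix per time slice, which is one simple way to add the temporal dimension.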
Page 38
Mapping of Science
• Coupling with bibliometric indicators
  – Based on reference (hyperlink) information
  – Mean reference Age
  – Nr Serials
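As an illustration of a reference-based indicator, the sketch below computes a mean reference age, taken here as the average difference between a paper's publication year and the years of its cited references. That definition, the per-cluster averaging over papers, and the example years are assumptions.

```python
def mean_reference_age(publication_year, reference_years):
    """Average age, in years, of the cited references at publication time."""
    if not reference_years:
        raise ValueError("paper has no dated references")
    ages = [publication_year - year for year in reference_years]
    return sum(ages) / len(ages)

def cluster_mean_reference_age(papers):
    """papers: list of (publication_year, [reference years]).

    Papers without dated references are skipped.
    """
    values = [mean_reference_age(y, refs) for y, refs in papers if refs]
    return sum(values) / len(values)

# Hypothetical cluster of two papers.
cluster = [(2004, [2000, 2002, 1998]), (2003, [2001])]
age = cluster_mean_reference_age(cluster)
```

Comparing this value across document clusters gives a rough sense of which topics draw on older or more recent literature.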
Page 39
Domain studies in Patent space
30 technology classes
‘Seed’ patent
Similarities
Page 40
User profiling & Author-Inventor linkage
• Name resolution
  – Same persons (variants, mistakes)
      Van Veldhoven / Veldhoven, Van
      Vanveldhoven / Van Veldhoven
  – Different persons (similar initials, or even full name)
      Wim Van Veldhoven / Walter Van Veldhoven
      Wim Van Veldhoven / Wim Van Veldhoven
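The variant cases above (word order, spacing, case) can be collapsed with a simple normalisation key, sketched below. This is an illustrative assumption, not the matching procedure from the talk, and it deliberately does nothing about the harder case of different persons sharing a name.

```python
def name_key(name):
    """Collapse ordering, case, spacing and punctuation variants into one key."""
    if "," in name:
        # "Veldhoven, Van" -> "Van Veldhoven"
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    return "".join(ch for ch in name.lower() if ch.isalpha())

def group_by_key(names):
    """Group raw name strings that share a normalised key."""
    groups = {}
    for raw in names:
        groups.setdefault(name_key(raw), []).append(raw)
    return groups

groups = group_by_key(["Van Veldhoven", "Veldhoven, Van", "Vanveldhoven"])
```

All three spellings fall into one group, while "Wim Van Veldhoven" and "Walter Van Veldhoven" keep distinct keys. Two different people both called Wim Van Veldhoven still collide, which is why content has to be brought in.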
Page 41
Content-based name matching
• Detect spillovers and entrepreneurial activities at (e.g.) university-level
• Matching of ‘inventors’ & ‘authors’ is time-consuming; semi-automated approach:
[Figure: Patent DB and Publication DB linked through relevance ranking]
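A relevance ranking of this kind can be sketched as follows: for one inventor's patent text, candidate authors with a matching name are ordered by the textual similarity of their publications. Jaccard word overlap stands in for whatever similarity measure was actually used, and the identifiers and texts are invented.

```python
def jaccard(text_a, text_b):
    """Word-set overlap between two texts, in [0, 1]."""
    sa, sb = set(text_a.lower().split()), set(text_b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rank_candidates(patent_text, publications):
    """publications: list of (author_id, text).

    Returns (similarity, author_id) tuples, most similar first.
    """
    scored = [(jaccard(patent_text, text), author_id)
              for author_id, text in publications]
    return sorted(scored, reverse=True)

# Hypothetical candidates sharing the inventor's name.
ranking = rank_candidates(
    "microarray gene expression clustering method",
    [
        ("author_A", "clustering of microarray gene expression data"),
        ("author_B", "medieval history of leuven"),
    ],
)
```

The ranked list is meant to be reviewed by a person, which keeps the approach semi-automated: the ranking shortens the manual check rather than replacing it.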
Page 42
Acknowledgements
Steunpunt O&O Statistieken: Debackere K, Glänzel W
ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D
ESAT / BioI: Moreau Y, De Moor B
Page 43
Thanks! Questions?
CONTACT INFO:
[email protected]