What's in a word ? Term-based approaches across bioinformatics, scientometrics and knowledge management Patrick Glenisson Bio-informatics group Dept Electrical.

What's in a word ?

Term-based approaches across

bioinformatics, scientometrics and knowledge management

Patrick Glenisson

Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium

Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

2

Introduction

3

Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research

clinical bioinformatics

gene regulation bioinformatics

Research on algorithms and software development for:

Text mining

Gibbs sampling

Graphical models

Classification & clustering

4

Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research

Text mining research

Combine statistical approaches with domain-specific requirements

Knowledge discovery through literature analysis in various domains:

Bio-informatics

Sciento- & Technometrics

Knowledge management

5

Overview

• Bio-informatics:
– gene profiling
– multi-view learning

• Scientific trend mapping
– clustering and bibliometric indicators

• Innovation & Spillovers
– Tracing of persons in science & technology spaces

25’

5-10’

6

Overview

Information Retrieval

Information Extraction

Full NLP parsing

Shallow Statistics

Generic

Problem-specific

Domain-specific

Shallow Parsing

Document analysis & Extraction of tokens

Text mining goals

Text mining methodology

Overall approach

7

Case 1: Literature & biological data

8

9

protein

10

‘Post-genome’ biology focus shift:

- from single gene to gene groups
- complex interactions within cellular environment

microarrays measure the simultaneous activity:

Gene expression measurement

[Figure: expression matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), flanked by sample annotations and gene annotations]

11

Clustering Interpretation

gene

conditions

Expression data

12

gene

conditions

Expression data

gene expression Databases

annotations and relations encoded as free text

PRIOR INFORMATION

Integrated analysis

13

Hence, 2 views:

• Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

14

A ‘historical’ quote:

‘Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)

GeneRIF

12133521: VEGF is associated with the development and prognosis of colorectal cancer.

12168088: PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.

11866538: Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex

GO

• cell proliferation

• heparin binding

• growth factor activity

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&doptcmdl=DocSum&list_uids=12133521



15

• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.

• Structured vocabularies are on the rise
– GO
– MeSH
– eVOC

• Standards are systematically being adopted to store biological concepts or annotations:
– HUGO for gene names
– GOA
– …

Increased awareness

16

(GOF) Vector space model

• Document processing
– Remove punctuation & grammatical structure (‘Bag of words’)
– Define a vocabulary
  • Identify multi-word terms (e.g., tumor suppressor) (phrases)
  • Eliminate words with low content (e.g., and, gene, ...) (stopwords)
  • Map words with the same meaning (synonyms)
  • Strip plurals, conjugations, ... (stemming)
– Define weighting scheme and/or transformations (tf-idf, svd, ..)

• index

[Figure: a gene represented as a vector over the vocabulary terms T 1, T 2, T 3]

F.E.T.E.W
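The document-processing steps above can be sketched in a few lines of standard-library Python. This is only an illustration, not the pipeline used for the gene index: the stopword list, the phrase table, the crude plural stripper (a stand-in for real stemming) and the function names `tokenize` and `tfidf_index` are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Illustrative resources (assumptions, not the vocabulary used in the talk).
STOPWORDS = {"and", "the", "of", "in", "is", "with", "gene"}
PHRASES = {("tumor", "suppressor"): "tumor_suppressor"}


def strip_plural(word):
    """Crude stand-in for stemming: drop a trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(text):
    """Bag of words: drop punctuation, stem, merge phrases, remove stopwords."""
    words = [strip_plural(w) for w in re.findall(r"[a-z0-9]+", text.lower())]
    merged, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return [w for w in merged if w not in STOPWORDS]


def tfidf_index(docs):
    """docs: dict name -> text. Returns dict name -> {term: tf-idf weight}."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    index = {}
    for name, c in counts.items():
        total = sum(c.values())
        index[name] = {
            term: (freq / total) * math.log(n_docs / doc_freq[term])
            for term, freq in c.items()
        }
    return index
```

Terms that occur in every document get weight zero under this tf-idf variant, which is the intended effect of down-weighting uninformative words.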

17

Validity of gene index

Genes that are functionally related should be close in text space:

Modeled with respect to a background distribution obtained through random and permuted gene groups

Text-based coherence score
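One possible reading of such a coherence score, given as a sketch: take the mean pairwise cosine similarity of the genes in a group and compare it with the same statistic for randomly drawn groups of equal size. The slides do not give the exact definition, so the score, the function names and the `index` layout (gene name mapped to a term-weight dictionary) are assumptions.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(index, genes):
    """Mean pairwise cosine similarity of a gene group (needs >= 2 genes)."""
    pairs = list(combinations(genes, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)


def empirical_p(index, genes, n_random=1000, seed=0):
    """Fraction of random groups that are at least as coherent as `genes`."""
    rng = random.Random(seed)
    observed = coherence(index, genes)
    pool = sorted(index)
    hits = sum(
        coherence(index, rng.sample(pool, len(genes))) >= observed
        for _ in range(n_random)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_random + 1)
```

A small p-value then suggests that the group is closer in text space than random gene groups of the same size.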

18



19



20

Data-centered statistical scores

Coherence vs separation of clusters

Stability of a cluster solution when leaving out data

Define ‘optimal’ ?

Optimal number of clusters ?

C1

C3

C2

Text-based scoring

21

Data-centered statistical scores

Knowledge-based scores

Enrichment of GO annotations in clusters

Literature-based scoring

Define ‘optimal’ ?

Optimal number of clusters ?
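As an illustration of a knowledge-based score, enrichment of a GO annotation in a cluster is commonly assessed with a one-sided hypergeometric test. The slides do not state which test was used, so this choice and the function name `enrichment_pvalue` are assumptions.

```python
from math import comb


def enrichment_pvalue(n_genome, n_annotated, n_cluster, n_hits):
    """P(X >= n_hits) for a hypergeometric draw.

    n_genome:    number of genes in the background set
    n_annotated: background genes carrying the GO term
    n_cluster:   number of genes in the cluster
    n_hits:      cluster genes carrying the GO term
    """
    upper = min(n_cluster, n_annotated)
    favourable = sum(
        comb(n_annotated, i) * comb(n_genome - n_annotated, n_cluster - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / comb(n_genome, n_cluster)
```

For example, `enrichment_pvalue(6000, 40, 100, 8)` would ask how surprising it is to find 8 annotated genes in a cluster of 100 when only 40 of 6000 background genes carry the term (numbers chosen purely for illustration).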

22

Collaborative gene filtering

23

TXTGate

• a platform that offers multiple ‘views’ on the vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.

• incorporates term-based indices ..

• .. and uses them as a starting point
– to explore the text through the eyes of different domain vocabularies
– to link out to other resources by query building, or
– to sub-cluster genes based on text.

24

Term-centric

Gene-centric

Domain vocabularies as ‘views’

25

Query building to external DB

26

• Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies

• … that allow some level of interoperability with external annotation databases

• Sub-clustering gene groups useful to detect biological sub-patterns

• Reasonably robust to corrupted groups

• Gene index normalizes for unbalanced references

Features of the approach

27

• Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

28

Meta-clustering text & data

• As multiple information sources are available when analyzing gene expression data, we pose the question:

“How can we analyze data in an integrated fashion to extract more information than from the expression data alone ? ”

..

29

Mathematical integration

30

• In each information space

– Appropriate preprocessing
– Choice of distance measures

Integration of text & data

31

• Combine data:

• confidence attributed to either of the two data types

• in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
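A minimal sketch of such a combination, assuming a simple convex weighting of two equally shaped distance matrices. The formula on the original slide is not preserved, so the form `weight * d_expr + (1 - weight) * d_text` and the name `combine_distances` are assumptions.

```python
def combine_distances(d_expr, d_text, weight=0.5):
    """Weighted sum of two distance matrices given as lists of lists.

    `weight` is the confidence placed in the expression data;
    `1 - weight` is the confidence placed in the text data.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]
```

With distances on very different scales, the weight also acts as the scaling constant between the two representations, which is why the choice of `weight` cannot be read as a pure confidence.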

32

• However, the distributions of distances introduce a bias: a scaling problem

• Therefore, use technique from statistical meta-analysis (so-called omnibus procedure)

[Figure: histogram of expression distances and histogram of text distances]
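To illustrate the idea, the sketch below converts each distance to its empirical rank in its own distribution and combines the two resulting p-values with Fisher's method, one well-known omnibus procedure from meta-analysis. The slides do not say which omnibus statistic was used, so Fisher's method, the flat list-of-pairwise-distances layout and all function names are assumptions.

```python
import math
from bisect import bisect_right


def empirical_pvalues(distances):
    """Map each distance to the fraction of observed distances <= it.

    The result lies in (0, 1] and no longer depends on the original scale.
    """
    ordered = sorted(distances)
    n = len(ordered)
    return [bisect_right(ordered, d) / n for d in distances]


def fisher_combine(p1, p2):
    """Fisher's method for two p-values.

    -2 * ln(p1 * p2) follows a chi-square distribution with 4 degrees of
    freedom, whose upper tail has the closed form used here.
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))


def omnibus_distances(d_expr, d_text):
    """Scale-free combined dissimilarity for matching lists of pair distances."""
    p_expr = empirical_pvalues(d_expr)
    p_text = empirical_pvalues(d_text)
    return [fisher_combine(a, b) for a, b in zip(p_expr, p_text)]
```

A small combined value means the pair is close in both spaces relative to all other pairs, regardless of the units of either distance.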

33

[Figure: M-score of the integrated clustering versus M-score of expression data only, for various cutoffs k of the cluster tree]

Optimal k ?

34

A peek inside

35

A peek inside

Expression Profile

Text Profile

Strong re-enforcement

36

Case 2: Sciento- & technometrics

37

Mapping of Science

• Journal ‘Scientometrics’

• Full-text articles
• Document cluster analysis
• Co-word mapping
• Temporal dimension: clusters over time

38

Mapping of Science

• Coupling with bibliometric indicators:
– Based on reference (hyperlink) information
– Mean reference age
– Nr Serials

39

Domain studies in Patent space

30 technology classes

‘Seed’ patent

Similarities

40

User profiling & Author-Inventor linkage

• Name resolution
– Same persons (variants, mistakes)
  Van Veldhoven / Veldhoven, Van
  Vanveldhoven / Van Veldhoven
– Different persons (similar initials, or even full name)
  Wim Van Veldhoven / Walter Van Veldhoven
  Wim Van Veldhoven / Wim Van Veldhoven
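The name variants above suggest a normalization step before any comparison. The following is a hedged sketch: the `(given name, surname)` record layout, the three outcome labels and the function names `name_key` and `compare` are assumptions for illustration, not the matching rules used in the study.

```python
import re
import unicodedata


def name_key(name):
    """Normalise a name variant to a comparison key.

    'Van Veldhoven', 'Veldhoven, Van' and 'Vanveldhoven' all map to
    'vanveldhoven'; accents are stripped ('Glänzel' -> 'glanzel').
    """
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    return re.sub(r"[^a-z]", "", ascii_name.lower())


def compare(a, b):
    """Compare two (given_name, surname) records.

    Returns 'different', 'ambiguous' (an initial matches a full given name)
    or 'candidate' (same normalised name, which still needs content-based
    checking because different persons can share a full name).
    """
    (given_a, surname_a), (given_b, surname_b) = a, b
    if name_key(surname_a) != name_key(surname_b):
        return "different"
    key_a, key_b = name_key(given_a), name_key(given_b)
    if key_a == key_b:
        return "candidate"
    if key_a and key_b and 1 in (len(key_a), len(key_b)) and key_a[0] == key_b[0]:
        return "ambiguous"
    return "different"
```

Name normalization alone cannot separate two different persons called Wim Van Veldhoven, which is why a content-based step is needed afterwards.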

41

Content-based name matching

• Detect spillovers and entrepreneurial activities at (e.g.) university-level

• Matching of ‘inventors’ & ‘authors’ is time-consuming, hence a semi-automated approach:

Patent DB Publication DB

Relevance ranking
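A relevance ranking of this kind could, for instance, score each candidate author by the textual similarity between an inventor's patent and the author's publications. The sketch below uses raw term counts and cosine similarity; the input layout (author id mapped to concatenated publication text) and the function names are assumptions, and the actual system may weight terms differently.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Raw term-count vector of a text (lower-cased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two Counters; missing terms count as zero."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(patent_text, publications):
    """Rank candidate authors by similarity to the patent text.

    publications: dict author_id -> concatenated publication text.
    Returns a list of (score, author_id), best match first.
    """
    query = term_vector(patent_text)
    scored = [
        (cosine(query, term_vector(text)), author)
        for author, text in publications.items()
    ]
    return sorted(scored, reverse=True)
```

The top of the ranked list can then be handed to a human reviewer, which keeps the approach semi-automated rather than fully automatic.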

42

Acknowledgements

Steunpunt O&O Statistieken: Debackere K, Glänzel W

ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D

ESAT / BioI: Moreau Y, De Moor B

http://www.steunpuntoos.be/kd.html

http://www.steunpuntoos.be/wg.html

43

Thanks !

CONTACT INFO:

[email protected]
