What's in a word ? Term-based approaches across bioinformatics, scientometrics and knowledge management Patrick Glenisson Bio-informatics group Dept Electrical Engineering K.U.Leuven, Belgium Steunpunt O&O Statistieken Faculty of Economy K.U.Leuven, Belgium

Dec 19, 2015

Transcript
Page 1:

What's in a word ?

Term-based approaches across bioinformatics, scientometrics and knowledge management

Patrick Glenisson

Bio-informatics group, Dept. Electrical Engineering, K.U.Leuven, Belgium

Steunpunt O&O Statistieken, Faculty of Economy, K.U.Leuven, Belgium

Page 2:

Introduction

Page 3:

Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research

clinical bioinformatics

gene regulation bioinformatics

Research on algorithms and software development for:

Text mining

Gibbs sampling

Graphical models

Classification & clustering

Page 4:

Introduction: K.U. Leuven

Faculty of Applied Sciences

Department of Electrical Engineering

Bio-informatics research

Text mining research

Combine statistical approaches with domain-specific requirements

Knowledge discovery through literature analysis in various domains:

Bio-informatics

Sciento- & Technometrics

Knowledge management

Page 5:

Overview

• Bio-informatics:
  – gene profiling
  – multi-view learning

• Scientific trend mapping
  – clustering and bibliometric indicators

• Innovation & Spillovers
  – Tracing of persons in science & technology spaces

25’

5-10’

Page 6:

Overview

Diagram labels:

Text mining goals: Information Retrieval, Information Extraction

Text mining methodology: Document analysis & extraction of tokens, Shallow statistics, Shallow parsing, Full NLP parsing

Generic, Problem-specific, Domain-specific

Overall approach

Page 7:

Case 1: Literature & biological data

Page 8:

Page 9:

protein

Page 10:

‘Post-genome’ biology focus shift:

- from single gene to gene groups
- complex interactions within cellular environment

Microarrays measure the simultaneous activity:

Figure: gene expression measurement matrix of genes (G1, G2, G3, ..) by conditions (C1, C2, C3, ..), with sample annotations and gene annotations.

Page 11:

Clustering Interpretation

gene

conditions

Expression data

Page 12:

gene

conditions

Expression data

gene expression Databases

annotations and relations encoded as free text

PRIOR INFORMATION

Integrated analysis

Page 13:

Hence, 2 views:

• Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

Page 14:

A ‘historical’ quote:

`Until now it has been largely overlooked that there is little difference between retrieving an abstract from MEDLINE and downloading an entry from a biological database’ (M. Gerstein, 2001)

GeneRIF:

12133521 VEGF is associated with the development and prognosis of colorectal cancer.

12168088 PTEN modulates angiogenesis in prostate cancer by regulating VEGF expression.

11866538 Vascular endothelial growth factor modulates the Tie-2:Tie-1 receptor complex

GO:

• cell proliferation

• heparin binding

• growth factor activity

Page 15:

• Controlled vocabularies are of great value when constructing interoperable and computer-parsable systems.

• Structured vocabularies are on the rise
  • GO
  • MeSH
  • eVOC

• Standards are systematically being adopted to store biological concepts or annotations:
  • HUGO for gene names
  • GOA
  • …

Increased awareness

Page 16:

(GOF) Vector space model

• Document processing
  – Remove punctuation & grammatical structure (`Bag of words’)
  – Define a vocabulary
    • Identify multi-word terms (e.g., tumor suppressor) (phrases)
    • Eliminate low-content words (e.g., and, gene, ...) (stopwords)
    • Map words with the same meaning (synonyms)
    • Strip plurals, conjugations, ... (stemming)
  – Define weighting scheme and/or transformations (tf-idf, SVD, ..)

• Index

Figure: a gene shown as a vector in the vocabulary space spanned by terms T 1, T 2 and T 3.
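The document-processing steps listed on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the crude plural stripping, and the function names (`tokenize`, `tfidf_index`, `cosine`) are hypothetical and not taken from the talk; phrase detection and synonym mapping are left out.

```python
import math
import re
from collections import Counter

# Illustrative stopword list only; a real one would be much longer.
STOPWORDS = {"and", "the", "of", "in", "is", "by", "with", "gene"}

def tokenize(text):
    """Lowercase, drop punctuation, crudely strip a plural 's', drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stemmed = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return [t for t in stemmed if t not in STOPWORDS]

def tfidf_index(docs):
    """Return (vocabulary, list of sparse tf-idf vectors as dicts)."""
    bags = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*bags))
    n_docs = len(docs)
    doc_freq = {t: sum(1 for b in bags if t in b) for t in vocab}
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = [{t: count * idf[t] for t, count in b.items()} for b in bags]
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

A gene index in this spirit would pool the text linked to each gene into one document before calling `tfidf_index`, so that each gene ends up as one vector in term space.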

Page 17:

Validity of gene index

Genes that are functionally related should be close in text space:

Modeled with respect to a background distribution obtained through random and permuted gene groups

Text-based coherence score
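A text-based coherence score with a random-group background could look roughly like the sketch below. The slide does not define the score, so the choices here are assumptions: coherence is taken as the mean pairwise cosine similarity of the genes' text profiles, and the background is built from random gene groups of the same size. All function names are hypothetical.

```python
import math
import random
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def coherence(group, index):
    """Mean pairwise cosine similarity of the text profiles in a gene group."""
    pairs = list(combinations(group, 2))
    return sum(cosine(index[a], index[b]) for a, b in pairs) / len(pairs)

def empirical_p_value(group, index, n_random=1000, seed=0):
    """Share of random same-size gene groups that are at least as coherent."""
    rng = random.Random(seed)
    genes = sorted(index)
    observed = coherence(group, index)
    hits = sum(
        coherence(rng.sample(genes, len(group)), index) >= observed
        for _ in range(n_random)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)
```

A small p-value then suggests that the group is closer in text space than groups drawn at random from the same index.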

Page 18:

Validity of gene index

Genes that are functionally related should be close in text space:

Page 19:

Validity of gene index

Genes that are functionally related should be close in text space:

Page 20:

Data-centered statistical scores

Coherence vs separation of clusters

Stability of a cluster solution when leaving out data

Define `optimal’ ?

Optimal number of clusters ?

C1

C3

C2

Text-based scoring
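One widely used data-centered score that weighs coherence against separation is the silhouette width. The slide does not say which score was used, so the sketch below is an assumption offered for illustration; `silhouette` is a hypothetical helper that works on a full distance matrix.

```python
def silhouette(dist, labels):
    """Mean silhouette width from a square distance matrix and cluster labels."""
    n = len(labels)
    widths = []
    for i in range(n):
        # Collect distances from item i to every other item, per cluster.
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster or one-cluster solution
            continue
        a = sum(own) / len(own)  # coherence: mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        widths.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(widths) / n
```

Computing this value for several candidate numbers of clusters and picking the maximum is one possible reading of `optimal’ here; a text-based score could be compared against it in the same way.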

Page 21:

Data-centered statistical scores

Knowledge-based scores

Enrichment of GO annotations in clusters

Literature-based scoring

Define `optimal’ ?

Optimal number of clusters ?
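Enrichment of a GO annotation in a cluster is commonly tested with a hypergeometric tail probability. Whether the talk used this exact test is not stated, so treat the following as a sketch under that assumption; the function name and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment(n_genes, n_annotated, cluster_size, hits):
    """P(X >= hits) when cluster_size genes are drawn without replacement
    from n_genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, cluster_size)
    total = comb(n_genes, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, `hypergeom_enrichment(1000, 20, 10, 5)` asks how surprising it is to find 5 annotated genes in a cluster of 10 when only 20 of 1000 genes carry the term.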

Page 22:

Collaborative gene filtering

Page 23:

TXTGate

• a platform that offers multiple ‘views’ on vast amounts of (gene-based) free-text information available in selected curated database entries & linked scientific publications.

• incorporates term-based indices ..

• .. and uses them as a starting point
  – to explore the text through the eyes of different domain vocabularies
  – to link out to other resources by query building, or
  – to sub-cluster genes based on text.

Page 24:

Term-centric

Gene-centric

Domain vocabularies as ‘views’

Page 25:

Query building to external DB

Page 26:

• Flexible tool for analyzing gene groups (~100 genes) thanks to various term- and gene-centric vocabularies

• … that allow some level of interoperability with external annotation databases

• Sub-clustering gene groups useful to detect biological sub-patterns

• Reasonably robust to corrupted groups

• Gene index normalizes for unbalanced references

Features of the approach

Page 27:

• Text analysis for interpretation (supportive role)

• Text analytics for ‘inference’ (active role)

Page 28:

Meta-clustering text & data

• As multiple information sources are available when analyzing gene expression data, we pose the question:

“How can we analyze data in an integrated fashion to extract more information than from the expression data alone ? ”

..

Page 29:

Mathematical integration

Page 30:

• In each information space

  – Appropriate preprocessing
  – Choice of distance measures

Integration of text & data

Page 31:

• Combine data:

• confidence attributed to either of the two data types

• in case of distance, we can see it as a scaling constant between the norms of the data- and text representations.
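The formula on this slide did not survive in the transcript. A common way to combine two distance matrices with a single confidence parameter is a convex combination, and the sketch below assumes that form; the name `weight` and the function `combine_distances` are hypothetical, not the talk's notation.

```python
def combine_distances(d_expr, d_text, weight):
    """Convex combination of two same-shaped distance matrices.

    weight = 1.0 trusts the expression data only, weight = 0.0 the text only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        [weight * e + (1.0 - weight) * t for e, t in zip(row_e, row_t)]
        for row_e, row_t in zip(d_expr, d_text)
    ]
```

Because the two matrices usually live on different scales, the same parameter also acts as a scaling constant between them, which is the point made in the last bullet.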

Page 32:

• However, the distributions of distances invoke a bias: scaling problem

• Therefore, use a technique from statistical meta-analysis (so-called omnibus procedure)

Figure: histograms of the expression distances and of the text distances.
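The slide names only an "omnibus procedure". One classical instance is Fisher's method for combining p-values, and the sketch below assumes that choice: each distance is first turned into a rank-based p-value within its own space, which removes the scale difference, and the two p-values are then combined. The function names are hypothetical.

```python
import math
from bisect import bisect_right

def empirical_cdf(values):
    """Return a function mapping a distance to its rank-based p-value in (0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    # Clamp to 1/n so that the logarithm below is always defined.
    return lambda d: max(bisect_right(ordered, d), 1) / n

def fisher_combine(p1, p2):
    """Fisher's omnibus statistic for two p-values, mapped back to a p-value.

    S = -2 (ln p1 + ln p2) is chi-square with 4 degrees of freedom under
    independence; its survival function is exp(-S/2) * (1 + S/2).
    """
    half_s = -(math.log(p1) + math.log(p2))
    return math.exp(-half_s) * (1.0 + half_s)

def integrate(expr_dists, text_dists):
    """Combine paired gene-gene distances from both spaces into one score each."""
    cdf_e = empirical_cdf(expr_dists)
    cdf_t = empirical_cdf(text_dists)
    return [fisher_combine(cdf_e(e), cdf_t(t)) for e, t in zip(expr_dists, text_dists)]
```

The combined value is small only when a gene pair is close in both spaces, so it can serve as an integrated distance that is insensitive to the raw scale of either histogram.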

Page 33:

Figure: M-score for integrated clustering versus M-score for expression data only, at various cutoffs k of the cluster tree.

Optimal k ?

Page 34:

A peek inside

Page 35:

A peek inside

Expression Profile

Text Profile

Strong reinforcement

Page 36:

Case 2: Sciento- & technometrics

Page 37:

Mapping of Science

• Journal ‘Scientometrics’

• Full-text articles

• Document cluster analysis

• Co-word mapping

• Temporal dimension: clusters over time
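Co-word mapping starts from counts of how often two terms occur in the same document. A minimal sketch, assuming each document has already been reduced to a list of index terms (the function name `coword_counts` is hypothetical):

```python
from collections import Counter
from itertools import combinations

def coword_counts(doc_terms):
    """Count, for each unordered term pair, the documents containing both."""
    counts = Counter()
    for terms in doc_terms:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts
```

Running the same count separately per publication period would give one co-word map per time slice, which is one simple way to add the temporal dimension.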

Page 38:

Mapping of Science

• Coupling with bibliometric indicators:
  – Based on reference (hyperlink) information
  – Mean reference age
  – Nr. serials
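The two reference-based indicators can be illustrated as follows. The definitions are assumptions: reference age is taken as publication year minus cited year, and "Nr. serials" is read here as the number of distinct serial titles among the cited sources, which may differ from the definition used in the talk. Both function names are hypothetical.

```python
def mean_reference_age(publication_year, reference_years):
    """Average age, in years, of the cited references at publication time."""
    if not reference_years:
        raise ValueError("no references")
    ages = [publication_year - year for year in reference_years]
    return sum(ages) / len(ages)

def distinct_serials(reference_sources):
    """Number of distinct serial titles among the cited sources (case-insensitive)."""
    return len({source.strip().lower() for source in reference_sources})
```

Averaging these values over the documents of a cluster gives one indicator value per cluster, which can then be set next to the cluster's term profile.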

Page 39:

Domain studies in Patent space

30 technology classes

‘Seed’ patent

Similarities

Page 40:

User profiling & Author-Inventor linkage

• Name resolution
  – Same persons (variants, mistakes)
  – Different persons (similar initials, or even full name)

Van Veldhoven vs. Veldhoven, Van

Wim Van Veldhoven vs. Walter Van Veldhoven

Wim Van Veldhoven vs. Wim Van Veldhoven

Vanveldhoven vs. Van Veldhoven
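The string side of name resolution can be sketched with the Python standard library. This is an illustration only, with hypothetical function names; it normalizes case, accents, punctuation, token order and spacing, then compares the remaining characters.

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalize_name(name):
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    return " ".join(sorted(tokens))

def name_similarity(a, b):
    """Character-level similarity in [0, 1], ignoring spacing and token order."""
    na = normalize_name(a).replace(" ", "")
    nb = normalize_name(b).replace(" ", "")
    return SequenceMatcher(None, na, nb).ratio()
```

Such a measure maps "Van Veldhoven", "Veldhoven, Van" and "Vanveldhoven" onto each other, but it cannot tell two different people named Wim Van Veldhoven apart, which is why the content of their documents is needed as additional evidence.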

Page 41:

Content-based name matching

• Detect spillovers and entrepreneurial activities at (e.g.) university-level

• Matching of ‘inventors’ & ‘authors’ is time-consuming, hence a semi-automated approach:

Patent DB, Publication DB

Relevance ranking
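The relevance-ranking step could work roughly as in the sketch below: for a patent by a given inventor, candidate author records from the publication database are ordered by the textual similarity of their publications to the patent. The plain term-frequency cosine, the record layout `(author_id, text)`, and the function names are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bag(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two Counters of term counts."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(patent_text, publications):
    """Return (score, author_id) pairs, most similar candidate first."""
    query = bag(patent_text)
    scored = [(cosine(query, bag(text)), author_id) for author_id, text in publications]
    return sorted(scored, reverse=True)
```

A human expert would then only need to confirm or reject the top-ranked candidates, which is what makes the procedure semi-automated rather than fully manual.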

Page 42:

Acknowledgements

Steunpunt O&O Statistieken: Debackere K, Glänzel W

ESAT / BioI / Text Mining: Coessens B, Van Vooren S, Janssens F, Van Dromme D

ESAT / BioI: Moreau Y, De Moor B

Page 43:

Thanks !?

?

CONTACT INFO:

[email protected]