WP 10 Multilingual Access Philipp Daumke, Stefan Schulz
Page 1:

WP 10 Multilingual Access

Philipp Daumke, Stefan Schulz

Page 2:

Multilingual Access - Rationale

English as First Language

English as Second Language

No English Language Skills

English as a Foreign Language

• < 70 % of the world's scientists read in English

• 80 % of the world's electronically stored information is in English

• 90 % English articles in Medline (2000)

Sources: The British Council, 2005; Fung ICH: Open access for the non-English-speaking world: overcoming the language barrier. Emerging Themes in Epidemiology, 2008

Page 3:

Non-native speakers

• Broad range of command of English

• Reading skills > writing skills

• Reduced active vocabulary

Difficulty in formulating precise queries

English as Second Language

English as a Foreign Language

Page 4:

Korrelation von Hypertonie und Läsion der Weißen Substanz…

“Correlation of high blood pressure and lesion of the white substance”

Cross-language document retrieval example

Page 7:

BootStrep WP 10 - Multilingual access

• Objectives:

– To provide a multilingual search interface to the BootStrep Biolexicon / Bioontology

– We do NOT propose to deliver a multilingual extension of the BootStrep biolexicon

• Query Languages: French, German, English, (Italian)

• Output language: English

• Method: Subword-based semantic indexing

• Resources:

– MorphoSaurus multilingual subword lexicon & thesaurus

– MorphoSaurus Semantic Indexer

Page 8:

Technique: Morphosemantic Indexing

• Subword-based, multilingual semantic indexing for document retrieval

• Subwords are atomic, conceptual or linguistic units:

– Stems: stomach, gastr, diaphys

– Prefixes: anti-, bi-, hyper-

– Suffixes: -ary, -ion, -itis

– Infixes: -o-, -s-

• Equivalence classes contain synonymous subwords and their translations:

– #derma = { derm, cutis, skin, haut, kutis, pele, cutis, piel, … }

– #inflamm = { inflamm, -itic, -itis, -phlog, entzuend, -itis, -itisch, inflam, flog, inflam, flog, ... }
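
A minimal sketch of how such equivalence classes could be held in memory, assuming a plain Python dictionary. The class contents are copied from the two examples above (affix hyphens dropped, duplicates merged). The names EQUIVALENCE_CLASSES and build_subword_index are illustrative only and are not part of MorphoSaurus; the storage format of the real thesaurus is not described in these slides.

```python
# Toy equivalence classes taken from the two examples on this slide.
# Assumption: affix hyphens are dropped and duplicate subwords merged.
EQUIVALENCE_CLASSES = {
    "#derma": ["derm", "cutis", "skin", "haut", "kutis", "pele", "piel"],
    "#inflamm": ["inflamm", "itic", "itis", "phlog", "entzuend",
                 "itisch", "inflam", "flog"],
}


def build_subword_index(classes):
    """Invert class -> subwords into subword -> class identifier."""
    index = {}
    for class_id, subwords in classes.items():
        for subword in subwords:
            index[subword] = class_id
    return index


SUBWORD_TO_CLASS = build_subword_index(EQUIVALENCE_CLASSES)

# Subwords from different languages resolve to the same class identifier.
print(SUBWORD_TO_CLASS["haut"])  # German
print(SUBWORD_TO_CLASS["skin"])  # English
```

Both lookups return the same identifier, which is what makes the class identifier usable as a language-independent index term.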

Page 9:

Segmentation:

Myo|kard|itis

Herz|muskel|entzünd|ung

Inflamm|ation of the heart muscle

Subword Thesaurus Structure

Eq Class: subwords

MUSCLE: muscle, myo, muskel, muscul

INFLAMM: inflamm, -itis, inflam, entzünd

HEART: herz, heart, card, corazon, card

Indexation:

#muscle #heart #inflamm

#heart #muscle #inflamm

#inflamm #heart #muscle

• Thesaurus: ~21,000 equivalence classes (MIDs)

• Lexicon entries:

– English: ~23,000

– German: ~24,000

– Portuguese: ~15,000

– Spanish: ~11,000

– French: ~8,000

– Swedish: ~10,000

– Italian: ~4,000
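
The segmentation and indexation steps on this slide can be sketched as follows. This is an illustration only: greedy longest-match segmentation is an assumption (the slides do not describe the actual MorphoSaurus segmentation algorithm), and the toy LEXICON is hypothetical, including the entry "kard" and the suffixes "ung" and "ation", which are treated here as carrying no meaning of their own.

```python
# Hypothetical toy lexicon: subword -> equivalence class (None = no own meaning).
LEXICON = {
    "myo": "#muscle", "muskel": "#muscle", "muscle": "#muscle", "muscul": "#muscle",
    "kard": "#heart", "card": "#heart", "herz": "#heart", "heart": "#heart",
    "itis": "#inflamm", "inflamm": "#inflamm", "inflam": "#inflamm",
    "entzünd": "#inflamm",
    "ung": None, "ation": None,
}


def segment(word):
    """Greedy longest-match segmentation.

    Returns the list of subwords, or [] if the word cannot be fully
    covered by lexicon entries (e.g. "of", "the" in this toy lexicon).
    """
    word = word.lower()
    subwords, pos = [], 0
    while pos < len(word):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(word), pos, -1):
            if word[pos:end] in LEXICON:
                subwords.append(word[pos:end])
                pos = end
                break
        else:
            return []  # no lexicon entry starts at this position
    return subwords


def index_text(text):
    """Map a text to the sequence of equivalence class identifiers."""
    class_ids = []
    for word in text.split():
        for subword in segment(word):
            class_id = LEXICON[subword]
            if class_id is not None:
                class_ids.append(class_id)
    return class_ids


for phrase in ("Myokarditis", "Herzmuskelentzündung",
               "Inflammation of the heart muscle"):
    print(segment(phrase.split()[0]), index_text(phrase))
```

The three phrases produce the same set of class identifiers in different order, which matches the three indexation lines shown above.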

Page 10:

Indexing Pipeline

Page 14:

Subword-based document transformation

Morphosemantic indexer

Page 15:

Subword-Based Search

Korrelation von Hypertonie und Läsion der Weißen Substanz…

#correl #hyper #tens #lesion #whit #matter

Page 16:

Subword-based query transformation

Korrelation von Hypertonie und Läsion der Weißen Substanz…

#correl #hyper #tens #lesion #whit #matter
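
Once the query and the documents are both expressed as class identifiers, they can be compared directly without translating either side. A sketch under assumptions: cosine similarity over identifier counts is used here only for illustration (the slides do not state the ranking function), and doc_a and doc_b are made-up, already-indexed documents.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of class identifiers."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Interlingual form of the German query, as shown on the slide.
query = "#correl #hyper #tens #lesion #whit #matter".split()

# Hypothetical documents in the same representation.
documents = {
    "doc_a": "#hyper #tens #whit #matter #lesion #risk".split(),
    "doc_b": "#gene #regul #bacter".split(),
}

ranking = sorted(documents,
                 key=lambda doc_id: cosine(query, documents[doc_id]),
                 reverse=True)
print(ranking)
```

Because matching happens on class identifiers, the language of the original query no longer matters at this stage.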

Page 17:

Adapting Morphosemantic Indexing to BootStrep

• BootStrep terminology mostly disjoint from existing clinical terminology

• Enhancement of data resources (e.g. for acronym resolution, multi-term equivalences)

• BootStrep Terms for multilingual access

– Gene Ontology, InterPro, IntAct, Gene Regulation Ontology, Species

• Medline subcorpus (about E. coli gene regulation)

Page 18:

Ongoing/Completed Tasks

• Manual training of MorphoSaurus lexica by means of the BootStrep corpora (en, de, fr)

• Multilingual Terminology Browser

– 2268 GO terms + translations

– 6925 InterPro terms + translations

– 2082 IntAct terms + translations

– URL: http://www.medinf.uni-freiburg.de/demo/BootStrepBrowser/

• Multilingual Search Engine:

– Document collection: BootStrep-Medline subset

– Languages: English, German, French

– Query modes: Author, Title, Title + keywords, All

Page 19:

Terminology Browser

Search Results

Further Information

Navigation

Page 20:

Terminology Browser

Page 21:

Multilingual Search Engine

Page 22:

To do: Tools and Resources

• BootStrep-Browser

– Integration of Species

– Integration of the Gene Regulation Ontology

• Multilingual Search Engine

– Multilingual treatment of acronyms

– Inclusion of species synonym list

– Dealing with mixed queries (German-English, English-French)

– Integration with the fact store

• Continue lexicon population

– Italian terms?

Page 23:

To do: Evaluation

• Creation of a gold standard

– Typical English queries

– Find all relevant documents in the E. coli subset

• CLIR experiments

– Translate queries to French and German

– Compare mean average precision

• Reuse of already existing routines on standard benchmarks (OHSUMED, ImageCLEF)
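
For reference, mean average precision over a set of queries can be computed as sketched below. This uses the standard definition (average of the precision values at each rank where a relevant document appears, divided by the number of relevant documents, then averaged over queries); the function names and the two example runs are illustrative and do not come from the project.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked result list against a gold standard."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant)
               for ranked, relevant in runs) / len(runs)


# Made-up example: two queries with their ranked results and gold standards.
example_runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d2", "d1"], {"d1"}),
]
print(mean_average_precision(example_runs))
```

Running the same gold standard against the English, French and German versions of each query gives one such value per language, which can then be compared.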

Page 24:

ImageCLEFMed Benchmark

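[Figure: bar chart of Top 20 Average Precision, shown as percent of baseline (axis from 0 to 100), by language and condition. Languages: EN (English), DE (German), PT (Portuguese), SP (Spanish), FR (French), SV (Swedish), AV (Average). Conditions: Query Translation, Morphosaurus, Morphosaurus+D.]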

• Baseline: monolingual

– Stemmed English queries

– Stemmed English texts

• Query translation

– Google translator

– Multilingual dictionary compiled from UMLS

• Morphosemantic Indexing

– Interlingual representation of user queries and documents

• Morphosemantic Indexing

– incorporating disambiguation module
