
Visualization of Relational Text Information for Biomedical Knowledge Discovery James W. Cooper IBM T J Watson Research Center Hawthorne, NY.

Jan 19, 2016

Page 1: Visualization of Relational Text Information for Biomedical Knowledge Discovery

James W. Cooper

IBM T J Watson Research Center

Hawthorne, NY

Page 2: Overview

Prior work
Java-based text mining
Computation of unnamed relations
Graphical display of relations


Page 3: Relations between terms

Noun phrase co-occurrence statistics [Roark, Charniak]
Choose seed words and look for terms near them. [Brin] [Gravano, Agichtein]
– Repeat
Biomedical domain
– Blaschke used a dictionary of common verbs
– Pustejovsky found inhibit relations
Stevens, Palakal, Mostafa
– Detected abstract-wide co-occurrence using a dictionary of genes and useful verbs.

Page 4: Graphical Displays

BioLayout – protein similarity
ProtInAct – interactive system using yFiles
Zhang – interactive 3D system
Jenssen – gene network
Leroy – GeneScene

Page 5: BioLayout – Enright and Ouzounis

Spheres represent proteins and lines represent protein similarities.

Five related protein families and their corresponding relationships.

Page 6: ProInAct – Spencer and Bennett

Proteins clustered by functional interaction

Page 7: Zhang – Protein interaction mapping

Page 8: Jenssen – A literature network

Lines connect genes that have co-occurred in 1 or more papers.

Page 9: Leroy – GeneScene

Page 10: What would we like to do?

Find scientifically meaningful connections between important terms.
– Such as Swanson's Raynaud's disease – fish oil connection.
Allow exploration of relations by the user.
Filter the relations by ontology or term types.
Perform path analysis.
Let the user vary the graphical display.
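
Path analysis over a set of relations can be sketched as a breadth-first search on an undirected graph whose nodes are terms. This is only an illustration under stated assumptions, not the method used in the system: the class `RelationGraph` and its method names are hypothetical, relations are treated as unweighted, and "path" is taken to mean the chain with the fewest steps.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: terms are nodes, relations are undirected edges. */
final class RelationGraph {
    private final Map<String, Set<String>> neighbours = new HashMap<>();

    /** Records a relation between two terms in both directions. */
    void addRelation(String a, String b) {
        neighbours.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    /** Shortest chain of relations from start to goal, or an empty list if none exists. */
    List<String> shortestPath(String start, String goal) {
        // cameFrom doubles as the visited set; the start term maps to null.
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                // Walk the predecessor links back to the start term.
                LinkedList<String> path = new LinkedList<>();
                for (String t = goal; t != null; t = cameFrom.get(t)) {
                    path.addFirst(t);
                }
                return path;
            }
            Set<String> next = neighbours.getOrDefault(current, Collections.<String>emptySet());
            for (String n : next) {
                if (!cameFrom.containsKey(n)) {
                    cameFrom.put(n, current);
                    queue.add(n);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

With toy relations "Raynaud's disease – blood viscosity" and "blood viscosity – fish oil" added, `shortestPath("Raynaud's disease", "fish oil")` returns the three-term chain through "blood viscosity", which is the kind of second-order connection Swanson described.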

Page 11: Data we analyzed

Two sets of patent data
– 584 patents on Viagra and phosphodiesterase inhibitors.
– 1514 patents on quinolones (like Cipro)

Recognized major technical terms in each patent.

Filtered organic chemical nomenclature.

Page 12: The Talent text mining system

Text Analysis and Language Engineering Tools
– Finds multiword noun phrases
– Does shallow parse
– Can extract NPs and VGs, as well as all other sentence parts

Page 13: The JTalent Library

Java class library with JNI interface
– To Talent DLL
Creates database load files of terms
– Paragraph
– Sentence
– Offset
– Term type (NP, VG)

Page 14: TalentShow Demo

Page 15: The KSS Library

Java class library of functions for
– Accessing a database (DB2, Access)
– Manipulating a search engine
– Manipulating tables of information created by JTalent.

Page 16: Database Tables

Documents
– Title, author, URL, ID
TermDocs
– Term
– Paragraph
– Sentence
– Offset
– Type
Dictionary of terms, types and IDs
– Such as MeSH

Page 17: Computing term information

Compute unique terms from TermDocs
Compute frequency
Compute salience
– Based on frequency
– Number of docs they appear in more than once
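
The two inputs to salience named above, total frequency and the number of documents in which a term appears more than once, can be sketched as follows. The slides do not give the formula that combines them, so this sketch stops at the two counts. The class `TermStats`, the `Occurrence` type, and the method names are hypothetical, and each occurrence is assumed to carry only a term and a document ID.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the per-term counts that feed a salience score. */
final class TermStats {
    /** One term occurrence: the term text and the document it was found in. */
    static final class Occurrence {
        final String term;
        final int docId;

        Occurrence(String term, int docId) {
            this.term = term;
            this.docId = docId;
        }
    }

    /** Total number of occurrences of each unique term. */
    static Map<String, Integer> frequency(List<Occurrence> rows) {
        Map<String, Integer> freq = new HashMap<>();
        for (Occurrence o : rows) {
            freq.merge(o.term, 1, Integer::sum);
        }
        return freq;
    }

    /** For each term, the number of documents in which it appears more than once. */
    static Map<String, Integer> repeatedDocCount(List<Occurrence> rows) {
        // term -> (document ID -> occurrences of the term in that document)
        Map<String, Map<Integer, Integer>> perDoc = new HashMap<>();
        for (Occurrence o : rows) {
            perDoc.computeIfAbsent(o.term, k -> new HashMap<>())
                  .merge(o.docId, 1, Integer::sum);
        }
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : perDoc.entrySet()) {
            int docs = 0;
            for (int count : e.getValue().values()) {
                if (count > 1) {
                    docs++;
                }
            }
            result.put(e.getKey(), docs);
        }
        return result;
    }
}
```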

Page 18: Compute term relations

Named relations based on abbreviation expansions.

Unnamed relations based on proximity, with weight based on how frequently they occur near each other.

Mutual information weight:

$$m = \log \frac{\mathrm{paircount} \cdot \mathrm{totalterms}}{\mathrm{freq}_1 \cdot \mathrm{freq}_2}$$
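
As a rough illustration, the weight can be computed as below, assuming it has the usual pointwise mutual information form m = log((paircount × totalterms) / (freq1 × freq2)) with a natural logarithm. The class and method names are hypothetical, and the base of the logarithm is an assumption; the slide does not state it.

```java
/** Hypothetical sketch of the mutual information weight for an unnamed relation. */
final class MutualInformation {
    private MutualInformation() {}

    /**
     * Returns log((pairCount * totalTerms) / (freq1 * freq2)), natural log.
     *
     * @param pairCount  times the two terms occur near each other
     * @param totalTerms total number of term occurrences in the collection
     * @param freq1      frequency of the first term
     * @param freq2      frequency of the second term
     */
    static double weight(long pairCount, long totalTerms, long freq1, long freq2) {
        if (pairCount <= 0 || totalTerms <= 0 || freq1 <= 0 || freq2 <= 0) {
            throw new IllegalArgumentException("all counts must be positive");
        }
        // Sum logs instead of multiplying counts so large values cannot overflow.
        return Math.log(pairCount) + Math.log(totalTerms)
                - Math.log(freq1) - Math.log(freq2);
    }
}
```

A pair that co-occurs exactly as often as chance would predict gets a weight of zero; pairs that co-occur more often than that get a positive weight.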

Page 19: Tuning computed relations

Select only terms above a salience threshold.

Keep only relations in which one or both terms are members of an ontology.

Store relations in a database table for rapid access:

Term | weight | term
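
The two tuning steps can be sketched as a filter over rows of the term | weight | term table. This is a minimal sketch under assumptions: the class `RelationFilter`, the `Row` type, and the parameter names are hypothetical, salience is assumed to be available as a per-term number, and the threshold is applied to both terms of a relation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of salience and ontology filtering of computed relations. */
final class RelationFilter {
    /** One row of the relations table: term | weight | term. */
    static final class Row {
        final String term1;
        final double weight;
        final String term2;

        Row(String term1, double weight, String term2) {
            this.term1 = term1;
            this.weight = weight;
            this.term2 = term2;
        }
    }

    /**
     * Keeps rows whose terms are both strictly above the salience threshold
     * and in which at least one term belongs to the ontology.
     */
    static List<Row> tune(List<Row> rows, Map<String, Double> salience,
                          double minSalience, Set<String> ontology) {
        List<Row> kept = new ArrayList<>();
        for (Row r : rows) {
            // Terms with no recorded salience are treated as having salience 0.
            boolean salient = salience.getOrDefault(r.term1, 0.0) > minSalience
                    && salience.getOrDefault(r.term2, 0.0) > minSalience;
            boolean inOntology = ontology.contains(r.term1) || ontology.contains(r.term2);
            if (salient && inOntology) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```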

Page 20: Original System

Visual client
SOAP server
– Queries database to get relations
– Round trip for each new query

Instead, we export the data for the user to visualize as they wish.

Page 21: Exporting relations

Save relations and ontology information in an XML file.

<relation>
  <term>
    <iq>78</iq>
    <source>MeSH</source>
    <relationDocuments>
      <doc>34</doc>
    </relationDocuments>
  </term>
  <term> </term>
</relation>

This XML file is a portable version of the computed relations that we can then use with any number of viewers.
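
A viewer could read such a file with the standard Java DOM parser. The sketch below is an assumption-laden illustration rather than the actual viewer code: it assumes the `<relation>` elements sit under a single root element (named `<relations>` here, which is hypothetical), and it extracts only the `<doc>` identifiers, since the sample above is abbreviated and does not show where the term text or weight is stored. The class and method names are also hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical sketch: reads document IDs from an exported relations file. */
final class RelationXmlReader {
    private RelationXmlReader() {}

    /** Returns, for each relation element, the IDs found in its doc elements. */
    static List<List<Integer>> readDocIds(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse relations XML", e);
        }
        List<List<Integer>> result = new ArrayList<>();
        NodeList relations = dom.getElementsByTagName("relation");
        for (int i = 0; i < relations.getLength(); i++) {
            Element relation = (Element) relations.item(i);
            // getElementsByTagName searches all descendants of this relation.
            NodeList docs = relation.getElementsByTagName("doc");
            List<Integer> ids = new ArrayList<>();
            for (int j = 0; j < docs.getLength(); j++) {
                ids.add(Integer.parseInt(docs.item(j).getTextContent().trim()));
            }
            result.add(ids);
        }
        return result;
    }
}
```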

Page 22: A Graphical Relations Viewer

Creates a Java Relations object for each relation it reads from the XML file.
Inserts them into a Trie structure based on the lower-cased first term.
– If there is already a Relation at that point, it adds them to a Vector for that term.
Creates an alphabetical list of all terms in a 2nd Trie.
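
The trie described above can be sketched roughly as follows. This is not the viewer's actual code: the class `RelationTrie` and its methods are hypothetical, a relation is reduced to a plain string for brevity, an `ArrayList` stands in for the `Vector` mentioned on the slide, and a single trie serves both exact lookup and the alphabetical prefix listing instead of two separate tries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a trie keyed by the lower-cased first term of each relation. */
final class RelationTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so traversal yields terms alphabetically.
        final TreeMap<Character, Node> children = new TreeMap<>();
        // Relations whose first term ends at this node.
        final List<String> relations = new ArrayList<>();
    }

    private final Node root = new Node();

    /** Adds a relation under its first term; repeated terms accumulate in one list. */
    void insert(String firstTerm, String relation) {
        Node node = root;
        for (char c : firstTerm.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.relations.add(relation);
    }

    /** All relations stored for a term, ignoring case; empty if the term is unknown. */
    List<String> relationsFor(String term) {
        Node node = find(term.toLowerCase(Locale.ROOT));
        return node == null ? new ArrayList<String>() : new ArrayList<>(node.relations);
    }

    /** Alphabetical list of stored terms that start with the given fragment. */
    List<String> termsWithPrefix(String prefix) {
        String key = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        Node start = find(key);
        if (start != null) {
            collect(start, new StringBuilder(key), out);
        }
        return out;
    }

    private Node find(String key) {
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (!node.relations.isEmpty()) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack before the next sibling
        }
    }
}
```

A prefix lookup of this kind is what a term list box needs when the user types a fragment, and `relationsFor` supplies the relations to show when a term is selected.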

Page 23: Using the Viewer

When you enter part of a term, it shows all terms starting with that fragment in the left list box.

When you click on a term, it shows all its relations in the right list box.

Page 24: Lexical Navigation

Displays relations between terms graphically and allows you to explore them without formulating a specific query.

Page 25: Possible enhancements

Show only terms belonging to an ontology.
Show only higher IQ terms.
Show the documents the relations occur in.
Show the ontology reference.
Show computed paths.
Show more kinds of named relations.
– Inhibits, expresses

Page 26: Evaluations of Information Visualization

Few, if any, graphical displays have been evaluated thus far for effectiveness.
Usability studies are hard to construct and carry out.
Intuition seems to show
– that exploration may result in discoveries.
– that relations more than one step apart are best displayed graphically.
It remains to be shown that such visualizations are actually useful.

Page 27: Differences in Intent

Displays may represent information your system has discovered.
– Gene – protein relations

Or they may represent data from which the user may discover new information.
– New 2nd or 3rd order relationships

These are rather different applications of visualization technology.

Page 28: Summary

Java-based text mining system
Database of terms and positions
Computation of relations
Export as XML
Graphical relations viewer
The value of such visual interfaces has not yet been established.

Page 29: Acknowledgements

Bhavani Iyer – XML export
Eric Brown – DictMatcher hash code
Daniel Tunkelang – graphical layout
Bob Mack – paper suggestions