Clustering of yeast genes using Literature Mining CS 6910 Project Advisor: Dr.Venu Dasigi Student: Maitreyee Bhise

Jan 15, 2017

Page 1: Clustering of yeast genes using Literature Mining

Clustering of yeast genes using Literature Mining

CS 6910 Project

Advisor: Dr. Venu Dasigi

Student: Maitreyee Bhise

Page 2: Clustering of yeast genes using Literature Mining

Contents

• Text mining

• Overview of the project

• Weight Metrics

• Ohio Supercomputer (MEDLINE database)

• Implementation

• Results and Analysis

• Limitations

• Future Scope

Page 3: Clustering of yeast genes using Literature Mining

What is text mining?

• Provides a mechanism to handle large amounts of data

• Integrates all sources and adds meaning to them

Page 4: Clustering of yeast genes using Literature Mining

Background Set

• Reference set against which the query set is compared

• Dictionary of documents and their respective words of importance

• Restricted Background set (used in this project)

• Unrestricted Background set

• Restricted Background set: Only those documents that satisfy a condition

• Unrestricted Background set: All the documents

Page 5: Clustering of yeast genes using Literature Mining

Query Set and Background Set used

• Query Set: 44 tables, one for each gene

• Each table holds the abstracts that contain the gene name in the title

• Restricted Background set: union of the 44 gene tables

• Background set is a table with all keywords from the 44 tables

Page 6: Clustering of yeast genes using Literature Mining

Overview

• Restricted background set created using MEDLINE (collection of biological documents)

• Used to compare frequency of a word in query set against background set

• Keywords for 44 genes are extracted, analyzed and clustered using weight metrics

• Frequency of each keyword in a query set is compared with background set

• Gene characteristics are discovered on the basis of computed weights

Page 7: Clustering of yeast genes using Literature Mining

Overview (cont'd)

• The entire MEDLINE collection had been preprocessed earlier, and that preprocessed data was used in this work

• Project uses a different background set

• Implementation on Ohio Supercomputer

Page 8: Clustering of yeast genes using Literature Mining

MEDLINE

• MEDLINE is a collection of biological documents provided by National Library of Medicine (NLM)

• Consists of 23 million abstracts

• Data provided in XML format

• XML data parsed and loaded onto the Ohio Supercomputer

• Oakley is the newly built server, an HP Intel Xeon machine with 8,300+ cores

Page 9: Clustering of yeast genes using Literature Mining

Portable Batch Script

• #PBS -l walltime=hh:mm:ss (running time required for the job)

• #PBS -l nodes=2:ppn=12 (number of nodes and processors per node required)

• #PBS -m abe (emails the user when the job aborts/begins/ends)

• To submit the batch job: qsub Job_Name

Page 10: Clustering of yeast genes using Literature Mining

MEDLINE Table Structure

• 3 Tables:

  • N_Word

  • N_Document

  • N_WordDocument

• This project uses only N_Document and N_WordDocument tables

Page 11: Clustering of yeast genes using Literature Mining

MEDLINE Table Structure (cont'd)

Page 12: Clustering of yeast genes using Literature Mining

Weight Metrics

• Two statistical parameters:

  • Z-Score

  • TF-IDF (Term Frequency-Inverse Document Frequency)

• Calculates the frequency of a keyword in the query set in comparison with some reference set

• Helps to discriminate high-information-content words of a document

Page 13: Clustering of yeast genes using Literature Mining

Stop Words Removal

• Words that are commonly used but don't add meaning

• Used the stop word list provided by PubMed

• Separate table created to store the stop word list

• Stop words are removed using a simple join with this table
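
The join-based removal described above can be sketched as an anti-join. This is only an illustration using an in-memory SQLite database; the table names `word_document` and `stop_word`, their columns, and the sample rows are hypothetical and not taken from the project's schema.

```python
import sqlite3

def remove_stop_words(rows, stop_words):
    """Return the (word, doc_id) rows whose word is not in the stop word table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    conn.execute("CREATE TABLE word_document (word TEXT, doc_id INTEGER)")
    conn.execute("CREATE TABLE stop_word (word TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO word_document VALUES (?, ?)", rows)
    conn.executemany("INSERT INTO stop_word VALUES (?)",
                     [(w,) for w in stop_words])
    # LEFT JOIN keeps every keyword row; rows without a match in stop_word
    # have s.word IS NULL, and only those rows are kept (an anti-join).
    kept = conn.execute(
        """
        SELECT wd.word, wd.doc_id
        FROM word_document AS wd
        LEFT JOIN stop_word AS s ON wd.word = s.word
        WHERE s.word IS NULL
        ORDER BY wd.doc_id, wd.word
        """
    ).fetchall()
    conn.close()
    return kept

# Made-up sample data.
rows = [("the", 1), ("chitin", 1), ("of", 2), ("septum", 2)]
filtered = remove_stop_words(rows, ["the", "of", "and"])
```

Here `filtered` holds only the rows for "chitin" and "septum".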

Page 14: Clustering of yeast genes using Literature Mining

Z-Score

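• Z-Score of a word ‘a’ in a gene ‘n’:

$$Z_a^n = \frac{F_a^n - \bar{F}_a}{\sigma_a}$$

where

$$\sigma_a = \sqrt{\frac{\sum_{n=1}^{N} \left(F_a^n - \bar{F}_a\right)^2}{N}}, \qquad \bar{F}_a = \frac{1}{N}\sum_{n=1}^{N} F_a^n$$

$F_a^n$ = document frequency of word a for gene n

$\bar{F}_a$ = mean frequency of word a

$\sum$ = sum over all genes

$\sigma_a$ = standard deviation of word a

$N$ = number of genes (or set of groups of documents)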
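
As a rough sketch of the z-score calculation (document frequency minus mean frequency, divided by the standard deviation across genes), the function below works on one word at a time. It assumes the population standard deviation (division by N); the function name and the sample frequencies are made up for illustration.

```python
import math

def z_scores(doc_freq):
    """doc_freq: one document frequency per gene, all for the same word.

    Returns the z-score of that word for each gene.
    """
    n = len(doc_freq)
    mean = sum(doc_freq) / n
    # Assumption: population standard deviation (divide by N, not N - 1).
    sigma = math.sqrt(sum((f - mean) ** 2 for f in doc_freq) / n)
    if sigma == 0:
        # The word is equally frequent in every gene, so it discriminates nothing.
        return [0.0] * n
    return [(f - mean) / sigma for f in doc_freq]

# Made-up document frequencies of one word across five genes.
scores = z_scores([4, 0, 0, 0, 2])
```

For this input the mean is 1.2 and the standard deviation is 1.6, so the first gene scores 1.75 and the genes with frequency 0 score -0.75.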

Page 15: Clustering of yeast genes using Literature Mining

Z-Score (cont'd) (Document Frequency calculation)

• Captures the strength of a keyword in a collection set

• Emphasis on distribution of the keyword across genes

• It is the number of documents that contain the word

• Calculated with respect to a gene (or related group of documents)
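
A minimal sketch of the document frequency count for one gene, assuming each abstract has already been tokenized into a list of words. The function name and the toy abstracts are hypothetical.

```python
from collections import Counter

def document_frequency(abstracts):
    """abstracts: list of token lists, one per abstract of a single gene.

    Returns, for each word, the number of abstracts that contain it.
    """
    df = Counter()
    for tokens in abstracts:
        # set() ensures a word is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(tokens))
    return df

# Toy abstracts for a single gene (made-up data).
gene_docs = [["cell", "wall", "cell"], ["wall", "protein"], ["cell"]]
df = document_frequency(gene_docs)
```

Note that "cell" appears three times in total but in only two abstracts, so its document frequency is 2.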

Page 16: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont'd) (Document Frequency calculation)

Page 17: Clustering of yeast genes using Literature Mining

Z-Score for Cwp1 (cont'd) (Mean Frequency calculation)

• Sum of the document frequencies of a keyword across all genes, divided by the total number of genes.

Page 18: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont’d) (Numerator calculation)

Page 19: Clustering of yeast genes using Literature Mining

Sum of squares of the numerator for gene Cwp1

Page 20: Clustering of yeast genes using Literature Mining

Z-Score (cont’d) (Standard Deviation Calculation)

• Measures the deviation from the mean

• The lower the standard deviation, the more likely the word is a high-information-content word

• For a lower standard deviation, the z-score will be high

Page 21: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont’d) (Standard Deviation Calculation)

Page 22: Clustering of yeast genes using Literature Mining

Z-Score for gene Cwp1 (cont’d)

Page 23: Clustering of yeast genes using Literature Mining

TF-IGF Calculations

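• TF-IGF of a word ‘a’ in gene ‘n’:

$$\mathrm{TFIGF}_a^n = \mathrm{TF}_a^n \cdot \mathrm{IGF}_a$$

where

$$\mathrm{TF}_a^n = \sum_{d \in n} tf_a^d, \qquad \mathrm{IGF}_a = \log\frac{N}{GF_a}$$

$tf_a^d$ is the frequency of word ‘a’ in document d of gene n, $GF_a$ is the number of genes that contain the word ‘a’, and $N$ is the number of genes

• Emphasis on importance of a keyword within a gene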
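
The TF-IGF weighting can be sketched as follows: term frequency is summed over all abstracts of a gene and multiplied by the log of the inverse group frequency. This is a sketch under assumptions: the inverse group frequency is taken as log(N / GF) with a natural logarithm, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def tf_igf(gene_docs):
    """gene_docs: {gene: [token list per abstract]}.

    Returns {gene: {word: TF-IGF weight}}.
    """
    n_genes = len(gene_docs)
    # TF: occurrences of each word summed over all abstracts of the gene.
    tf = {gene: Counter(tok for doc in docs for tok in doc)
          for gene, docs in gene_docs.items()}
    # GF: number of genes whose abstracts contain the word at least once.
    gf = Counter()
    for counts in tf.values():
        gf.update(counts.keys())
    # Assumption: IGF = log(N / GF), natural logarithm.
    return {gene: {w: c * math.log(n_genes / gf[w]) for w, c in counts.items()}
            for gene, counts in tf.items()}

# Toy corpus with two genes (made-up data).
weights = tf_igf({
    "geneA": [["wall", "wall", "cell"]],
    "geneB": [["cell", "cycle"]],
})
```

A word that occurs in every gene ("cell" here) gets weight 0, which is how this weighting filters out keywords shared by all genes.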

Page 24: Clustering of yeast genes using Literature Mining

TF-IGF for gene “Cwp1” (cont’d) (Term Frequency Calculations)

Page 25: Clustering of yeast genes using Literature Mining

TF-IGF (cont’d) (Group Frequency Calculations for gene “Cwp1”)

Page 26: Clustering of yeast genes using Literature Mining

TF-IGF (cont’d) (Inverse Group Frequency Calculations for gene “Cwp1”)

Page 27: Clustering of yeast genes using Literature Mining

TF-IGF for gene “Cwp1” (cont’d)

Page 28: Clustering of yeast genes using Literature Mining

Results

• Determine high-information-content words for a gene

• The higher the z-score of a keyword in a gene, the more uniquely it describes the functionality of the gene

• The higher the TF-IGF value of a keyword in a gene, the more unique it is to that gene compared to other genes

• TF-IGF yields better quality keywords, as it filters out unwanted keywords

Page 29: Clustering of yeast genes using Literature Mining

Top keywords for gene “Cwp1” using z-score

• Irrespective of document frequency, the top 75 of 1612 keywords share the same high z-score of 6.245

• The top 75 keywords are unique to Cwp1
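
A short derivation suggests why keywords unique to one gene end up with the same z-score regardless of their document frequency. It assumes the z-score is the document frequency minus the mean, divided by the population standard deviation over $N$ genes. Suppose a word has document frequency $f$ in one gene and 0 in the other $N-1$ genes:

$$\bar{F} = \frac{f}{N}, \qquad \sigma^2 = \frac{1}{N}\left[\left(f - \frac{f}{N}\right)^2 + (N-1)\left(\frac{f}{N}\right)^2\right] = \frac{f^2 (N-1)}{N^2}$$

$$Z = \frac{f - \frac{f}{N}}{\sigma} = \frac{f (N-1)/N}{f \sqrt{N-1}/N} = \sqrt{N-1}$$

The frequency $f$ cancels, so every word unique to a single gene receives the same score, which depends only on $N$. For instance, $N = 40$ gives $\sqrt{39} \approx 6.245$; whether that is the $N$ used for these results is not stated in the slides.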

Page 30: Clustering of yeast genes using Literature Mining

Limitations

• Some types of parsing errors result in false positives

Page 31: Clustering of yeast genes using Literature Mining

Top keywords for gene “Cwp1” using TF-IGF

• Better quality keywords obtained from TF-IGF

Page 32: Clustering of yeast genes using Literature Mining

Cluster 3.0 Clustering Software

• Used Cluster 3.0 open source clustering software specially designed for gene expression data analysis

• Developed at Stanford University

• Can run on Windows/Mac/Linux

Page 33: Clustering of yeast genes using Literature Mining

Yeast Genes Grouped using Z-Score

Genes | Cluster
Gic2, Rad27, Dun1, Tel2, Cdc20, Far1 | 1
Cln1, Cln2, Cdc6 | 2
Gic1, Ace2, Mcm3 | 3
Exg1, Htb2, Cts1 | 4
Mcm2 | 5
Mnn1, Och1, Hho1, Mcm6 | 6
Msb2, Rsr1, Bud9, Kre6, Cwp1, Clb5, Clb6, Rnr1, Cdc21, Cdc45, Htb1, Hta1, Hta2, Hht1, Tem1 | 7
Rad51 | 8
Ste2 | 9

Page 34: Clustering of yeast genes using Literature Mining

Yeast Genes Grouped using TF-IGF

Genes | Cluster
Cdc20 | 1
Cln1, Cln2 | 2
Swi5, Ace2 | 3
Cdc6, Mcm3, Mcm6, Cdc46 | 4
Mcm2 | 5
Cdc45 | 6
Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1, Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Htb1, Htb2, Hta1, Hta2, Hho1, Hht1, Tel2, Tem1, Clb2, Cts1, Gic1, Gic2 | 7
Rad51 | 8
Ste2, Far1 | 9

Page 35: Clustering of yeast genes using Literature Mining

Analysis

• Z-Score is independent of document frequency only when the word is unique to the gene

• Examples: Cln1 gene and Cln2 gene

Page 36: Clustering of yeast genes using Literature Mining

Analysis (cont'd.)

• Words found with low frequency but with high z-score

Page 37: Clustering of yeast genes using Literature Mining

Analysis (cont'd.)

• Words with high frequency have low z-scores

Page 38: Clustering of yeast genes using Literature Mining

Future Scope

• A new background set can be used. Examples:

  • Abstracts that contain the gene name and its related words in the title

  • Unrestricted Background set

• Applying a stemming algorithm on the MEDLINE database

• Concepts of Latent Semantic Metrics can be applied by preserving the order of words

Page 39: Clustering of yeast genes using Literature Mining

Special Thanks To…

Dr. Venu Dasigi

Dr. Vipa Phuntumart

Dr. Ray Kresman

Pukar Hamal