Clustering of yeast genes using Literature Mining

CS 6910 Project

Advisor: Dr. Venu Dasigi
Student: Maitreyee Bhise

Contents

• Text mining
• Overview of the project
• Weight Metrics
• Ohio Supercomputer (MEDLINE database)
• Implementation
• Results and Analysis
• Limitations
• Future Scope

What is text mining?

• Provides a mechanism to handle large amounts of data

• Integrates all sources and adds meaning to them

Background Set

• Reference set against which the query set is compared
• Dictionary of documents and their respective words of importance
• Restricted background set (used in this project): only those documents that satisfy a condition
• Unrestricted background set: all the documents

Query Set and Background Set used

• Query set: 44 tables, one for each gene
• Each table holds the abstracts that contain the gene name in the title
• Restricted background set: union of the 44 gene tables
• The background set is a table with all keywords from the 44 tables
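The relationship between the per-gene query sets and the restricted background set can be sketched in Python. This is only an illustration under stated assumptions: the toy abstracts, the document ids, and the helper names `build_query_sets` and `restricted_background` are hypothetical and do not come from the project.

```python
from collections import defaultdict

# Hypothetical toy corpus: (document id, title, abstract). Not real MEDLINE data.
ABSTRACTS = [
    (1, "Cwp1 localizes to the cell wall", "cwp1 is a cell wall protein"),
    (2, "Role of Cln1 and Cln2 in G1", "cln1 cln2 cyclin drive start"),
    (3, "Unrelated membrane study", "lipid membrane transport"),
]


def build_query_sets(abstracts, genes):
    """Map each gene to the ids of abstracts whose title mentions the gene."""
    query_sets = defaultdict(set)
    for doc_id, title, _abstract in abstracts:
        title_words = title.lower().split()
        for gene in genes:
            if gene.lower() in title_words:
                query_sets[gene].add(doc_id)
    return dict(query_sets)


def restricted_background(query_sets):
    """Restricted background set: the union of all per-gene query sets."""
    background = set()
    for docs in query_sets.values():
        background |= docs
    return background


query_sets = build_query_sets(ABSTRACTS, ["Cwp1", "Cln1", "Cln2"])
background = restricted_background(query_sets)
```

Document 3 mentions no gene in its title, so it falls outside every query set and therefore outside the restricted background set.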

Overview

• Restricted background set created using MEDLINE (collection of biological documents)

• Used to compare frequency of a word in query set against background set

• Keywords for 44 genes extracted, analyzed, and clustered using weight metrics

• Frequency of each keyword in a query set is compared with background set

• Gene characteristics are discovered on the basis of computed weights

Overview (cont'd)

• The entire MEDLINE collection had been preprocessed in earlier work and is reused in this project

• This project uses a different background set
• Implementation on the Ohio Supercomputer

MEDLINE

• MEDLINE is a collection of biological documents provided by the National Library of Medicine (NLM)

• Consists of 23 million abstracts
• Data provided in XML format
• XML data parsed and loaded on the Ohio Supercomputer
• Oakley, the newly built cluster, is an HP machine with 8,300+ Intel Xeon cores

Portable Batch Script

• #PBS -l walltime=hh:mm:ss (running time required for the job)

• #PBS -l nodes=2:ppn=12 (number of nodes and processors per node required)

• #PBS -m abe (emails the user when the job aborts/begins/ends)

• To submit the batch job: qsub Job_Name

MEDLINE Table Structure

• 3 tables:
  • N_Word
  • N_Document
  • N_WordDocument

• This project uses only the N_Document and N_WordDocument tables

MEDLINE Table Structure (cont'd)

Weight Metrics

• Two statistical parameters:
  • Z-Score
  • TF-IDF (Term Frequency-Inverse Document Frequency)
• Calculate the frequency of a keyword in the query set in comparison with some reference set
• Help to discriminate the high-information content words of a document

Stop Words Removal

• Words that are commonly used but do not add meaning

• Used the stop word list provided by PubMed

• A separate table was created to store the stop word list

• Stop words are removed using a simple join with this table
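Removing stop words with a join can be sketched as an anti-join in SQL, run here through Python's built-in `sqlite3` module. The table names `gene_words` and `stop_words`, their columns, and the sample rows are assumptions made for illustration; the project's actual schema is not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: table and column names are illustrative only.
cur.execute("CREATE TABLE gene_words (doc_id INTEGER, word TEXT)")
cur.execute("CREATE TABLE stop_words (word TEXT PRIMARY KEY)")
cur.executemany(
    "INSERT INTO gene_words VALUES (?, ?)",
    [(1, "the"), (1, "cell"), (1, "wall"), (2, "of"), (2, "cyclin")],
)
cur.executemany("INSERT INTO stop_words VALUES (?)", [("the",), ("of",)])

# Anti-join: keep only the rows that have no match in the stop word table.
cur.execute(
    """
    SELECT g.doc_id, g.word
    FROM gene_words AS g
    LEFT JOIN stop_words AS s ON g.word = s.word
    WHERE s.word IS NULL
    ORDER BY g.doc_id, g.word
    """
)
kept = cur.fetchall()
conn.close()
```

Keeping the stop words in their own table means the list can be swapped or extended without touching the query.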

Z-Score

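• Z-score of a word ‘a’ in a gene ‘n’:

$$Z_a^n = \frac{F_a^n - \bar{F}_a}{\sigma_a}$$

where

$$\sigma_a = \sqrt{\frac{\sum_{m=1}^{N} \left(F_a^m - \bar{F}_a\right)^2}{N}}$$

$F_a^n$ = document frequency of word a for gene n
$\bar{F}_a$ = mean frequency of word a = (sum of $F_a^m$ for all genes) / N
$\sigma_a$ = standard deviation of word a
N = number of genes (or sets of grouped documents)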
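A minimal Python sketch of this computation follows. It assumes the z-score is the document frequency minus the mean, divided by a population standard deviation (dividing by N rather than N - 1). The function name `z_scores` and the sample frequencies are hypothetical.

```python
import math


def z_scores(doc_freq):
    """Compute per-gene z-scores for one word.

    doc_freq maps gene name -> document frequency of the word for that gene.
    Assumption: population standard deviation (divide by N).
    """
    n = len(doc_freq)
    mean = sum(doc_freq.values()) / n
    variance = sum((f - mean) ** 2 for f in doc_freq.values()) / n
    sigma = math.sqrt(variance)
    if sigma == 0:
        # The word is equally frequent in every gene: it carries no signal.
        return {gene: 0.0 for gene in doc_freq}
    return {gene: (f - mean) / sigma for gene, f in doc_freq.items()}


# Hypothetical frequencies: the word appears only in Cwp1 documents.
scores = z_scores({"Cwp1": 4, "Cln1": 0, "Cln2": 0, "Mcm2": 0, "Ste2": 0})
```

With these five genes the Cwp1 score comes out at about 2.0 and the others at about -0.5. The zero-variance guard is a design choice of this sketch, not something the slides specify.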

Z-Score (cont'd) (Document Frequency Calculation)

• Captures the strength of a keyword in a collection set

• Emphasis on the distribution of the keyword across genes

• It is the number of documents that contain the word

• Calculated with respect to a gene (or a related group of documents)
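Document frequency for one gene can be sketched in a few lines of Python. The helper name `document_frequency` and the sample word lists are hypothetical; each inner list stands for the keywords of one abstract in a gene's query set.

```python
from collections import Counter


def document_frequency(docs):
    """Return a Counter mapping word -> number of documents that contain it.

    docs is a list of word lists, one per abstract of a single gene.
    """
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word at most once per document
    return df


# Hypothetical keyword lists for three abstracts of one gene.
cwp1_docs = [["cell", "wall", "cell"], ["wall", "protein"], ["mannoprotein"]]
df = document_frequency(cwp1_docs)
```

Converting each document to a set is what separates document frequency from raw term frequency: "cell" occurs twice in the first abstract but counts once.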

Z-Score for gene Cwp1 (cont'd) (Document Frequency Calculation)

Z-Score for gene Cwp1 (cont'd) (Mean Frequency Calculation)

• Sum of the document frequencies of a keyword across all genes, divided by the total number of genes

Z-Score for gene Cwp1 (cont’d) (Numerator Calculation)

Sum of Squares of the Numerator for gene Cwp1

Z-Score (cont’d) (Standard Deviation Calculation)

• Indicates how far the values deviate from the mean
• The smaller the standard deviation, the more likely the word is a high-information content word

• For a smaller standard deviation, the z-score will be higher

Z-Score for gene Cwp1 (cont’d) (Standard Deviation Calculation)

Z-Score for gene Cwp1 (cont’d)

TF-IGF Calculations

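• TF-IGF of a word ‘a’ in gene ‘n’:

$$\mathrm{TFIGF}_a^n = \mathrm{TF}_a^n \times \mathrm{IGF}_a$$

where $\mathrm{TF}_a^n = \sum_{d} \mathrm{tf}_a^d$ (summed over the documents d of gene n) and $\mathrm{IGF}_a = \log\left(N / \mathrm{GF}_a\right)$. $\mathrm{GF}_a$ is the number of genes that contain the word ‘a’, and N is the number of genes.

• Emphasis on the importance of a keyword within a gene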
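The TF-IGF weighting can be sketched as follows. It assumes the inverse group frequency takes the usual inverse-document-frequency form, log(N / GF), with the natural logarithm; the log base is not stated in the slides. The function name `tf_igf` and the sample counts are hypothetical.

```python
import math


def tf_igf(term_freq):
    """Compute TF-IGF weights.

    term_freq maps gene -> {word: term frequency summed over the gene's documents}.
    Assumption: IGF = log(N / GF), where GF is the number of genes containing the word.
    """
    n_genes = len(term_freq)

    # Group frequency: in how many genes does each word occur?
    group_freq = {}
    for words in term_freq.values():
        for word, tf in words.items():
            if tf > 0:
                group_freq[word] = group_freq.get(word, 0) + 1

    return {
        gene: {
            word: tf * math.log(n_genes / group_freq[word])
            for word, tf in words.items()
            if tf > 0
        }
        for gene, words in term_freq.items()
    }


# Hypothetical summed term frequencies for two genes.
weights = tf_igf({
    "Cwp1": {"wall": 3, "cell": 2},
    "Cln1": {"cell": 5, "cyclin": 4},
})
```

In this toy input "cell" occurs in both genes, so its IGF is log(1) = 0 and it drops out of the ranking, while "wall" and "cyclin" keep a positive weight.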

TF-IGF for gene “Cwp1” (cont’d) (Term Frequency Calculations)

TF-IGF (cont’d) (Group Frequency Calculations for gene “Cwp1”)

TF-IGF (cont’d) (Inverse Group Frequency Calculations for gene “Cwp1”)

TF-IGF for gene “Cwp1” (cont’d)

Results

• Determine high-information content words for a gene

• The higher the z-score of a keyword in a gene, the more uniquely it describes the functionality of the gene

• The higher the TF-IGF value of a keyword in a gene, the more unique it is to that gene compared to other genes

• TF-IGF yields better-quality keywords because it filters out unwanted keywords

Top keywords for gene “Cwp1” using z-score

• Irrespective of document frequency, the top 75 of 1612 keywords share the same high z-score of 6.245

• Top 75 keywords are unique to Cwp1
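A short derivation shows why every word unique to one gene gets the same z-score. It assumes the z-score is the document frequency minus the mean, divided by a population standard deviation (dividing by N); the symbol f is introduced here for illustration. Suppose a word occurs in f documents of one gene and in no documents of the other N - 1 genes:

$$
\begin{aligned}
\bar{F}_a &= \frac{f}{N}, \\
\sigma_a &= \sqrt{\frac{1}{N}\left[\left(f - \frac{f}{N}\right)^2 + (N-1)\left(\frac{f}{N}\right)^2\right]} = \frac{f\sqrt{N-1}}{N}, \\
Z_a^n &= \frac{f - f/N}{\sigma_a} = \sqrt{N-1}.
\end{aligned}
$$

The frequency f cancels, so the score depends only on the number of genes. Under this assumption, 6.245 ≈ √39 would correspond to N = 40. The slides do not state which N or which standard-deviation convention produced the value, so this is only a consistency check.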

Limitations

• Some types of parsing errors result in false positives

Top keywords for gene “Cwp1” using TF-IGF

• Better quality keywords obtained from TF-IGF

Cluster 3.0 Clustering Software

• Used Cluster 3.0, open-source clustering software designed specifically for gene expression data analysis

• Builds on the original Cluster program developed at Stanford University
• Can run on Windows/Mac/Linux
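Preparing keyword weights for a clustering tool can be sketched as writing a gene-by-keyword matrix to tab-delimited text. This assumes a layout with one identifier column followed by one numeric column per keyword, which is believed to match what Cluster 3.0 reads, but the exact header conventions should be checked against its manual. The function name `to_tab_delimited`, the `GENE` header label, and the sample weights are hypothetical.

```python
import csv
import io


def to_tab_delimited(weights, keywords):
    """Render {gene: {keyword: weight}} as tab-delimited text, one row per gene.

    Assumption: a keyword absent from a gene gets weight 0.0.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["GENE"] + keywords)
    for gene in sorted(weights):
        writer.writerow([gene] + [weights[gene].get(k, 0.0) for k in keywords])
    return buf.getvalue()


# Hypothetical weights for two genes and two keywords.
matrix_text = to_tab_delimited(
    {"Cwp1": {"wall": 2.0}, "Cln1": {"cyclin": 1.5}},
    ["cyclin", "wall"],
)
```

Each gene becomes a row vector of keyword weights, so genes with similar weight profiles end up close together under whatever distance measure the clustering software applies.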

Yeast Genes Grouped using Z-Score

Cluster  Genes
1        Gic2, Rad27, Dun1, Tel2, Cdc20, Far1
2        Cln1, Cln2, Cdc6
3        Gic1, Ace2, Mcm3
4        Exg1, Htb2, Cts1
5        Mcm2
6        Mnn1, Och1, Hho1, Mcm6
7        Msb2, Rsr1, Bud9, Kre6, Cwp1, Clb5, Clb6, Rnr1, Cdc21, Cdc45, Htb1, Hta1, Hta2, Hht1, Tem1
8        Rad51
9        Ste2

Yeast Genes Grouped using TF-IGF

Cluster  Genes
1        Cdc20
2        Cln1, Cln2
3        Swi5, Ace2
4        Cdc6, Mcm3, Mcm6, Cdc46
5        Mcm2
6        Cdc45
7        Msb2, Rsr1, Bud9, Mnn1, Och1, Exg1, Kre6, Cwp1, Clb5, Clb6, Rnr1, Rad27, Cdc21, Dun1, Htb1, Htb2, Hta1, Hta2, Hho1, Hht1, Tel2, Tem1, Clb2, Cts1, Gic1, Gic2
8        Rad51
9        Ste2, Far1

Analysis

• The z-score is independent of document frequency only when the word is unique to the gene

[Figure panels: Cln1 Gene, Cln2 Gene]

Analysis (cont'd.)

• Words found with low frequency but with high z-score

Analysis (cont'd.)

• Words with high frequency have low z-scores

Future Scope

• A new background set can be used, for example:
  • Abstracts that contain the gene name and its related words in the title
  • Unrestricted background set

• Applying a stemming algorithm to the MEDLINE database

• Concepts of Latent Semantic Metrics can be applied by preserving the order of words

Special Thanks To…

Dr. Venu Dasigi
Dr. Vipa Phuntumart
Dr. Ray Kresman
Pukar Hamal
