The Integrated Molecular Analysis of Genomes and their Expression Consortium’s Data Mining Tools: Introducing the IQ Peg Folta Lawrence Livermore National Laboratory 3/12/02 TRANSCRIPTOME 2002 Seattle, WA

Dec 21, 2015

Page 1

The Integrated Molecular Analysis of Genomes and their Expression

Consortium’s Data Mining Tools: Introducing the IQ

Peg Folta

Lawrence Livermore National Laboratory

3/12/02

TRANSCRIPTOME 2002 Seattle, WA

Page 2

I.M.A.G.E. maintains the world’s largest publicly available cDNA collection

5,819,514 clones arrayed

I.M.A.G.E. clones account for 64% of human ESTs in GenBank

[Chart; legend: cumulative, arrayed]

Page 3

The I.M.A.G.E. collection has been shaped by projects (C-GAP, MGC…)

[Charts: composition of the collection, one chart per category]

Species: Xenopus, Human, Other, Zebrafish, Mouse

Library Method: Standard, Full-length, Norm/Sub, Normalized, Subtracted, Norm/FL

Developmental state: adult, embryonic, juvenile

Tissue: abnormal, normal, treated

Clone sequence: 3' EST, 5' EST, Full length

Page 4

Informatics focus this year was on tools to characterize and query the collection.

• IMAGEne – mature clustering tool

• IMAGEne Tissue – allows searching of tissue type dominance in clusters

• IQ – Intelligent Query tool allows mining of I.M.A.G.E. data

• Library/plate query – allows selective searching of libraries and plates

• Problem report and query – allows users to report or query problems related to I.M.A.G.E. clones

Redesign of data management system

Page 5

IMAGEne-Human Process

[Flow diagram; box labels in the order shown]

• 2,289,020 Quality I.M.A.G.E. sequences

• 14,566 NCBI Ref Seq

• IMAGEne

• 1,676,516 Sequences

• 623,294 Sequences

• Remaining Sequences

• >50 basepairs of contiguous, non-repeat sequence

• Known Clusters: 14,566

• Candidate Clusters w/consensus: 67,521

• I.M.A.G.E. Singletons: 268,472

• 279,262 Lower quality I.M.A.G.E. Sequences
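
The ">50 basepairs of contiguous, non-repeat sequence" criterion in the process diagram could be checked along the following lines. This is only a minimal sketch, not the IMAGEne implementation: it assumes repeats have already been masked (as lowercase letters, `N`, or `X`), and the function names are hypothetical.

```python
import re


def longest_unmasked_run(seq: str) -> int:
    """Length of the longest run of uppercase A/C/G/T.

    Assumption: repeat regions were masked beforehand as lowercase
    letters, 'N', or 'X', so any other character breaks a run.
    """
    runs = re.findall(r"[ACGT]+", seq)
    return max((len(run) for run in runs), default=0)


def passes_filter(seq: str, min_run: int = 50) -> bool:
    """True if the sequence has MORE than min_run contiguous unmasked bases."""
    # The slide says ">50", so the comparison is strict.
    return longest_unmasked_run(seq) > min_run
```

A sequence with exactly 50 unmasked bases in a row would be rejected under this reading of the slide; change the comparison if the intended threshold is inclusive.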

Page 6

Initial query page: construct the query.

Page 7

Clusters matching query results: choose your cluster.

Page 8

Display of cluster

Page 9

Known gene clusters with full length I.M.A.G.E. clones have doubled in number.

[Bar chart: # of clusters (y-axis, 0 to 16000) by IMAGEne Versions (V3.0, V3.1, V3.2.1, V3.3)]

Cluster coverage: Empty, Unknown, Partial, Predicted Full, Full

Avg. gene length: 3392, 2763, 3380, 1896, 1578

Page 10

Known Gene Cluster distribution of full length clones

[Histogram: Number of Clones (y-axis, 0 to 500) vs. Length of Clone (x-axis, 200 to 7200)]

avg. length = 948

Page 11

Candidate gene clusters consensus sequence and contigs are generated by CAP4

[Bar chart: Number of Clusters (y-axis, 0 to 1000) vs. Number of Contigs in a cluster (x-axis: 1 2 3 4 5 6 7 8 9 10 11 12 14 15 21)]

Data labels: 61,314; 4,971; 824; 95; 227; 40

Page 12

Candidate Gene cluster characteristics.

[Chart values: 1938, 26236, 28317, 11030; legend: full insert, 3'&5', 3' only, 5' only]
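
The four legend entries on this slide (full insert, 3'&5', 3' only, 5' only) suggest that each candidate cluster is labelled by the kinds of clone sequence it contains. The sketch below shows one way such a labelling could work. The read-type names and the rule that a full insert takes precedence are assumptions, not something the slide states. The only grounded part is the arithmetic: the four values shown sum to 67,521, the candidate cluster count on the IMAGEne-Human Process slide.

```python
# Values as they appear on the slide (order as extracted, not mapped to labels).
SLIDE_VALUES = [1938, 26236, 28317, 11030]
TOTAL = sum(SLIDE_VALUES)  # equals the candidate cluster count, 67,521


def classify_cluster(read_types: set[str]) -> str:
    """Label a cluster by the sequence types present among its members.

    Hypothetical read-type names: "full_insert", "3prime", "5prime".
    Assumption: a cluster with any full insert sequence is labelled
    "full insert" regardless of its ESTs.
    """
    if "full_insert" in read_types:
        return "full insert"
    has_3 = "3prime" in read_types
    has_5 = "5prime" in read_types
    if has_3 and has_5:
        return "3'&5'"
    if has_3:
        return "3' only"
    if has_5:
        return "5' only"
    raise ValueError("cluster has no recognised sequence type")
```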

Page 13

Singleton: Wheat within the chaff

[Histogram: # of sequences (y-axis, 0 to 1200) vs. High Quality Sequence Length (x-axis, 0 to 4000)]

305 full insert sequences are singletons.

62,143 singletons have a 3’ PolyA site.

Avg. length is 547

Page 14

IMAGEne Tissue query allows searching for tissue proportions within clusters.
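
As a rough illustration of what "tissue proportions within clusters" could mean, the sketch below computes the fraction of each tissue among a cluster's clones and then selects clusters where one tissue dominates. The data layout, function names, and the 0.5 threshold are assumptions made for the example and are not taken from IMAGEne Tissue itself.

```python
from collections import Counter


def tissue_proportions(tissues: list[str]) -> dict[str, float]:
    """Fraction of a cluster's clones that come from each tissue."""
    counts = Counter(tissues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tissue: n / total for tissue, n in counts.items()}


def clusters_dominated_by(
    clusters: dict[str, list[str]], tissue: str, min_fraction: float = 0.5
) -> list[str]:
    """IDs of clusters where `tissue` makes up at least `min_fraction` of clones.

    `clusters` maps a hypothetical cluster ID to the tissue of each member clone.
    """
    return sorted(
        cluster_id
        for cluster_id, tissues in clusters.items()
        if tissue_proportions(tissues).get(tissue, 0.0) >= min_fraction
    )
```

For example, with `{"c1": ["brain", "brain", "liver"], "c2": ["liver"]}`, a query for `"brain"` would return only `c1`.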

Page 15

Introducing the Intelligent Query - IQ

• For a given category (currently clone and library) a user can specify a query based on key database attributes.

• The user can specify the fields returned.

• Various result format options (HTML, text)

• Initial version was rolled out last summer

• New functionality to be added this year (additional categories, etc.)
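
The query pattern described in the bullets above (filter on attributes, choose the returned fields, choose HTML or text output) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the IQ implementation: the record fields, sample data, and function names are all hypothetical, and only exact-match criteria are handled.

```python
from html import escape
from typing import Any

Record = dict[str, Any]

# Hypothetical clone records; the field names are illustrative only.
SAMPLE_CLONES: list[Record] = [
    {"clone_id": 101, "species": "Human", "tissue": "brain", "library": "libA"},
    {"clone_id": 102, "species": "Mouse", "tissue": "brain", "library": "libB"},
    {"clone_id": 103, "species": "Human", "tissue": "liver", "library": "libA"},
]


def run_query(records: list[Record], criteria: Record, fields: list[str]) -> list[Record]:
    """Keep records matching every criterion, returning only the requested fields."""
    matches = [r for r in records if all(r.get(k) == v for k, v in criteria.items())]
    return [{f: r.get(f) for f in fields} for r in matches]


def format_text(rows: list[Record], fields: list[str]) -> str:
    """Tab-delimited text with a header line."""
    lines = ["\t".join(fields)]
    lines += ["\t".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join(lines)


def format_html(rows: list[Record], fields: list[str]) -> str:
    """Minimal HTML table; values are escaped."""
    head = "".join(f"<th>{escape(f)}</th>" for f in fields)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[f]))}</td>" for f in fields) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"
```

Calling `run_query(SAMPLE_CLONES, {"species": "Human"}, ["clone_id", "tissue"])` and passing the result to either formatter mirrors the two-step flow on the following slides: specify the query, then specify the returned fields and format.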

Page 16

Specify a clone-based query.

Page 17

Next, specify which clone-centric results will be provided and in what format.

Page 18

HTML version of clone-based query results.

Page 19

Specify a library-based query.

Page 20

Similarly, specify which library-centric results will be provided.

Page 21

HTML version of library-based query results.

Page 22

Other tools to mine I.M.A.G.E. information

Query plates from libraries. Query for reported problems.

Page 23

Quality control for historical collection

Plates       Source                                     Well Error Rate
1-3705       Incyte                                     13
             LLNL Master                                10
             Research Genetics                          12
             Resource Center of Human Genome Project    10
             ATCC                                       11
3,796-6000   Incyte                                     7
             LLNL Master                                7
             Research Genetics                          10
             Resource Center of Human Genome Project    12

Page 24

QC on-going

          LLNL Replication                       Master vs. GenBank
Months    Well error rate   Plate Error Rate     Well error rate   Plate Error Rate
6/2000    1 (1,3)           0                    7 (4,11)          2
10/2000   1 (0,3)           0                    1 (0,3)           2
12/00     0 (0,2)           2                    1 (0,3)           2
1/01      2 (1,4)           0                    6 (4,11)          3
2/01      1 (0,3)           0                    2 (1,5)           2
3/01      2 (1,5)           2                    2 (1,5)           0
4/01      1 (0,3)           2                    2 (1,4)           0
5/01      0 (0,1)           0                    2 (1,5)           0
6/01      1 (0,3)           0                    1 (0,4)           0
7/01      1 (0,4)           0                    2 (1,6)           0
8/01      2 (1,3)           0                    3 (2,6)           0

Page 25

Ongoing QC results

On-going

Comparing master to GenBank

Error in replication @ LLNL

Page 26

Next for I.M.A.G.E. Informatics

• Extensive expansion of query tools and data access

• IMAGEne non-species specific

• Analysis of human cluster candidate genes and singletons

• Redo of web site, easier to navigate

MUCH influenced by public needs... you have a say!

Page 27

Acknowledgements

• LLNL
  - Christa Prange, I.M.A.G.E. PI
  - Tim Harsch, Amber Johnston, Julie Amundson

• Sponsors
  - DOE, Marv Stodolsky
  - NIH, Bob Strausberg

This work was partially funded by the NIH and was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48.

image.llnl.gov