An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April 24, 2007 Some material has been adapted from course notes from IBIOS 551: Genomics and BIOL 597F:

Dec 21, 2015

Transcript
Page 1

An Introduction to Bioinformatics

Brian Canada
PhD Candidate in Integrative Biosciences (IBIOS)
Option in Bioinformatics & Genomics (BG)

IST 497 - April 24, 2007
Some material has been adapted from course notes from IBIOS 551: Genomics and BIOL 597F: Bioinformatics I

Page 2

What is Bioinformatics?

Simplest definition: The use of computers to study biology (particularly molecular biology and genetics)
Highly interdisciplinary
    Mathematics, statistics, computer science, biology, engineering

Page 3

Applications & Subfields of Bioinformatics

Genomics
    Mapping & sequencing of entire genomes (all the DNA on all the chromosomes in an organism)
    Functional genomics (sometimes called “phenomics”): deducing information about the function of DNA sequences
Proteomics
    Prediction of protein structure and function from protein sequence
Systems biology
    Study of the dynamics with which genes and gene products interact with each other
Other applications
    Enzyme design/re-design
    Quantitative image analysis

Page 4

Outline for lecture

Some basic definitions
How are genomes sequenced?
What are some of the ethical and social concerns in bioinformatics and genomics?
What are the key computational skills & methods used in bioinformatics?
How do I use some of the more popular bioinformatics tools?

Page 5

Some basic definitions

DNA (deoxyribonucleic acid) - a double-stranded biological macromolecule consisting of a sequence of 4 nucleotides:
    A = Adenine
    C = Cytosine
    G = Guanine
    T = Thymine
In double-stranded DNA, each nucleotide base-pairs with a complementary nucleotide:
    A base-pairs with T
    C base-pairs with G

Image source: Wikipedia
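
The base-pairing rule above can be written as a small lookup table. The following is a minimal Python sketch rather than anything from the lecture itself; the function names are illustrative. It also computes the reverse complement, which relies on one additional fact: the two strands of a double helix run in opposite directions, so the partner strand is conventionally read backwards.

```python
# Base-pairing rules from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("AACCGT"))          # TTGGCA
    print(reverse_complement("AACCGT"))  # ACGGTT
```

Applying reverse_complement twice returns the original strand, which is a quick sanity check on the table.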

Page 6

Definitions, cont’d

mRNA (messenger RNA) - the single-stranded “transcribed” form of DNA, consisting of the nucleotides A, C, G, and U (uracil)
    mRNA is transcribed by an enzyme (catalytic protein) called RNA polymerase
Gene - a sequence of DNA that contains coding elements (exons) interspersed with noncoding elements (introns)
    Mature mRNA contains only the exons – the parts of the gene that “code” for a protein
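
The two definitions above (transcription, and removal of introns) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene sequence is made up, the exon coordinates are hypothetical, and the DNA is given as the coding strand, so the RNA has the same letters with U in place of T.

```python
def transcribe(coding_strand: str) -> str:
    """Return RNA with the same sequence as the DNA coding strand (T -> U)."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (0-based, end-exclusive) and join them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Made-up gene: exon, intron, exon (coordinates are hypothetical).
EXON_1 = "ATGGCC"
INTRON = "GTAAGTTTTCAG"
EXON_2 = "GATTAA"
GENE = EXON_1 + INTRON + EXON_2
EXONS = [(0, 6), (18, 24)]

if __name__ == "__main__":
    pre_mrna = transcribe(GENE)
    print(pre_mrna)                 # AUGGCCGUAAGUUUUCAGGAUUAA
    print(splice(pre_mrna, EXONS))  # AUGGCCGAUUAA
```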

Page 7

http://138.192.68.68/bio/Courses/biochem2/GeneIntro/GeneIntroResources/

Page 8

Definitions, cont’d

Protein - a macromolecule produced by the translation of the mRNA sequence
    Translation is mediated by tRNA (transfer RNA) and rRNA (ribosomal RNA)
    Proteins consist of a combination of 20 different amino acids linked by peptide bonds
    A sequence of three nucleotides is called a codon; each codon corresponds to a specific amino acid (or to a stop signal)
    Proteins carry out most of the functions of a cell

Page 9

Codon table
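
A codon table is naturally represented as a dictionary from three-letter codons to one-letter amino acid codes. The sketch below builds the standard genetic code from a compact 64-character string (codons enumerated with bases in the order U, C, A, G; `*` marks a stop codon) and translates an mRNA. It is a simplified illustration: the reading frame is assumed to start at the first letter, and the example sequence is made up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG x UCAG x UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCGAUUAA"))  # MAD (Met-Ala-Asp, then stop)
```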

Page 10

Central Dogma of Molecular Biology

DNA acts as a template to replicate itself
DNA is also transcribed into RNA
RNA is translated into protein

Page 11

Genotype and Phenotype

Genotype refers to the specific hereditary genetic makeup of an individual organism
    Homozygous: both copies of a gene (or part of a gene) are identical
    Heterozygous: offspring inherits one version of the gene from one parent, and another version of the gene from the other parent
Phenotype refers to an organism’s observable trait or other characteristic that results from the interaction of genotype and environment

Page 12

The Human Genome Project (HGP)

Coordinated by DOE and NIH, begun in 1990
Objectives:
    Identify all the genes in human DNA and how they vary within our species
    Determine the sequences of the 3 billion nucleotide base pairs that make up human DNA
    Store this information in well-designed databases for easy retrieval
    Develop improved tools for analysis of gene sequence data
    Address the ethical, legal, and social issues (ELSI) that may arise from the project
Private-sector effort conducted in parallel by Celera Genomics (headed by Craig Venter)
Working draft announced in 2000; finished sequence completed in 2003

Page 13

The HGP approach to sequencing the human genome

Painstakingly precise
Small pieces of DNA were “clipped” from the 23 pairs of human chromosomes, which were individually separated out of human blood and sperm cells
Each of these short DNA pieces was individually sequenced using electrophoresis gels
Each piece of sequenced DNA was matched up with the DNA on either side of it in the chromosomal sequence
Analogous to taking out one page of an encyclopedia at a time, ripping that page up, and then putting it together again

Page 14

The Celera Genomics approach to sequencing the human genome

“Shotgun” sequencing strategy
    All the DNA in all chromosomes is “torn up” simultaneously, and the fragments are individually sequenced
    Computational methods are used to look for overlaps in the sequence fragments to rebuild them into a whole genome
Analogous to ripping up all pages of an entire encyclopedia at once and then attempting to put it all back together
Much faster than traditional sequencing methods, but prone to incorrect assembly of “random” fragments
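
The overlap-and-rebuild idea behind shotgun assembly can be illustrated with a toy greedy assembler. This is a minimal sketch, not how Celera's assembler actually worked: it assumes error-free fragments from one strand, uses made-up sequences, and repeatedly merges the pair of fragments with the longest suffix-to-prefix overlap. The function names and the `min_len` parameter are illustrative.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; return separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    reads = ["GATCCTAG", "AGCATTCG", "TTCGGATC"]  # made-up fragments
    print(greedy_assemble(reads))  # ['AGCATTCGGATCCTAG']
```

When a genome contains repeated stretches, several fragments can overlap equally well, and a greedy choice like this one may join pieces that do not belong together. That is one way to picture the misassembly risk noted on the slide.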

Page 15

What are some of the ethical and social implications and concerns of the Human Genome Project outcomes?

Fair use:
    Who should have access to personal genetic information, and how will it be used?
Privacy and confidentiality:
    Who owns and controls genetic information?
Psychological impact and stigmatization:
    How does personal genetic information affect an individual and society's perceptions of that individual?
    How does genomic information affect members of minority communities?

Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

Page 16

What are some of the ethical and social implications and concerns of the Human Genome Project outcomes?

Clinical issues:
    How will genetic tests be evaluated and regulated for accuracy, reliability, and utility?
    How do we prepare healthcare professionals for the new genetics?
    How do we prepare the public to make informed choices?
    How do we as a society balance current scientific limitations and social risk with long-term benefits?
Uncertainties:
    Should testing be performed when no treatment is available?
    Should parents have the right to have their minor children tested for adult-onset diseases?
    Are genetic tests reliable and interpretable by the medical community?

Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

Page 17

What are some of the ethical and social implications and concerns of the Human Genome Project outcomes?

Conceptual and philosophical implications:
    Do people's genes make them behave in a particular way?
    Can people always control their behavior?
    What is considered acceptable diversity?
    Where is the line between medical treatment and enhancement?
Reproductive rights and decision making:
    Do healthcare personnel properly counsel parents about the risks and limitations of genetic technology?
    How reliable and useful is fetal genetic testing?
    What are the larger societal issues raised by new reproductive technologies?

Source: http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml

Page 18

deCODE: A Case Study in Ethics

In 1996, Kari Stefansson started his company, deCODE Genetics, with a mission to use population genetics to discover new genes associated with human disease
Target population: 275,000 living Icelanders
Iceland’s government had originally endorsed deCODE’s effort to obtain medical records of all Icelanders as well as the creation of “genomic fingerprints” from every citizen

Page 19

What are the advantages of such a plan?

Iceland’s population is highly homogeneous
    The vast majority have descended from a few European explorers arriving in Iceland 1,000 years ago
Icelanders have a strong tradition of maintaining family trees
Single healthcare provider, so all medical records are in one database
Family relationships can thus be easily correlated with medical records
Therefore, finding significant genetic differences that lead to certain medical conditions, such as cardiovascular disease, cancer, and schizophrenia, is likely to be easier than in a heterogeneous population (like that of the U.S.)

Page 20

Why was there opposition?

Method for obtaining data and medical records was “opt-out” (informed dissent) rather than “opt-in” (informed consent)
Records and other data could be sold to other companies that wanted to use this information to help develop new drugs
Some felt patient-physician confidentiality was compromised, and doctors worried that patients would be less forthcoming about their illnesses
Iceland’s supreme court ultimately ruled against the default of automatic inclusion in deCODE’s database
    Court based decision on complaints from a minor who objected to her dead father’s information being included in the database
    Theoretically possible to use the father’s medical data to make inferences about the daughter; could lead to unfairly assessed insurance premiums

Page 21

Interesting fiction about the ethics of genetics and genomics…

Only $18.45 at Amazon!

Page 22

Outline for lecture

Some basic definitions
How are genomes sequenced?
What are some of the ethical and social concerns in bioinformatics and genomics?
What are the key computational skills & methods used in bioinformatics?
How do I use some of the more popular bioinformatics tools?

Page 23

Knowing where to look: Using public databases and data formats

PubMed: For surveying biological/medical literature
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
GenBank: Nucleic acid & protein sequences
    http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
    http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
SWISS-PROT at ExPASy: Protein sequences
    http://us.expasy.org/sprot/
PFAM: Database of alignments of protein families
    http://www.sanger.ac.uk/Software/Pfam/
Protein Data Bank (PDB): Protein structure
    http://www.pdb.org
Gene Ontology (GO): A standardized vocabulary for describing protein functions
    http://www.geneontology.org/
OMIM (Online Mendelian Inheritance in Man): Catalog of genes and associated disorders
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
PhenomicDB: Simultaneously compare phenotypes of several organisms sharing homologous genes
    http://www.phenomicdb.de
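
Sequence databases such as GenBank can return records in FASTA format, a plain-text format in which a header line beginning with `>` is followed by one or more lines of sequence. The slide does not name a specific format, so using FASTA as the example is an assumption. The parser below is a minimal sketch, and the record names and sequences in it are made up.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record name: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""  # first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


EXAMPLE = """\
>seq1 made-up DNA record
AACCGTACTTG
ACGT
>seq2 another made-up record
ACAGGTGGTG
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE).items():
        print(name, len(sequence), sequence)
```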

Page 24

Computational Methods in Bioinformatics

Sequence alignment & sequence searching
    BLAST: Basic Local Alignment Search Tool
    http://www.ncbi.nlm.nih.gov/BLAST/
Whole genome analysis
    UCSC Genome Browser
    http://genome.ucsc.edu
Gene prediction
    GenScan: searches for putative (hypothetical) genes
    http://genes.mit.edu/GENSCAN.html
Multiple sequence alignment
    ClustalW
    http://www.ebi.ac.uk/clustalw/

Page 25

A multiple sequence alignment (MSA)

Image source: http://www.biochemj.org/bj/370/0651/bj3700651.htm

Page 26

Computational Methods in Bioinformatics

Phylogenetic Analysis
    Attempts to describe the evolutionary relationships within a group of sequences
    Uses a “tree” or “cladogram” to represent relationships
    PHYLIP: http://evolution.genetics.washington.edu/phylip.html

Image source: http://www.nature.com/ng/journal/v33/n3s/fig_tab/ng1113_F1.html
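
Many tree-building methods start from a table of pairwise distances between aligned sequences. As a hedged illustration (the slide does not say which method or distance to use), the sketch below computes the simplest such measure, the fraction of aligned positions that differ, often called the p-distance. The three sequences and their names are made up.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions, ignoring columns that contain a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


ALIGNED = {  # made-up, already aligned sequences
    "seq_a": "ACGTACGTAC",
    "seq_b": "ACGTACGTTC",
    "seq_c": "ACGAACCTTC",
}

if __name__ == "__main__":
    for (name1, s1), (name2, s2) in combinations(ALIGNED.items(), 2):
        print(name1, name2, p_distance(s1, s2))
```

In this toy data seq_a and seq_b are the closest pair, so a distance-based tree would be expected to group them first.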

Page 27

Computational Methods in Bioinformatics

Protein structure visualization
    RCSB-PDB Explorer: http://www.rcsb.org/pdb/home/home.do
Protein sequence analysis, structure prediction, and structural analysis
    ExPASy: http://us.expasy.org/
Protein structural alignment and comparison
    Combinatorial Extension of the Optimal Path (CE): http://cl.sdsc.edu/

Image source: http://www.p450.kvl.dk/gallery/

Page 28

Two “ubiquitous” bioinformatics tools

BLAST: Basic Local Alignment Search Tool (Altschul et al., 1990)
Genome Browser at University of California–Santa Cruz (Kent et al., 2002)

Page 29

BLAST: Basic Local Alignment Search Tool

Co-developed by Prof. Webb Miller, director of bioinformatics at PSU
Initially conceived to visualize DNA sequences retrieved from a database and identify local alignments to a query sequence
    Break the query and database sequences into “words” of gene or protein letters, then seek matches between fragments
    Uses “substitution matrices” and dynamic programming to calculate alignment scores
http://www.ncbi.nlm.nih.gov/BLAST/
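
The word-matching step described above can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it finds only exact word matches (seeds) between a query and a single subject sequence, and it leaves out the extension and scoring steps. The sequences, the word size `w`, and the function names are assumptions made for the example.

```python
from collections import defaultdict


def index_words(sequence: str, w: int) -> dict[str, list[int]]:
    """Map every length-w word in the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def find_seeds(query: str, subject: str, w: int = 4) -> list[tuple[int, int]]:
    """Return (query position, subject position) pairs of exact word matches."""
    subject_index = index_words(subject, w)
    seeds = []
    for q in range(len(query) - w + 1):
        for s in subject_index.get(query[q:q + w], []):
            seeds.append((q, s))
    return seeds


if __name__ == "__main__":
    # Words GATT and ATTA occur in both sequences.
    print(find_seeds("GATTACA", "TTGATTAGC"))  # [(0, 2), (1, 3)]
```

Both seeds lie on the same diagonal (subject position minus query position equals 2), which is the kind of evidence a search tool can use to decide where a local alignment is worth computing.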

Page 30

Similarity and homology

Sequences (or structures or other objects) that look like each other are similar.
If that similarity results from their having a common ancestor, then those sequences are homologous.
    If the homologs have diverged because of a speciation event, the sequences are orthologous
        Ex: Human hemoglobin vs. mouse hemoglobin
    If the homologs have diverged because of gene duplication, the sequences are paralogous
        Ex: Different versions of hemoglobin in human (adult vs. fetal)
If the similarity results from convergent evolution from ancestrally different sequences, then the sequences are analogous.

Page 31

Definition of alignments

Alignment
    A mapping of one sequence onto at least one other sequence to bring out similarities
    An alignment column can contain matches, mismatches, or gaps
Global alignment
    The mapping extends throughout the sequences
    Appropriate when the sequences are homologous throughout their lengths
Local alignment
    The mapping is limited to the regions of highest similarity
    Most appropriate for database searches
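
The difference between global and local alignment shows up clearly in the dynamic programming recurrence used to score them. The sketch below is a minimal illustration using a simple +1/-1 scheme with a linear gap penalty (the approaches commonly known as Needleman-Wunsch for global and Smith-Waterman for local alignment). It returns only the best score, not the alignment itself, and the parameter values and sequences are assumptions chosen for the example.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -1, local: bool = False) -> int:
    """Best global (default) or local alignment score of a and b."""
    # Row 0: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)    # a local alignment may restart anywhere
                best = max(best, score)  # and may end anywhere
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    a, b = "TTTTGATTACA", "GATTACACCCC"
    print(alignment_score(a, b, local=True))  # 7 (the shared GATTACA)
    print(alignment_score(a, b))              # global score is lower
```

The only differences between the two modes are the boundary values and the floor at zero, which is why a local alignment can ignore the unrelated flanking regions that drag a global score down.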

Page 32

Making a local alignment

An alignment of two sequences (frequently called a local alignment) can be obtained as follows:
1. Extract a segment from each sequence
2. Add dashes (gap symbols) to each segment to create equal-length sequences
3. Place one “padded” segment over the other

For example:
AACC-GTACTTG
A-CAGGTGG-TG

Page 33: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Alignment scores

To distinguish between “good” and “bad” alignments, we need a rule that assigns a numerical score to any alignment. The higher the score, the better the alignment.

Example of a simple scoring rule:
  Match scores +1
  Mismatch or gap scores -1
  The following alignment scores 0 total (6 matches, 6 mismatches/gaps):

AACC-GTACTTG
A-CAGGTGG-TG
+-+--++---++

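The simple scoring rule above can be sketched in Python as follows. The function name `score_alignment` and its parameters are illustrative assumptions. The two rows are the padded segments from the example.

```python
def score_alignment(top, bottom, match=1, penalty=-1):
    """Score two equal-length padded rows column by column."""
    if len(top) != len(bottom):
        raise ValueError("padded rows must have equal length")
    # '+' marks a column holding two identical residues;
    # '-' marks a mismatch or a column containing a gap.
    marks = "".join("+" if x == y and x != "-" else "-"
                    for x, y in zip(top, bottom))
    score = sum(match if m == "+" else penalty for m in marks)
    return score, marks


top = "AACC-GTACTTG"
bottom = "A-CAGGTGG-TG"
score, marks = score_alignment(top, bottom)
print(top)
print(bottom)
print(marks)
print("score:", score)
```

The `marks` string holds one symbol per alignment column, so it can be printed directly under the two rows.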

Page 34: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Substitution Matrices (also called “scoring matrices”)

Scores depend on “evolutionary distance”

The example matrix below shows scores used in a human-mouse alignment


       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
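As a rough illustration of how such a matrix is used, the Python sketch below stores the human-mouse scores in a nested dictionary and sums them over the columns of an ungapped alignment. The names `SCORES` and `score_ungapped` and the short example rows are assumptions for illustration. The slide gives no gap penalties, so gaps are deliberately not handled.

```python
# Nucleotide substitution scores copied from the human-mouse matrix above.
SCORES = {
    "A": {"A": 91, "C": -114, "G": -31, "T": -123},
    "C": {"A": -114, "C": 100, "G": -125, "T": -31},
    "G": {"A": -31, "C": -125, "G": 100, "T": -114},
    "T": {"A": -123, "C": -31, "G": -114, "T": 91},
}


def score_ungapped(top, bottom, scores=SCORES):
    """Sum substitution scores over the columns of an ungapped alignment."""
    if len(top) != len(bottom):
        raise ValueError("rows must have equal length")
    return sum(scores[x][y] for x, y in zip(top, bottom))


# Hypothetical rows: the first column is an A/G pair in one case
# and an A/C pair in the other; the remaining columns match.
transition_score = score_ungapped("ACGT", "GCGT")
transversion_score = score_ungapped("ACGT", "CCGT")
print(transition_score, transversion_score)
```

In this matrix the A/G and C/T pairs (transitions) are penalized less than the other mismatches, so the first alignment scores higher than the second.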

Page 35: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Amino Acid scoring matrices

This is the BLOSUM62 amino acid scoring matrix, which is built from a database of aligned amino acid sequence blocks, with sequences of 62% or greater identity clustered together

Each score in the matrix is a “log odds” score
  Positive score: In an alignment of two protein sequences, this amino acid pair is found more often than by chance
  Negative score: less often than by chance
  Zero score: same as expected by chance

More weight is given to matches between the rarer amino acids, such as sulfur-containing residues (e.g., cysteine, C) or very large amino acids like tryptophan (W)


Image source: Bioinformatics: Sequence and Genome Analysis by David Mount (2nd ed., 2004)
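The sign convention of a log-odds score can be illustrated with a short Python sketch. The frequencies below are made-up numbers, not real BLOSUM62 inputs, and the function name `log_odds_score` is hypothetical. The base-2 logarithm and the scale factor of 2 (half-bit units, as commonly reported for BLOSUM62) are assumptions not stated on the slide.

```python
import math


def log_odds_score(observed, expected, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies of an amino acid pair in aligned sequences
# (observed) versus what random pairing would give (expected).
more_often = log_odds_score(observed=0.020, expected=0.005)
as_expected = log_odds_score(observed=0.010, expected=0.010)
less_often = log_odds_score(observed=0.001, expected=0.004)
print(more_often, as_expected, less_often)
```

A pair seen four times more often than chance gets a positive score, a pair seen exactly as often as chance gets zero, and a pair seen four times less often gets a negative score.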

Page 36: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Can you score this alignment?

MREQH
MSCQH


Page 37: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

 M   R   E   Q   H
 M   S   C   Q   H

+5  -1  -4  +5  +8  = +13

More closely related than by chance!

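The column-by-column calculation above can be reproduced with a small Python sketch. Only the five BLOSUM62 entries used on the slide are included; the dictionary name `BLOSUM62_SUBSET` and the helper `pair_score` are hypothetical, and a full matrix would be needed to score other alignments.

```python
# Only the five BLOSUM62 entries needed for this alignment, as given on the slide.
BLOSUM62_SUBSET = {
    ("M", "M"): 5,
    ("R", "S"): -1,
    ("E", "C"): -4,
    ("Q", "Q"): 5,
    ("H", "H"): 8,
}


def pair_score(x, y, table=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (x, y) in table:
        return table[(x, y)]
    return table[(y, x)]


top = "MREQH"
bottom = "MSCQH"
column_scores = [pair_score(x, y) for x, y in zip(top, bottom)]
total = sum(column_scores)
print(column_scores, "=", total)
```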

Page 38: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

So let’s see it in action: the BLAST tutorial

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

Page 39: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

The UCSC Genome Browser

Much more interactive than BLAST and most other bioinformatics tools

Quick demonstration using HBB (human beta hemoglobin, a blood protein)

http://genome.ucsc.edu


Page 40: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Programming

BioPerl
  Open-source Perl tools for bioinformatics & genomics
  Includes a collection of modules that facilitate the development of scripts for bioinformatics applications
  http://www.bioperl.org
  Online course: http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/

Other languages:
  Java (for developing your own bioinformatics GUIs)
  Rapid prototyping languages: R, Matlab
    Matlab Bioinformatics Toolbox


Page 41: An Introduction to Bioinformatics Brian Canada PhD Candidate in Integrative Biosciences (IBIOS) Option in Bioinformatics & Genomics (BG) IST 497 - April.

Summary

Now you know:
  What bioinformatics is
  Where to look for biological data
  What kinds of skills and methods are used to analyze biological data
  How to query BLAST and the UCSC Genome Browser
  What a sequence alignment is
  How the human genome was sequenced
  Some of the questions surrounding the ethical and social implications of the Human Genome Project

To learn more, consider taking BIOL 597F (Bioinformatics I) or IBIOS 551 (Genomics) in the fall
