An Introduction to Bioinformatics
Brian Canada
PhD Candidate in Integrative Biosciences (IBIOS), Option in Bioinformatics & Genomics (BG)
IST 497 - April 24, 2007
Some material has been adapted from course notes from IBIOS 551: Genomics and BIOL 597F: Bioinformatics I
What is Bioinformatics?
- Simplest definition: the use of computers to study biology (particularly molecular biology and genetics)
- Highly interdisciplinary
  - Genomics
    - Mapping & sequencing of entire genomes (all the DNA on all the chromosomes in an organism)
    - Functional genomics (sometimes called “phenomics”): deducing information about the function of DNA sequences
  - Proteomics
    - Prediction of protein structure and function from protein sequence
  - Systems biology
    - Study of the dynamics with which genes and gene products interact with each other
  - Other applications
    - Enzyme design/re-design
    - Quantitative image analysis
Outline for lecture
- Some basic definitions
- How are genomes sequenced?
- What are some of the ethical and social concerns in bioinformatics and genomics?
- What are the key computational skills & methods used in bioinformatics?
- How do I use some of the more popular bioinformatics tools?
Some basic definitions
- DNA (deoxyribonucleic acid) - a double-stranded biological macromolecule consisting of a sequence built from 4 nucleotides:
  - A = Adenine
  - C = Cytosine
  - G = Guanine
  - T = Thymine
- In double-stranded DNA, each nucleotide base-pairs with a complementary nucleotide:
  - A base-pairs with T
  - C base-pairs with G
Image source: Wikipedia
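
The base-pairing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lecture; the function names are hypothetical, and the reverse-complement step assumes the standard fact that the two strands of DNA run in opposite directions.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-paired partner of each nucleotide, in the same order."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own direction (hypothetical helper)."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

Because each base determines its partner, one strand is enough to reconstruct the other, which is why sequence databases typically store only a single strand.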
Definitions, cont’d
- mRNA (messenger RNA) - the single-stranded “transcribed” form of DNA, consisting of the nucleotides A, C, G, and U (uracil)
  - mRNA is transcribed from DNA by an enzyme (catalytic protein) called RNA polymerase
- Gene - a sequence of DNA that contains coding elements (exons) interspersed with noncoding elements (introns)
  - Mature mRNA contains only the exons – the parts of the gene that “code” for a protein
- Protein - a macromolecule produced by the translation of the mRNA sequence
  - Translation is mediated by tRNA (transfer RNA) and rRNA (ribosomal RNA)
  - Proteins consist of a combination of 20 different amino acids linked by peptide bonds
  - A sequence of three nucleotides is called a codon; most codons correspond to a specific amino acid, and three act as “stop” signals
  - Proteins carry out most of the functions of a cell
Codon table
Central Dogma of Molecular Biology
- DNA acts as a template to replicate itself
- DNA is also transcribed into RNA
- RNA is translated into protein
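
The DNA-to-RNA-to-protein flow can be traced with a short Python sketch. It is a simplification under stated assumptions: the input is taken to be the coding strand of an intron-free gene, the function names are hypothetical, and the codon table holds only the handful of standard-code entries this example needs.

```python
# Partial codon table (mRNA codon -> amino acid); only the entries used below.
CODON_TABLE = {
    "AUG": "Met", "GCC": "Ala", "UGG": "Trp", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(coding_dna: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand with U in place of T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Read codons three bases at a time until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGGCCTGGAAATAA")
print(mrna)             # AUGGCCUGGAAAUAA
print(translate(mrna))  # ['Met', 'Ala', 'Trp', 'Lys']
```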
Genotype and Phenotype
- Genotype refers to the specific hereditary genetic makeup of an individual organism
  - Homozygous: both copies of a gene (or part of a gene) are identical
  - Heterozygous: the offspring inherits one version of the gene from one parent and a different version from the other parent
- Phenotype refers to an organism’s observable traits or other characteristics that result from the interaction of genotype and environment
The Human Genome Project (HGP)
- Coordinated by the U.S. Department of Energy (DOE) and the National Institutes of Health (NIH), begun in 1990
- Objectives:
  - Identify all the genes in human DNA and how they vary within our species
  - Determine the sequences of the 3 billion nucleotide base pairs that make up human DNA
  - Store this information in well-designed databases for easy retrieval
  - Develop improved tools for analysis of gene sequence data
  - Address the ethical, legal, and social issues (ELSI) that may arise from the project
- Private-sector effort conducted in parallel by Celera Genomics (headed by Craig Venter)
- Working draft announced in 2000; finished sequence completed in 2003
The HGP approach to sequencing the human genome
- Painstakingly precise
- Small pieces of DNA were “clipped” from the 23 pairs of human chromosomes, which were individually separated out of human blood and sperm cells
- Each of these short DNA pieces was individually sequenced using electrophoresis gels
- Each piece of sequenced DNA was matched up with the DNA on either side of it in the chromosomal sequence
- Analogous to taking out one page of an encyclopedia at a time, ripping that page up, and then putting it together again
The Celera Genomics approach to sequencing the human genome
- “Shotgun” sequencing strategy
  - All the DNA in all chromosomes is “torn up” simultaneously, and the fragments are individually sequenced
  - Computational methods are used to look for overlaps in the sequence fragments to rebuild them into a whole genome
- Analogous to ripping up all pages of an entire encyclopedia at once and then attempting to put it all back together
- Much faster than traditional sequencing methods, but prone to incorrect assembly of “random” fragments
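
The overlap-and-rebuild idea behind shotgun assembly can be illustrated with a toy greedy merger in Python. This is only a sketch under strong assumptions (error-free reads, all from the same strand, a made-up minimum overlap of 3 bases, hypothetical function names); it is not the assembler Celera actually used.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave the pieces as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # made-up reads for illustration
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

When the same stretch of sequence occurs in more than one place in a genome, several fragments share the same overlap, and a greedy merge like this one can join the wrong pair. That is one way the incorrect assemblies mentioned above arise.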
What are some of the ethical and social implications and concerns of the human genome project outcomes?
- Fair use:
  - Who should have access to personal genetic information, and how will it be used?
- Privacy and confidentiality:
  - Who owns and controls genetic information?
- Psychological impact and stigmatization:
  - How does personal genetic information affect an individual and society's perceptions of that individual?
  - How does genomic information affect members of minority communities?
Ethical and social implications and concerns, cont’d
- Clinical issues:
  - How will genetic tests be evaluated and regulated for accuracy, reliability, and utility?
  - How do we prepare healthcare professionals for the new genetics?
  - How do we prepare the public to make informed choices?
  - How do we as a society balance current scientific limitations and social risk with long-term benefits?
- Uncertainties:
  - Should testing be performed when no treatment is available?
  - Should parents have the right to have their minor children tested for adult-onset diseases?
  - Are genetic tests reliable and interpretable by the medical community?
Ethical and social implications and concerns, cont’d
- Conceptual and philosophical implications:
  - Do people's genes make them behave in a particular way?
  - Can people always control their behavior?
  - What is considered acceptable diversity?
  - Where is the line between medical treatment and enhancement?
- Reproductive rights and decision making:
  - Do healthcare personnel properly counsel parents about the risks and limitations of genetic technology?
  - How reliable and useful is fetal genetic testing?
  - What are the larger societal issues raised by new reproductive technologies?
In 1996, Kari Stefansson started his company, deCODE Genetics, with a mission to use population genetics to discover new genes associated with human disease
- Target population: 275,000 living Icelanders
- Iceland’s government had originally endorsed deCODE’s effort to obtain medical records of all Icelanders as well as the creation of “genomic fingerprints” from every citizen
What are the advantages of such a plan?
- Iceland’s population is highly homogeneous
  - The vast majority have descended from a few European explorers arriving in Iceland 1,000 years ago
- Icelanders have a strong tradition of maintaining family trees
- Single healthcare provider, so all medical records are in one database
- Family relationships can thus be easily correlated with medical records
- Therefore, finding significant genetic differences that lead to certain medical conditions, such as cardiovascular disease, cancer, and schizophrenia, is likely to be easier than in a heterogeneous population (like that of the U.S.)
Why was there opposition?
- Method for obtaining data and medical records was “opt-out” (informed dissent) rather than “opt-in” (informed consent)
- Records and other data could be sold to other companies that wanted to use this information to help develop new drugs
- Some felt patient-physician confidentiality was compromised, and doctors worried that patients would be less forthcoming about their illnesses
- Iceland’s supreme court ultimately ruled against the default of automatic inclusion in deCODE’s database
  - Court based its decision on complaints from a minor who objected to her dead father’s information being included in the database
  - Theoretically possible to use the father’s medical data to make inferences about the daughter; could lead to unfairly assessed insurance premiums
Interesting fiction about the ethics of genetics and genomics…
Only $18.45 at Amazon!
Outline for lecture
- Some basic definitions
- How are genomes sequenced?
- What are some of the ethical and social concerns in bioinformatics and genomics?
- What are the key computational skills & methods used in bioinformatics?
- How do I use some of the more popular bioinformatics tools?
Knowing where to look: Using public databases and data formats
- PubMed: For surveying biological/medical literature
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
- GenBank: Nucleic acid & protein sequences
  http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
  http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
- SWISS-PROT at ExPASy: Protein sequences
  http://us.expasy.org/sprot/
- PFAM: Database of alignments of protein families
  http://www.sanger.ac.uk/Software/Pfam/
- Protein Data Bank (PDB): Protein structure
  http://www.pdb.org
- Gene Ontology (GO): A standardized vocabulary for describing protein functions
  http://www.geneontology.org/
- OMIM (Online Mendelian Inheritance in Man): Catalog of genes and associated disorders
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
- PhenomicDB: Simultaneously compare phenotypes of several organisms sharing homologous genes
  http://www.phenomicdb.de
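
As one example of a data format, sequence databases such as GenBank can export records as FASTA text: a header line beginning with ">" followed by one or more lines of sequence. FASTA is not named on the slide, so treat the following Python reader as an illustrative assumption; the function name and the two records are made up.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record's header (without the leading '>') to its sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may be wrapped over several lines
    return records


# Made-up records for illustration; these are not real database entries.
example = """>seq1 hypothetical example
ATGGCC
TGGAAA
>seq2 another hypothetical example
GATTACA
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```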
- Protein structure visualization
  - RCSB-PDB Explorer: http://www.rcsb.org/pdb/home/home.do
- Protein sequence analysis, structure prediction, and structural analysis
  - ExPASy: http://us.expasy.org/
- Protein structural alignment and comparison
  - Combinatorial Extension of the Optimal Path (CE): http://cl.sdsc.edu/
Image source: http://www.p450.kvl.dk/gallery/
Two “ubiquitous” bioinformatics tools
- BLAST: Basic Local Alignment Search Tool (Altschul et al., 1990)
- Genome Browser at University of California–Santa Cruz (Kent et al., 2002)
BLAST: Basic Local Alignment Search Tool
- Co-developed by Prof. Webb Miller, director of bioinformatics at PSU
- Initially conceived to visualize DNA sequences retrieved from a database and identify local alignments to a query sequence
  - Breaks the query and database sequences into “words” of gene or protein letters, then seeks matches between fragments
  - Uses “substitution matrices” and dynamic programming to calculate alignment scores
- http://www.ncbi.nlm.nih.gov/BLAST/
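
The "words" step can be sketched in Python as an exact-match lookup. This is a minimal illustration of the seeding idea only, with hypothetical function names and an arbitrary word size of 4. It does not extend the seeds into scored alignments or assess their significance, both of which the real tool goes on to do.

```python
from collections import defaultdict


def index_words(sequence: str, word_size: int) -> dict[str, list[int]]:
    """Map every overlapping word of length `word_size` to its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seeds(query: str, subject: str, word_size: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    subject_index = index_words(subject, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


print(find_seeds("GATTACA", "CCGATTAGG"))  # [(0, 2), (1, 3)]
```

Indexing the database side once means each query word costs a single dictionary lookup, which is the main reason word matching is much faster than aligning the query against every database sequence in full.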
Similarity and homology
- Sequences (or structures or other objects) that look like each other are similar.
- If that similarity results from their having a common ancestor, then those sequences are homologous.
  - If the homologs have diverged because of a speciation event, the sequences are orthologous
    - Ex: Human hemoglobin vs. mouse hemoglobin
  - If the homologs have diverged because of gene duplication, the sequences are paralogous
    - Ex: Different versions of hemoglobin in human (adult vs. fetal)
- If the similarity results from convergent evolution from ancestrally different sequences, then the sequences are analogous.
Definition of alignments
- Alignment
  - A mapping of one sequence onto at least one other sequence to bring out similarities
  - An alignment column can contain matches, mismatches, or gaps
- Global alignment
  - The mapping extends throughout the sequences
  - Appropriate when the sequences are homologous throughout their lengths
- Local alignment
  - The mapping is limited to the regions of highest similarity
  - Most appropriate for database searches
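
To make "regions of highest similarity" concrete, here is a hedged Python sketch of the Smith-Waterman dynamic-programming recurrence, a standard way to compute the best local alignment score. The technique is not named in the lecture, and the scoring values (+1 match, -1 mismatch, -1 per gap position) and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between `a` and `b` with a linear gap penalty."""
    # Cell (i, j) holds the best score of a local alignment ending at a[i-1] and b[j-1].
    # Only two rows of the table are kept in memory at a time.
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which is what makes it local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTTGATTACATTT", "CCCGATTACACCC"))  # 7
```

The two example sequences share only the middle stretch GATTACA, so the local score is 7 even though the flanking regions do not match at all. A global alignment of the same pair would be dragged down by those flanks.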
Making a local alignment
An alignment of two sequences (frequently called a local alignment) can be obtained as follows:
1. Extract a segment from each sequence
2. Add dashes (gap symbols) to each segment to create equal-length sequences
3. Place one “padded” segment over the other
For example:
AACC-GTACTTG
A-CAGGTGG-TG
Alignment scores
- To distinguish between “good” and “bad” alignments, we need a rule that assigns a numerical score to any alignment. The higher the score, the better the alignment.
- Example of a simple scoring rule:
  - Match scores +1
  - Mismatch or gap scores -1
  - An alignment with 7 matches and 5 mismatches or gaps scores 7 - 5 = +2 in total
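
The simple scoring rule translates directly into code. The Python sketch below uses a hypothetical function name and a made-up pair of padded segments, chosen so that they contain 7 matches and 5 mismatch or gap columns.

```python
def score_alignment(top: str, bottom: str) -> int:
    """Score a padded alignment: +1 per match, -1 per mismatch or gap column."""
    if len(top) != len(bottom):
        raise ValueError("padded segments must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score -= 1  # gap column
        elif x == y:
            score += 1  # match
        else:
            score -= 1  # mismatch
    return score


# Made-up alignment: 7 matching columns, then 1 gap and 4 mismatches.
print(score_alignment("GATTACAGATTA",
                      "GATTACA-CCCC"))  # 2
```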
Substitution Matrices (also called “scoring matrices”)
- Scores depend on “evolutionary distance”
- The example below shows scores used in a human-mouse alignment
        A      C      G      T
A      91   -114    -31   -123
C    -114    100   -125    -31
G     -31   -125    100   -114
T    -123    -31   -114     91
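
A substitution matrix is used by looking up each aligned pair and summing the scores. The Python sketch below copies the nucleotide scores from the table above. The function name is hypothetical, and gaps are left out because the table gives no gap penalty.

```python
# Nucleotide substitution scores copied from the table (rows and columns in A, C, G, T order).
BASES = "ACGT"
ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORES = {(x, y): ROWS[i][j] for i, x in enumerate(BASES) for j, y in enumerate(BASES)}


def score_ungapped(top: str, bottom: str) -> int:
    """Sum substitution scores column by column for two equal-length, gap-free segments."""
    if len(top) != len(bottom):
        raise ValueError("segments must have equal length")
    return sum(SCORES[(x, y)] for x, y in zip(top, bottom))


print(score_ungapped("ACGT", "ACGT"))  # 91 + 100 + 100 + 91 = 382
print(score_ungapped("ACGT", "GCGT"))  # -31 + 100 + 100 + 91 = 260
```

In this matrix the A/G and C/T substitutions cost only -31, while every other mismatch costs between -114 and -125. The score of an alignment therefore depends on which bases were substituted, not just on how many columns differ.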
Amino Acid scoring matrices
- This is the BLOSUM62 amino acid scoring matrix, which is derived from a database of aligned protein sequence blocks in which sequences with 62% or greater sequence identity are clustered together
- Each score in the matrix is a “log odds” score
  - Positive score: in an alignment of two protein sequences, this amino acid pair is found more often than expected by chance
  - Negative score: found less often than expected by chance
  - Zero score: found as often as expected by chance
- More weight is given to the rarer amino acids, such as sulfur-containing residues (e.g., cysteine, C) or very large amino acids like tryptophan (W)
Image source: Bioinformatics: Sequence and Genome Analysis by David Mount (2nd ed., 2004)
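
The sign of a log-odds score follows from comparing an observed pair frequency with the frequency expected by chance. The Python sketch below is a simplified illustration. The frequencies are made-up numbers rather than values from any real matrix, the chance expectation is taken as the plain product of the two residue frequencies, and the scale factor of 2 (half-bit units, the convention used for BLOSUM62) is exposed as a parameter.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded, scaled log2 of observed vs. chance-expected pair frequency."""
    expected = freq_a * freq_b  # simplified chance expectation (assumption)
    return round(scale * math.log2(pair_freq / expected))


# Made-up frequencies: each residue makes up 10% of all residues.
print(log_odds_score(0.02, 0.1, 0.1))   # 2: pair seen twice as often as chance
print(log_odds_score(0.005, 0.1, 0.1))  # -2: pair seen half as often as chance
print(log_odds_score(0.01, 0.1, 0.1))   # 0: pair seen as often as chance
```

The same observed pair frequency gives a larger score when the residues involved are rare, because the chance expectation in the denominator is smaller. This is consistent with the extra weight given to rare residues such as cysteine and tryptophan.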
- Other languages: Java (for developing your own bioinformatics GUIs)
- Rapid prototyping languages: R, Matlab
  - Matlab Bioinformatics Toolbox
Summary
- Now you know:
  - What bioinformatics is
  - Where to look for biological data
  - What kinds of skills and methods are used to analyze biological data
  - How to query BLAST and the UCSC Genome Browser
  - What a sequence alignment is
  - How the human genome was sequenced
  - What some of the questions surrounding the ethical and social implications of the Human Genome Project are
- To learn more, consider taking BIOL 597F (Bioinformatics I) or IBIOS 551 (Genomics) in the fall