It & Health 2010 Summary Thomas Nordahl Petersen
Dec 21, 2015
DNA/RNA
• DNA is found in the cell nucleus (eukaryotes)
• Base pairing
• T is substituted with U in RNA
• Reading direction
• Reading frame (1, 2, 3, -1, -2, -3)
• 64 codons
• DNA -> mRNA
• Intron, exon & UTR (non-coding exon)
• Intron/exon splice site
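As a small illustration of the "T substituted with U" and "64 codons" points, the following Python sketch spells a coding-strand DNA string as mRNA and enumerates every possible codon. The names `transcribe` and `CODONS` are chosen here for illustration and do not come from the slides.

```python
from itertools import product


def transcribe(dna):
    """Spell a coding-strand DNA string as mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


# Three positions with four possible bases each give 4**3 = 64 codons.
CODONS = ["".join(bases) for bases in product("UCAG", repeat=3)]

if __name__ == "__main__":
    print(transcribe("ATGGCC"))
    print(len(CODONS))
```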
Reading frame and reverse complement
Having a piece of DNA like:
TGCCATGCATAGCCCCTGCCATATCT
Forward strings & reading frames
1 : TGCCATGCATAGCCCCTGCCATATCT
2 : GCCATGCATAGCCCCTGCCATATCT
3 : CCATGCATAGCCCCTGCCATATCT
Reverse complement strings & reading frames
-1: AGATATGGCAGGGGCTATGCATGGCA
-2: GATATGGCAGGGGCTATGCATGGCA
-3: ATATGGCAGGGGCTATGCATGGCA
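A minimal sketch of how the six reading frames can be derived in Python: the reverse complement is obtained by complementing every base (A<->T, C<->G) and reversing the string, and each frame is that strand with 0, 1 or 2 leading bases dropped. The function names are illustrative only.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def reading_frames(dna):
    """Return the six frames keyed 1, 2, 3 (forward) and -1, -2, -3 (reverse)."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = forward[offset:]
        frames[-(offset + 1)] = reverse[offset:]
    return frames


if __name__ == "__main__":
    frames = reading_frames("TGCCATGCATAGCCCCTGCCATATCT")
    for key in (1, 2, 3, -1, -2, -3):
        print(f"{key:>2}: {frames[key]}")
```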
Amino acids
20 naturally occurring amino acids
- mRNA -> protein
- Reading direction
- 4 backbone atoms
- Amino acid properties
- Acidic, basic, polar, charged, hydrophobic
- 1 and 3 letter codes
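The "mRNA -> protein" step can be sketched as a codon-by-codon lookup. The example below assumes the standard genetic code (NCBI translation table 1), written in the conventional T, C, A, G base order, and returns 1-letter codes with `*` for a stop codon. The function name `translate` and its `frame` parameter are illustrative choices, not part of the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=1):
    """Translate DNA or mRNA in forward frame 1, 2 or 3 into 1-letter codes."""
    seq = sequence.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(frame - 1, len(seq) - 2, 3):
        protein.append(CODON_TABLE[seq[i:i + 3]])
    return "".join(protein)


if __name__ == "__main__":
    dna = "TGCCATGCATAGCCCCTGCCATATCT"
    for frame in (1, 2, 3):
        print(frame, translate(dna, frame))
```

Translating the reverse frames would mean passing the reverse complement of the sequence to the same function.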
Amino Acids
Amine and carboxyl groups. The side chain ‘R’ is attached to the alpha carbon (C-alpha)
The amino acids found in living organisms are L-amino acids
Amino Acids - peptide bond
N-terminal C-terminal
Databases and web-tools
Databases and biological information
• Genbank
• Uniprot
Web-tools
• NCBI Blast
• UCSC genome browser
• Weblogo
Center for Biological Sequence Analysis
Theory of evolution
Charles Darwin, 1809-1882
Phylogenetic tree
Global versus local alignments
Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).
Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).
[Figure: Seq 1 and Seq 2 shown as a global alignment and as a local alignment]
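The practical difference between the two can be seen in the scoring recurrence. The sketch below computes only the best local (Smith-Waterman) score: every cell is clamped at zero, so an alignment may start anywhere, and the result is the best cell anywhere in the table rather than the bottom-right corner. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for illustration, not values from the slides.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # Only the shared core "ACGT" contributes to the local score.
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```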
Pairwise alignment: the solution
“Dynamic programming” (the Needleman-Wunsch algorithm)
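A compact sketch of the dynamic-programming idea, assuming a simple scoring scheme (match +1, mismatch -1, linear gap -1) that is chosen here for illustration only. The table is filled so that each cell holds the best score for aligning two prefixes, and one optimal global alignment is then recovered by tracing back from the bottom-right cell.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        substitution = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, line1, line2 = needleman_wunsch("ACGT", "ACT")
    print(total)
    print(line1)
    print(line2)
```

When several alignments share the optimal score, the traceback order used here (diagonal first, then a gap in the second sequence) decides which one is reported.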
Sequence alignment - Blast
Blosum & PAM matrices
• Blosum matrices are the most commonly used substitution matrices.
• Blosum50, Blosum62, Blosum80
• PAM - Percent Accepted Mutations
• PAM-0 is the identity matrix.
• PAM-1: diagonal entries deviate slightly from 1, off-diagonal entries deviate slightly from 0.
• PAM-250 is PAM-1 raised to the 250th power (repeated matrix multiplication).
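The relation between PAM-0, PAM-1 and PAM-250 can be sketched with plain matrix multiplication. Real PAM matrices are 20 x 20 mutation-probability matrices over the amino acids; the 4 x 4 matrix `TOY_PAM1` below is an invented stand-in, used only to keep the example short.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam(pam1, n):
    """PAM-n as the n-th matrix power of PAM-1; n = 0 gives the identity."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical toy PAM-1: 0.99 on the diagonal, the remaining 0.01 of each
# row spread evenly over the three off-diagonal entries.
TOY_PAM1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    pam250 = pam(TOY_PAM1, 250)
    print([round(value, 3) for value in pam250[0]])
```

Each row still sums to 1 after any number of multiplications, while the diagonal shrinks towards the background frequency as the evolutionary distance grows.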
Sequence profiles (1J2J.B)
>1J2J.B mol:aa PROTEIN TRANSPORT
NVIFEDEEKSKMLARLLKSSHPEDLRAANKLIKEMVQEDQKRMEK
Log-odds scores
• BLOSUM is a log-odds (log-likelihood ratio) matrix:
• The likelihood of observing j given that you have i is
– $P(j|i) = P_{ij}/Q_i$
• The prior likelihood of observing j is
– $Q_j$, which is simply the background frequency of j
• The log-odds score is
– $S_{ij} = 2\log_2\left(P(j|i)/Q_j\right) = 2\log_2\left(P_{ij}/(Q_i Q_j)\right)$
– where $\log_2(x) = \ln(x)/\ln(2)$
– S is expressed in half-bits, hence the factor 2
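As a worked example of the score above, the sketch below evaluates two times the base-2 logarithm of P_ij / (Q_i Q_j). The frequencies passed in are made-up numbers picked so that the arithmetic is easy to follow; they are not taken from any real BLOSUM matrix.

```python
import math


def log_odds_half_bits(p_ij, q_i, q_j):
    """Log-odds score in half-bits: 2 * log2(P_ij / (Q_i * Q_j))."""
    return 2 * math.log2(p_ij / (q_i * q_j))


if __name__ == "__main__":
    # Pair seen twice as often as expected by chance: positive score.
    print(log_odds_half_bits(0.125, 0.25, 0.25))
    # Pair seen exactly as often as expected by chance: score 0.
    print(log_odds_half_bits(0.0625, 0.25, 0.25))
    # Pair seen half as often as expected by chance: negative score.
    print(log_odds_half_bits(0.03125, 0.25, 0.25))
```

Published BLOSUM tables list these values rounded to the nearest integer.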
BLAST Exercise
Genome browsers - UCSC
Intron - Exon structure
Single Nucleotide Polymorphism - SNP
SNPs
Protein 3D-structure
Protein structure
Primary structure: Amino acid sequence
Secondary structure: Helix/Beta sheet
Tertiary structure: Fold, 3D coordinates
Protein structure: α-helix
3₁₀-helix: 3 residues/turn - few, but not uncommon
α-helix: 3.6 residues/turn - by far the most common helix
Pi-helix: 4.1 residues/turn - very rare
Protein structure: β-strand/sheet
Protein folds
Class: Alpha, beta, alpha+beta and alpha/beta, plus a last class with few or no secondary structure (SS) elements
Architecture: Overall shape of a domain
Topology: Shared secondary structure connectivity
Neural Networks: From knowledge to information
Protein sequence -> Biological feature
• A data-driven method to predict a feature, given a set of training data
• In biology, input features could be amino acid or nucleotide sequences
• Secondary structure prediction
• Signal peptide prediction
• Surface accessibility
• Propeptide prediction
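To make "input features could be amino acid sequence" concrete, here is one common way to turn a sequence into numbers for a network: a sliding window around the residue of interest, with each residue written as a 20-element one-hot vector. The slides do not say which encoding is used, so this encoding, the window size and the function names are all assumptions made for illustration.

```python
# The 20 standard amino acids in 1-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20-element vector with a single 1 at the residue's index (all 0 if unknown)."""
    vector = [0] * len(AMINO_ACIDS)
    index = AMINO_ACIDS.find(residue)
    if index >= 0:
        vector[index] = 1
    return vector


def window_features(sequence, center, half_width=1):
    """Concatenate one-hot vectors for the window around `center`.

    Positions that fall outside the sequence are padded with zeros.
    """
    features = []
    for position in range(center - half_width, center + half_width + 1):
        if 0 <= position < len(sequence):
            features.extend(one_hot(sequence[position]))
        else:
            features.extend([0] * len(AMINO_ACIDS))
    return features


if __name__ == "__main__":
    x = window_features("NVIFEDEEK", center=0, half_width=1)
    print(len(x), sum(x))
```

A network would be trained on many such vectors paired with the known feature for the central residue, for example helix, strand or coil.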
Use of artificial neural networks
N-terminal - Signal peptide - Propeptide - Mature/active protein - C-terminal
Prediction of biological features: Surface accessibility
Predict surface accessibility from the amino acid sequence only.
Logo plots
Information content, how is it calculated - what does it mean.
Logo plots - Information Content
Sequence-logo
Calculate Information Content
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits
• Total height at a position is the ‘Information Content’ measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
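The calculation can be sketched for a single alignment column of nucleotides, reading the formula as the sum over letters a of p_a log2(p_a), plus log2(4). The function names are illustrative, and this sketch leaves out the small-sample correction that some logo tools apply.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Information content in bits of one alignment column."""
    total = len(column)
    weighted_logs = sum(
        (count / total) * math.log2(count / total)
        for count in Counter(column).values()
    )
    return weighted_logs + math.log2(alphabet_size)


def letter_heights(column, alphabet_size=4):
    """Height of each letter: its frequency times the column's information."""
    total = len(column)
    info = information_content(column, alphabet_size)
    return {
        letter: (count / total) * info
        for letter, count in Counter(column).items()
    }


if __name__ == "__main__":
    print(information_content("AAAA"))   # completely conserved column
    print(information_content("ACGT"))   # all four letters equally frequent
    print(letter_heights("AACC"))        # two letters sharing one column
```

A completely conserved column reaches the 2-bit maximum, a column with all four letters at equal frequency carries no information, and a column split evenly between two letters carries 1 bit, drawn as two letters of height 0.5 each.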
[Figure labels: ~0.5 each; completely conserved]