BIOINFORMATICS I EXERCISE 2 http://icbi.at/ bioinf
Jan 03, 2016
Spliceosome assembly
[Figure: snRNPs U1, U2, U4, U5 and U6 and U2AF on the pre-mRNA, with the sequence elements GU, A and YAG labelled]
+ ~200 non-snRNP proteins: hnRNP, SR proteins, RNA helicases, kinases and phosphatases, cyclophilins
ChIP procedure (Farnham, Nature Rev Genetics, 2009)
[Figure: PPAR and RXR (domains A/B, C, E/F) bound together to the PPRE in DNA; sequence shown: AACTAGGTCAAAGGTCA]
Notepad++ and regular expressions
Example pattern: ^>.*\r\n
^     beginning of line
>     the character ">" itself
.     any symbol
*     0 or more times
\r    carriage return (CR)
\n    line feed (LF)
character   meaning
\           escape; used to make specials non-special
()          group; you can retrieve its contents, e.g. with \1 for the first group
[]          any single character listed inside is considered a match
.           matches any character
*           match the previous character 0 or more times
+           match the previous character 1 or more times
{n}         match the previous character n times
^           if the first character in the regex, means “beginning of line”; as the first character inside [], means “not”
$           if the last character in the regex, means “end of line”
\s          any whitespace character (space, tab)
\t          tab (-->)
\r          carriage return (CR)
\n          line feed (LF)
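The special characters in the table can also be tried outside Notepad++. The following is a minimal sketch using Python's `re` module, which shares this basic syntax; the input strings are made up for illustration, and `re.MULTILINE` is needed so that `^` applies to every line rather than only to the start of the text.

```python
import re

# Made-up FASTA-like text with Windows (CR LF) line endings.
text = ">seq1 demo\r\nACGTTTGA\r\n"

# ^ anchors at the beginning of a line, . is any symbol, * repeats it.
header_lines = re.findall(r"^>.*\r\n", text, flags=re.MULTILINE)

# [] is a character class, + means one or more, {n} means exactly n times.
dna_runs = re.findall(r"[ACGT]+", "xxACGTxxGG")
has_triple_t = re.search(r"T{3}", text) is not None

# () captures a group; \1 and \2 refer to the groups in the replacement.
swapped = re.sub(r"(\w+)\s(\w+)", r"\2 \1", "hello world")

# ^ as first character inside [] negates the class; \. is a literal dot.
non_dna = re.findall(r"[^ACGT]", "AC-GT.N")
literal_dots = re.findall(r"\.", "a.b.c")

print(header_lines, dna_runs, has_triple_t, swapped, non_dna, literal_dots)
```

Small differences between regex engines exist, so a pattern that works here should still be checked in the Notepad++ search dialog before it is applied to a real file.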
^[ACGT].*\r\n replace with (nothing)
^(.{20}).*\r\n replace with \1\r\n
^>.*\r\n replace with (nothing)
\r\n replace with (nothing)
> replace with \r\n>
repeatMasking=none replace with \r\n
^>.*\r\n replace with (nothing)
.*(.{20})$ replace with \1
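The replacements above can be chained to turn a multi-line FASTA export into one sequence per line and then cut out the first or last 20 characters. The sketch below reproduces one possible order of these steps in Python; the order, the function names and the example records (headers, coordinates, sequences) are assumptions made for illustration and are not taken from the exercise data.

```python
import re

# Made-up FASTA export with CR LF line endings (not real UCSC output).
fasta = (
    ">rec1 range=chr0:1-60 strand=+ repeatMasking=none\r\n"
    "AAAAAAAAAACCCCCCCCCCGGGGGGGGGG\r\n"
    "TTTTTTTTTTGGGGGGGGGGAAAAAAAAAA\r\n"
    ">rec2 range=chr0:100-129 strand=+ repeatMasking=none\r\n"
    "ACGTACGTACGTACGTACGTACGTACGTAC\r\n"
)

def sequences_one_per_line(text):
    """Join wrapped sequence lines and drop the header lines."""
    text = re.sub(r"\r\n", "", text)                    # remove all line breaks
    text = re.sub(r">", "\r\n>", text)                  # each record starts a new line
    text = re.sub(r"repeatMasking=none", "\r\n", text)  # split header from sequence
    text = text.lstrip("\r\n") + "\r\n"                 # tidy first and last line break
    return re.sub(r"^>.*\r\n", "", text, flags=re.MULTILINE)  # delete header lines

def first_20(text):
    """Keep only the first 20 characters of every line."""
    return re.sub(r"^(.{20}).*\r\n", r"\1\r\n", text, flags=re.MULTILINE)

def last_20(text):
    """Keep only the last 20 characters of every line (applied line by line)."""
    return [re.sub(r".*(.{20})$", r"\1", line) for line in text.splitlines()]

sequences = sequences_one_per_line(fasta)
print(first_20(sequences))
print(last_20(sequences))
```

The last step is applied per line here because Python's `$` matches before the line feed only, so the carriage return would otherwise be counted as one of the 20 characters.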
Gene Ontology
The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.
Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a GO term.
3 organizing principles
• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)
Evidence code
ISS Inferred from Sequence Similarity
IEP Inferred from Expression Pattern
IMP Inferred from Mutant Phenotype
IGI Inferred from Genetic Interaction
IPI Inferred from Physical Interaction
IDA Inferred from Direct Assay
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred by Curator
ND No biological Data available
Directed acyclic graph (DAG) with different levels and 2 relations (part_of, is_a)
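Because GO is a DAG rather than a tree, a term can have several parents, and an annotation to a term implies annotation to all of its ancestors. A minimal sketch of this idea follows; the identifiers are made up (they only follow the GO:nnnnnnn pattern), and the dictionary layout and the function name `ancestors` are assumptions for illustration, not part of any GO tool.

```python
import re

# Pattern for the identifier format GO:nnnnnnn (seven digits).
GO_ID = re.compile(r"^GO:\d{7}$")

# Made-up mini-ontology: child -> list of (relation, parent).
EDGES = {
    "GO:9999904": [("is_a", "GO:9999902"), ("part_of", "GO:9999903")],
    "GO:9999902": [("is_a", "GO:9999901")],
    "GO:9999903": [("is_a", "GO:9999901")],
    "GO:9999901": [],  # root of this toy graph
}

def ancestors(term, edges):
    """Collect every term reachable by following is_a and part_of upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in seen:  # a parent may be reached via several paths
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("GO:9999904", EDGES)))
```

The `seen` set is what makes the traversal safe on a DAG: the shared root is reached through two paths but is visited only once.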
Orthologs
Homologs: A – B – C
Orthologs: B1 – C1
Paralogs: C1 – C2 – C3
Inparalogs: C2 – C3
Outparalogs: B2 – C1
Xenologs: A1 – AB1
Ortholog databases
• YOGY (eukarYotic OrtholoGY), a web-based resource that integrates 5 independent resources (Sanger)
• COG (Clusters of Orthologous Groups of proteins) and KOG for 7 eukaryotic genomes (NCBI)
• InParanoid (Stockholm Bioinformatics Center)
• HomoloGene (NCBI)
• OrthoMCL, which uses the Markov Clustering algorithm (University of Pennsylvania)
Exercise 2-1: REGULATORY GENOMICS
Pyruvate Carboxylase as example
Ensembl Biomart
1.1 For the human transcript NM_000920 (pyruvate carboxylase), find the official gene symbol, number of exons, Ensembl transcript ID, Ensembl gene ID, 3'UTR sequence as a FASTA file, and length of the 3'UTR.
microRNA target prediction
1.2 Is there a sequence within the 3'UTR of PC that is complementary to positions 2-8 of the microRNA hsa-mir-182?
UCSC genome browser
1.3 Find the positions of the transcription start site and the transcription end of pyruvate carboxylase (NM_000920) in the hg19 assembly.
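The complementarity check in 1.2 can be done by hand or with a few lines of code. The sketch below takes positions 2-8 of a microRNA, builds the reverse complement, and searches for it in a 3'UTR given in DNA letters. Both sequences are made-up placeholders (not hsa-mir-182 and not the PC 3'UTR), the function names are assumptions, and only perfect Watson-Crick pairing is considered (no G:U wobble).

```python
# Complement table for RNA letters.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_site(mirna):
    """Return the DNA sequence complementary to positions 2-8 of the microRNA."""
    seed = mirna.upper().replace("T", "U")[1:8]   # positions 2-8, 1-based
    site_rna = seed.translate(COMPLEMENT)[::-1]   # reverse complement
    return site_rna.replace("U", "T")             # UTR is given in DNA letters

def find_seed_sites(utr, mirna):
    """Return 1-based start positions of perfect seed matches in the UTR."""
    site = seed_site(mirna)
    utr = utr.upper().replace("U", "T")
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start + 1)
        start = utr.find(site, start + 1)
    return positions

# Placeholder sequences for illustration only.
mirna = "UAGCAUCGGAUCCAUGGCAUCA"
utr = "TTTTCGATGCTAAAACGATGCTGG"

print(seed_site(mirna))             # CGATGCT
print(find_seed_sites(utr, mirna))  # [5, 16]
```

Dedicated target prediction tools apply further criteria (for example conservation and site context), so a plain string match like this is only a first check.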
Find splicing signals
1.4 Get sequences (+10bp/-10bp) around intron-exon borders and exon-intron borders from pyruvate carboxylase using UCSC table browser and Notepad++.
1.5 Construct in both cases a sequence logo and a frequency plot. Can you identify (regulatory) sequence motifs?
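A frequency plot as asked for in 1.5 is based on a position frequency matrix: for every column of the aligned border sequences, count how often each base occurs. A minimal sketch is given below; the four short sequences are made up for illustration (the real input would be the 20 bp windows from 1.4), and the function names are assumptions.

```python
from collections import Counter

def position_frequencies(sequences, alphabet="ACGT"):
    """Return one dict of base frequencies per alignment column."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for column in zip(*sequences):
        counts = Counter(base.upper() for base in column)
        total = sum(counts[b] for b in alphabet)
        matrix.append({b: (counts[b] / total if total else 0.0) for b in alphabet})
    return matrix

def consensus(matrix):
    """Most frequent base per column."""
    return "".join(max(column, key=column.get) for column in matrix)

# Made-up aligned sequences, one per border.
borders = ["AGGTAA", "AGGTGA", "CGGTAA", "AGGTAC"]

matrix = position_frequencies(borders)
for position, column in enumerate(matrix, start=1):
    print(position, column)
print(consensus(matrix))
```

Columns where one base has a frequency close to 1 are the candidates for conserved signals; a sequence logo shows the same information scaled by information content.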
Regulatory motifs (transcription factor binding sites)
1.6 We know from chromatin immunoprecipitation (ChIP-seq) experiments in a mouse cell line that the transcription factor Pparg binds near the pyruvate carboxylase gene and hence potentially regulates its transcription (ppar.wig). Show the binding region as a custom track in the UCSC genome browser and extract the sequence.
Exercise 2-2: PROTEIN FUNCTION
Identify function/processes/pathways for a protein
2.1 What is the function of pyruvate carboxylase, and in which pathways and processes is this enzyme involved?
Show pathway maps and find the Enzyme ID (EC) using KEGG.
Identify functional domains and Gene Ontology annotation of the protein sequence using Uniprot, Prosite, Pfam.
Find orthologs and perform multiple sequence alignment
2.2 Find ortholog protein sequences in Mus musculus, Rattus norvegicus, Saccharomyces cerevisiae, perform a multiple sequence alignment using ClustalW, and visualize it with Jalview.