Graphics Application Lab
Genomic Sequence Alignments and Their Applications
Prof. Hwan-Gue Cho (조환규), Division of Information and Computer Engineering, College of Engineering, Pusan National University [email protected]
Jan 03, 2016
Biology and Informatics
Mathematics : Physics = X : Biology, X = ? Bioinformatics
  o Understanding biological systems with informatics
  o Molecular biology
  o Computational biology
    • Genomics
    • Proteomics
    • And several other -omics
Main features of This Talk
How can well-established computing methodologies be used in bioinformatics?
How can well-established bioinformatics methodologies be used in computer science?
Case study
    • The connection between genomic sequence alignment and program-copy detection
Computing Space Transform
[Figure: Normal Space mapped to Jewelry Space; operations labeled PLUS and MINUS]
Computing Space Transform (2)
Normal Space → log Space
  o Normal space: X, Y → multiply → X * Y
  o log space: log X, log Y → add → log X + log Y
  o log maps normal space to log space; exponent maps the result back
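The transform on this slide can be tried directly. The following is a minimal sketch, assuming positive inputs; the function name `multiply_via_logs` is made up for illustration and is not part of the talk.

```python
import math


def multiply_via_logs(x: float, y: float) -> float:
    """Multiply two positive numbers by adding in log space and mapping back."""
    if x <= 0 or y <= 0:
        raise ValueError("log-space multiplication needs positive inputs")
    # log takes us to log space, addition replaces multiplication,
    # and exp (the "exponent" arrow) brings the result back.
    return math.exp(math.log(x) + math.log(y))


print(multiply_via_logs(6.0, 7.0))  # close to 42, up to floating-point error
```

The point is the same as in the rest of the talk: a hard operation in one space becomes an easy one in another, as long as the mapping in both directions is cheap.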
Computing Space Transform (3)
Program Space → Genome Seq Space
  o Program a, Program b → (basic keyword mapping) → Protein a, Protein b
  o Similarity(Program a, Program b) is obtained as Similarity(Protein a, Protein b)
  o CLUSTAL-W(a, b): pairwise alignment
Namely, the “Well Theory” (우물론)
Should we keep digging a single well?
  o But what if no water ever comes out?
Should we dig a little at several wells in turn?
  o What if neither works?
Then how should we dig?
  o Avoid “reinventing the wheel”
Genomic Sequence Alignments
Genomic sequences are linear
  o DNA, RNA, protein (amino acids)
  o Why linear?
Goal of molecular biology or life science
  o Characterizing functions of genes
  o Understanding internal gene interactions
  o Understanding internal & external interactions
    • Drug targeting (protein interaction)
Why alignment?
Human Binome BANK in 3280
Year 3280, doomsday
So many binary data files
  o Chips (cell phone, cooker, TV, etc.)
  o Computer disks
Figure out the contents of the following
  o 010111010101000100101000000001010101010…. ?
  o 100000001010111111111001010101010010101… ?
  o 101011111111111110001010010100101010100… ?
Human-Binome-Project
They decided to establish a Binary BANK.
  o Some sequences are partially annotated
    • Object code, text data, garbage
    • Protein, gene, junk DNA
HUMAN-Binome PROJECT starts…
  o Goal: explore the functions of various binary sequences
  o Mini-binary project ~~ E. coli, C. elegans
    • Cell phone, calculator, PDA…
  o Full sequencing of a 300-gigabyte DISK
HUMAN BINOME BANK (HBB)!
For an Unknown Seq. X
Sequence X from a piece of hardware
  o Several error bits included
  o Fragment sequencing
Find a similar pattern in HBB
  o Function?
  o Region?
  o Size?
Write a BIG paper….
  o Here it comes………
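A Study on the Function of the Seq-45-6-X Species Commonly Found in Peninsula Region SM-6-5
Palbong-i, Grade-2 Monkey, Space University, Department of Molecular Paleobiology
The function of the genetic species found so frequently in region SM-6-5 is … presumed to be part of a tool that these tribes used intensively to write down their language. They are also presumed to have had a very diverse notation, because the entropy estimated from the frequency with which each word appears is presumed to be far larger than the roughly 26 found in region PR-99 ……………………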
Journal of Science & Sciences No.187 (Vol. 213), pp.232-265, 3288
Dynamic Programming
A basic methodology
  o For all kinds of alignment
  o The solution is built from the solutions of all subproblems
What you need
  o Objective function
  o Dynamic programming formula (recursion)
    • F(n) = F(n-1) + F(n-2)
  o Base condition, F(0) = F(1) = 1
  o Table, multi-dimensional array structure
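The ingredients listed above (recursion, base condition, table) can be seen together in a few lines. This is a minimal sketch using the slide's own recurrence and base condition; the function name `fib_table` is an illustrative choice.

```python
def fib_table(n: int) -> list[int]:
    """Fill a table bottom-up with F(i) = F(i-1) + F(i-2), F(0) = F(1) = 1."""
    table = [1] * (n + 1)  # base condition covers F(0) and F(1)
    for i in range(2, n + 1):
        # Each entry is computed once from entries already in the table.
        table[i] = table[i - 1] + table[i - 2]
    return table


print(fib_table(6))  # [1, 1, 2, 3, 5, 8, 13]
```

Alignment uses the same pattern, only with a two-dimensional table and a maximum over three neighbours instead of a sum over two.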
Global Alignment(1)
Basic scoring:
  o Match: 1, Mismatch: -1, Space: -2
How?
  o Find the alignment of the two sequences with maximal score
  o The sequence alignment problem corresponds to the longest path problem from the source to the sink in a directed acyclic graph (the alignment grid).
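A minimal sketch of the global alignment score under the basic scoring above (match 1, mismatch -1, space -2). It returns only the score, not the alignment itself, and the function and parameter names are illustrative assumptions.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1, mismatch: int = -1,
                           space: int = -2) -> int:
    """Score of the best global alignment of s and t (score only, no traceback)."""
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty string costs one space per character.
    for i in range(1, m + 1):
        score[i][0] = i * space
    for j in range(1, n + 1):
        score[0][j] = j * space
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,   # s[i] with t[j]
                              score[i - 1][j] + space,      # s[i] with a space
                              score[i][j - 1] + space)      # space with t[j]
    return score[m][n]


print(global_alignment_score("CACAGTGT", "CAGGT"))  # -1
```

For CACAGTGT and CAGGT the value -1 comes from five matches and three unavoidable spaces (5 - 6), since CAGGT is a subsequence of CACAGTGT.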
Global Alignment (2)
CACAGTGT and CAGGT
[Figure: dynamic programming score table for CACAGTGT (columns) and CAGGT (rows), with first row 0, -2, …, -16 and first column 0, -2, …, -10, and the traceback of an optimal alignment]
Local Alignment(1)
An alignment between a substring of s and a substring of t
  o Each entry (i, j) holds the highest score of an alignment between a suffix of s[1…i] and a suffix of t[1…j]
Example: AGGTATTGA - CCTATGGC
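A minimal sketch of the local alignment score. It assumes the same basic scoring as the global case (match 1, mismatch -1, space -2), which the slide does not state explicitly; names are illustrative.

```python
def local_alignment_score(s: str, t: str,
                          match: int = 1, mismatch: int = -1,
                          space: int = -2) -> int:
    """Highest score over all alignments of a substring of s with a substring of t."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay 0: an empty suffix always scores 0.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                            # start a new alignment here
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + space,
                              score[i][j - 1] + space)
            best = max(best, score[i][j])
    return best


print(local_alignment_score("AGGTATTG", "CTATGC"))  # 3, from the shared substring TAT
```

The only changes from the global version are the extra 0 in the maximum, the zero boundaries, and taking the best entry anywhere in the table instead of the bottom-right corner.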
Local Alignment (2)
AGGTATTG and CTATGC
[Figure: local alignment score table for the two sequences; the first row and first column are all 0, and the traceback of the best local alignment is marked]
Semi-global Alignment(1)
Given two sequences, check whether one of them has a substring similar to the entire other sequence.
How?
  o Find alignments while ignoring the spaces at the beginning and end of the sequences
Comparison with global alignment
  o Semi-global:
      CAGCA-CTTGGATTCTCGG
      ---CAGCGTGG--------
    (score: -19 when end spaces are charged; 3 when they are ignored)
  o Global:
      CAGCACTTGGATTCTCGG
      CAGC-----G-T----GG
    (score: -12)
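A minimal sketch of semi-global scoring, assuming that leading and trailing spaces are free in both sequences and that the basic scoring (1, -1, -2) applies everywhere else. The names are illustrative, not from the talk.

```python
def semiglobal_alignment_score(s: str, t: str,
                               match: int = 1, mismatch: int = -1,
                               space: int = -2) -> int:
    """Best alignment score when beginning and end spaces are not charged."""
    m, n = len(s), len(t)
    # Zero first row and column: leading spaces are free.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + space,
                              score[i][j - 1] + space)
    # Trailing spaces are free: take the best entry in the last row or column.
    last_row = max(score[m])
    last_col = max(row[n] for row in score)
    return max(last_row, last_col)


print(semiglobal_alignment_score("AAACGTAAA", "CGT"))  # 3
```

Under the same assumptions, the pair from the slide, CAGCACTTGGATTCTCGG and CAGCGTGG, scores at least 3, because the semi-global alignment shown there is one of the candidates the table considers.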
Semi-global Alignment (2)
CACAGTGT and CAGGT
[Figure: semi-global alignment score table for CACAGTGT (columns) and CAGGT (rows), with the traceback of the best alignment marked]
General Gap Penalty
Definition
  o Gap: a run of k > 1 consecutive spaces
  o When mutations are involved, a gap with k spaces is more probable than k isolated spaces
  o w(k): penalty associated with a gap of k spaces
Affine Gap Penalty Function
Penalty for consecutive spaces <= penalty for isolated spaces
Sub-additive function
  o w(k1 + k2 + … + kn) <= w(k1) + w(k2) + … + w(kn)
Three arrays for dynamic programming
  o a[i,j] = maximum score of an alignment between s[1…i] and t[1…j] that ends in s[i] matched with t[j]
  o b[i,j] = maximum score of an alignment between s[1…i] and t[1…j] that ends in a space matched with t[j]
  o c[i,j] = maximum score of an alignment between s[1…i] and t[1…j] that ends in s[i] matched with a space
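A minimal sketch of the three-array recursion, assuming an affine penalty w(k) = gap_open + gap_extend * k. The concrete values (2 and 1), the match and mismatch scores, and all names are assumptions made for illustration.

```python
NEG_INF = float("-inf")


def affine_alignment_score(s: str, t: str, match: int = 1, mismatch: int = -1,
                           gap_open: int = 2, gap_extend: int = 1) -> float:
    """Global alignment score with gap penalty w(k) = gap_open + gap_extend * k."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in s[i] with t[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in space with t[j]
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends in s[i] with space
    a[0][0] = 0
    for j in range(1, n + 1):
        b[0][j] = -(gap_open + gap_extend * j)
    for i in range(1, m + 1):
        c[i][0] = -(gap_open + gap_extend * i)
    new_gap = gap_open + gap_extend  # cost of the first space of a gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = pair + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - new_gap,
                          b[i][j - 1] - gap_extend,   # extend a gap of the same kind
                          c[i][j - 1] - new_gap)
            c[i][j] = max(a[i - 1][j] - new_gap,
                          b[i - 1][j] - new_gap,
                          c[i - 1][j] - gap_extend)   # extend a gap of the same kind
    return max(a[m][n], b[m][n], c[m][n])


print(affine_alignment_score("ACGTACGT", "ACGT"))  # -2: four matches minus w(4) = 6
```

Keeping three arrays is what lets the recursion tell "open a new gap" apart from "extend the current gap", which a single table cannot do.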
Heuristic Alignment
Main difficulties
  o Search space: O(n^2) space or O(n^2 log n) time
  o Optimality or a biologically good distance metric
  o Multiple alignment
Local search
  o Diagonal region searching
  o Visualization, e.g., Dotlet
BLAST approach for long sequences
  o Small word matching
  o Extending from a highly matched region
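The two BLAST bullets (small word matching, then extending) can be illustrated with a toy seed-and-extend routine. This is a simplified sketch, not BLAST itself: it uses exact k-letter words and extends only while characters match exactly; the word length k = 3 and all names are assumptions.

```python
def seed_and_extend(s: str, t: str, k: int = 3) -> list[tuple[int, int, int]]:
    """Return (start in s, start in t, length) for exact matches grown from k-word seeds."""
    # Index every k-letter word of t by its start positions.
    index: dict[str, list[int]] = {}
    for j in range(len(t) - k + 1):
        index.setdefault(t[j:j + k], []).append(j)

    hits = set()
    for i in range(len(s) - k + 1):
        for j in index.get(s[i:i + k], []):
            lo_i, lo_j = i, j
            # Extend to the left while the characters keep matching.
            while lo_i > 0 and lo_j > 0 and s[lo_i - 1] == t[lo_j - 1]:
                lo_i -= 1
                lo_j -= 1
            hi_i, hi_j = i + k, j + k
            # Extend to the right while the characters keep matching.
            while hi_i < len(s) and hi_j < len(t) and s[hi_i] == t[hi_j]:
                hi_i += 1
                hi_j += 1
            hits.add((lo_i, lo_j, hi_i - lo_i))
    return sorted(hits)


print(seed_and_extend("AGGTATTG", "CTATGC"))  # [(3, 1, 3)]
```

Only the positions that share a word are ever examined, which is why this style of search avoids filling the full O(n^2) table.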
Multiple Alignment
  o ------AG---T----CGCTGC----
  o ------AGCGAT--CGCGCTGC---
  o ---TCGAGGCAA--GCTGCTGC-----
  o ------GGCGAT----CGCTGC-----
Problem hardness:
  o Optimal alignment: NP-hard
    • For almost all kinds of objective functions
  o What if there are more than 1000 sequences?
    • SPACE COMPLEXITY IN PRACTICE
Pairwise alignment
Star alignment
Tree alignment
Why multiple alignment ?
Finding conserved regions
Computer virus phylogeny construction
    • 300 sp./year
    • More than 10000 sp.: N
    • Number of files in a system: M > 100000
    • Detecting a computer virus (CV) takes O(N*M) checks!
Tuberculosis
    • 8 species, a conserved region, and polymorphic sites
    • Prof. Kim Cheol-min (Pusan National University College of Medicine): building a diagnostic chip
Phylogeny construction
Phylogeny 1 : hard Version
Phylogeny 2 : Probable version
Constructing Phylogenetic Tree
Distance matrix
      A    B    C    D    E
A     0    4   17    3    8
B          0   11    5   12
C               0    6   12
D                    0    8
E                         0
Optimal tree?
    • Degree constraint, Steiner points, Quartet method
[Figure: phylogenetic tree with leaves A, B, D, E, C]
PART 2: Application
Detecting Source Code Plagiarism
Plagiarism, Plagiarism, Plagiarism
Linear structure
  o Genomic sequences
  o Plain articles
  o Programs
  o Human behaviors on the time-line
  o Time-series data sets
Student report plagiarism
Assignment program copying
Where is the original version of this one?
Web searching: redundancy elimination
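Fingerprint-Based Plagiarism Detection Systems
System | Target | Detection method | Notes
Plagiarism.org | General documents | Undisclosed | Online, paid; operates a large database
IntegriGuard | General documents | Undisclosed | Online, paid
EVE2 | General documents | Undisclosed | Online, paid; searches the Internet for similar documents
CopyCatch | General documents | Frequency of vocabulary shared between documents | Offline
WordCheck | General documents | Word usage counts within a document | Offline
COPS | General documents | Sentence matching | Offline; stores sentences in a DB
SCAM | General documents | Word matching | Offline; stores words in a DB
교수 클럽 (Professor Club) | General documents | Frequency of words, special characters, formulas, etc. | Online, paid
SIM | Program source code | Compares reference counts of tokens |
Siff | Program source code | Extracts and compares 50 representative characters |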
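Structure-Based Plagiarism Detection Systems
System | Target | Detection method | Notes
CHECK | LaTeX documents | Builds a tree from the document's characteristics | Partially applies keyword distribution (fingerprinting)
Plague | Program source code | Longest Common Subsequence |
YAP | Program source code | String matching |
YAP3 | Program source code | Karp-Rabin Greedy-String-Tiling | Also checks plagiarism in assignments written in English
MOSS | Program source code | String matching | Online system
JPlag | Program source code | Greedy-String-Tiling | Online system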
Fingerprinting Method
Keyword frequency similarity
  o Usage counts of specific words
  o Fingerprinting object
    • Fixed-size fingerprint
    • Easy to build a database
    • Quick searching
    • High false positive rate
Example, fingerprint vector
A c x t u x g r N …..
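A minimal sketch of a keyword-frequency fingerprint and one possible similarity measure. The keyword list, the use of cosine similarity, and the function names are all assumptions for illustration; the systems in the tables above do not necessarily work this way.

```python
import math
import re
from collections import Counter

# Hypothetical fixed keyword list; its length is the fingerprint size.
KEYWORDS = ("int", "float", "for", "while", "if", "else", "return")


def fingerprint(source: str) -> list[int]:
    """Count how often each keyword occurs; word order is ignored."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", source))
    return [counts[word] for word in KEYWORDS]


def cosine_similarity(u: list[int], v: list[int]) -> float:
    """Cosine of the angle between two fingerprint vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


original = "int x; if (x) return x; else return 0;"
shuffled = "else return 0; int x; if (x) return x;"
print(fingerprint(original))                                            # [1, 0, 0, 0, 1, 1, 2]
print(cosine_similarity(fingerprint(original), fingerprint(shuffled)))  # close to 1.0
```

Because the fingerprint discards order, two texts with the same word counts look identical, which is one way to see why a fixed-size fingerprint is fast to search but prone to false positives.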
Attacking
Inserting redundant words
Shuffling
Cons and pros
  o Easy to use in document applications
  o Hard to use on program files
Recent trends
  o Structure-oriented similarity measures
  o Greedy-Block-Removing methods…
  o Is this a basic concept of local alignment?
Sample-Report-Server building
Undergraduate Assignment
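Programming assignment cheating:
  o In theory, there is no way to distinguish it.
  o Assignment cheating is costly
    • Password breaking by Mafia
Assignments produce identical output.
  o Compare only correct programs with one another
The time allowed for an assignment is relatively short (about 3-4 days).
The number of students is moderate (300 or fewer).
Everyone uses the same programming language.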
Program Cheating Techniques
Complete copying
Variable exchange
Garbage code insertion
Function transpose
Code rewriting (partially)
Library code replacing
Merging different codes
Function resolving
Function rewriting
Computing Space Transform
Program Space → Genome Seq Space
  o Program a, Program b → (basic keyword mapping) → Protein a, Protein b
  o Similarity(Program a, Program b) is obtained as Similarity(Protein a, Protein b)
  o CLUSTAL-W(a, b): pairwise alignment
PROGRAM to PROTEIN
Program language
  o Keyword = { int, float, class, … }
  o Block structure = “}”, “{”
Program chromosome
  o Location-independent code, Java class, C files
Non-coding region
  o /* this is a sample non-coding region */
Promoter
  o Variable declaration, class definition
DNA = keyword sequence
Extracting Program DNA
Syntactic level vs. semantic level
[Diagram labels: Program, Flow-graph, syntax, real running, syntactic running]
Flow-Graph Linearization
[Figure: flow-graph with nodes A, B, S, R, W, Q]
Linearized sequences:
  o $ A B S R S B W Q %
  o $ A B S R R R R S B W Q %
  o $ A B W W W W Q %
  o $ A W W Q %
Example
main( ) {
    int i, j, k;
    …………
    for (i = 1; i <= 100; i++) {
        ………
        if ( ) x = y; else …….
        while ( ccccc ) { }
        x = 23984;
    } // end of for
    ………..
}
Keyword sequence: int for if = else while =
Program DNA: AGTCGCTTCGAAGCAA
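A minimal sketch of turning source text into a keyword sequence and then into a letter string. The keyword-to-letter table, the tokenizer, and the sample source are hypothetical; the slide does not give the actual mapping behind AGTCGCTTCGAAGCAA.

```python
import re

# Hypothetical mapping from program keywords and symbols to single letters.
KEYWORD_TO_LETTER = {
    "int": "A", "for": "C", "if": "D", "else": "E",
    "while": "F", "=": "G", "{": "H", "}": "I",
}

# Identifiers, comparison operators (so "==" is not read as "="), "=", braces.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|==|<=|>=|!=|=|[{}]")


def keyword_sequence(source: str) -> list[str]:
    """Keep only the tokens that appear in the mapping table, in program order."""
    return [tok for tok in TOKEN_PATTERN.findall(source) if tok in KEYWORD_TO_LETTER]


def program_sequence(source: str) -> str:
    """Encode the keyword sequence as one letter per keyword."""
    return "".join(KEYWORD_TO_LETTER[tok] for tok in keyword_sequence(source))


sample = "int n; if (n == 0) n = 1; else n = 2; while (n) { n = n - 1; }"
print(keyword_sequence(sample))  # ['int', 'if', '=', 'else', '=', 'while', '{', '=', '}']
print(program_sequence(sample))  # ADGEGFHGI
```

Variable names, constants, and comments drop out of the sequence, so renaming variables or inserting comments leaves the encoded string unchanged.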
Why Protein mapping ?
DNA sequence overlap
  o if = AA, then = AG, * = GA, return = GG
  o AAGGA = AG + GA or AA + GG + A
    • Ambiguity resolving
20 amino acid bases
  o About 20 keywords
  o 2-3 groups
    • Polar, non-polar
    • Hydrophobic, hydrophilic
    • Charged, uncharged
    • Small, large
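The overlap problem on this slide can be reproduced with the four codes it lists. This is a minimal sketch; the function name `read_frame` and the idea of showing the two readings as reading frames are illustrative choices.

```python
# Two-letter codes taken from the slide.
CODE = {"AA": "if", "AG": "then", "GA": "*", "GG": "return"}


def read_frame(dna: str, offset: int) -> list[str]:
    """Decode consecutive two-letter words starting at the given offset."""
    words = []
    for i in range(offset, len(dna) - 1, 2):
        # "?" marks a pair that has no keyword assigned.
        words.append(CODE.get(dna[i:i + 2], "?"))
    return words


print(read_frame("AAGGA", 0))  # ['if', 'return']  (AA + GG, trailing A left over)
print(read_frame("AAGGA", 1))  # ['then', '*']     (AG + GA, leading A skipped)
```

With one letter per keyword, as in a 20-letter amino acid alphabet, every position decodes to exactly one keyword and this ambiguity cannot arise.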
Amino Acids classification
Keyword Mapping Strategy
Convertibility = { for, while }: easy
Convertibility = { for, then }: hard
Convertibility = { if, ‘=’ }: impossible?
Procedure
  o Preprocessing
  o Chromosome arrangement
  o Keyword selection
  o Protein mapping
Copy Detecting System
CDS components = [K, M, P, T, A, G]
  o K: keyword table, 20 keywords
  o M: matching score matrix, borrowed from protein (PAM)
  o P: affine gap penalty
  o T: threshold length
  o A: alignment set scoring
  o G: maximum gap allowed
Experiment Overview
Sample programs: “data structure” course
Students: 60
Programming assignments: 12 sets
1 semester
On-line evaluation system = ESPA
  o Java-based on-line evaluation system
  o Due: 1 week
We do not monitor all programs
Clustal-W (1)(www2.ebi.ac.uk/clusterw)
Input Fasta file
Clustal-W(2) (www2.ebi.ac.uk/clusterw)
Output
PhyloDraw (Cho et al., Bioinformatics 2001)
Experiment Result 1-0
11 programs containing similar groups
Experiment Result 1-1
Unit distance topological Representation
Experiment Result 1-2
Rooted representation
Experiment Result 1-3
Time-dependent dendrogram
Experiment Result 1-4
Time-independent dendrogram
Experiment Result 2
13 programs containing similar groups
Experiment Result 3
13 programs containing similar groups
Experiment Result 4
17 programs containing similar groups
Experiment Result 5
21 programs containing similar groups
Experiment Result 6
14 programs with low similarity
Another Application
Music score plagiarism
  o Tempo, melody line….
  o C major, A minor, key transformation
Credit card bankruptcy alert
Drinking alert
Annotated time-line
Application
Web search engine
  o Eliminate redundant documents
  o E.g., query = “썬베드” (sunbed) in the EMPASS search engine
    • Of the top 10 related documents retrieved, 8 were the same document
  o Similar cases occur in newspaper article search
Original paper
Further Work
Program DNA-Bank server
Copying phylogenetics building
Parametric method
  o Fixed-size fingerprinting = program protein
  o PAM for program copying behavior
  o Real practice
University Report Oracle
Music plagiarism
  o Phylogeny tree for old classical music (Palestrina to Brahms)
How to linearize a procedure call?
  o Parameter tuning
  o A procedure call is a sort of directed graph
Fast and moderate-size Program DNA
Conclusion
Bioinformatics
  o Bioinformatics
    • Bioinformatics
Bioinformatics
Bioinformatics
A brave new world……
Linear structure similarity
  o Local alignment
  o Gap penalty
  o Structure-based similarity
Good application
PUSAN BIOINFORMATICS JIHAD
Realtime Home-Bioinformatics