General nucleic acid Sequence databases EMBL:(European Molecular Biology Laboratory) GenBank: NCBI (National Center for.

General nucleic acid sequence databases

• EMBL (European Molecular Biology Laboratory): http://www.ebi.ac.uk/Information/

• GenBank, NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/

• DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/

Entry name; accession number; version number

General protein sequence databases

• SWISS-PROT

• PIR

• PRF/SEQDB

• PDB: the largest data bank of three-dimensional (3-D) biological macromolecular structure data.

Databases of coding sequences (CDS) obtained by translation:

• TrEMBL

• GenPept

• SWISS-PROT is a highly curated database that contains excellent documentation. SWISS-PROT systematically merges variants and fragments into a single entry, but is greatly lagging behind the growth of the DNA data banks.

• PIR contains more sequences, including numerous “really sequenced” oligopeptides, but is not that tightly curated.

• The “automatic” data banks such as TrEMBL and GenPept are even larger, but contain little documentation and sometimes conceptual translations that are not actually found in nature.

BLAST (Basic Local Alignment Search Tool)

• The BLAST algorithm breaks the query sequence into short fragments, or “words,” and looks for an identical or close match between those words and words from the database sequences. When such a match or “hit” is encountered, the hit is extended in both directions to generate a local alignment segment. The quality of each alignment is quantified in a score, and the high-scoring segment pairs (HSPs) are reported in a table.

• The BLAST programs are BLASTN, which compares a nucleotide query sequence with a nucleotide sequence database; BLASTP, which compares a protein query sequence with a protein sequence database; BLASTX, which compares a nucleotide query sequence translated in all six reading frames with a protein sequence database; TBLASTN, which compares a protein query sequence with a nucleotide sequence database dynamically translated in all six reading frames; and TBLASTX, which compares a six-frame translation of a nucleotide query sequence with the six-frame translations of a nucleotide sequence database.
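The word-matching and extension steps of BLAST can be sketched as follows. This is a simplified illustration, not the NCBI implementation: it seeds on identical words only, extends without gaps, and the names `find_seeds` and `extend_seed`, the word size, the match/mismatch scores and the `drop` cut-off are all assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every identical word shared by both sequences."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, i, j, step, match, mismatch, drop):
    """Walk in one direction; return (best score, length that gives that score)."""
    score = best = best_len = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break  # the score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, drop=2):
    """Extend a word hit in both directions (ungapped) into a local alignment segment."""
    right, right_len = _extend(query, subject, i + word_size, j + word_size,
                               1, match, mismatch, drop)
    left, left_len = _extend(query, subject, i - 1, j - 1, -1, match, mismatch, drop)
    length = left_len + word_size + right_len
    score = left + word_size * match + right
    q_start, s_start = i - left_len, j - left_len
    return score, query[q_start:q_start + length], subject[s_start:s_start + length]
```

For example, `extend_seed("ACGTACGT", "TTACGTACGTTT", 0, 2)` grows the three-letter hit `ACG` into the full eight-letter segment. Real BLAST also accepts close (non-identical) word matches and evaluates the statistical significance of each HSP.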


Sequence alignment

Chapter 5

Measuring Genetic Change

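D = s + wg

Here s counts the substitutions, g counts the gaps, and w is the weight (penalty) given to a gap. For two candidate alignments, P1 (s = 0, g = 2) and P2 (s = 2, g = 1):

w = 1: P1 = 0 + 1×2 = 2; P2 = 2 + 1×1 = 3

w = 2: P1 = 0 + 2×2 = 4; P2 = 2 + 2×1 = 4

w = 3: P1 = 0 + 3×2 = 6; P2 = 2 + 3×1 = 5

Small w: each gap has little impact, so the preferred alignment contains many gaps.

Large w: each gap has a large impact, so the preferred alignment contains few gaps.

Many gaps or highly variable sequences: a small w can be chosen.

Few gaps or conserved sequences: a large w can be chosen.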
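The effect of the gap weight w in D = s + wg can be checked with a short sketch. The helper names `count_changes` and `distance` are made up for this example, and counting every gapped column as one gap is an assumption (the slides do not say whether g is the number of gaps or the total gap length).

```python
def count_changes(seq_a, seq_b):
    """Return (s, g) for two aligned sequences of equal length.

    s = columns where two different residues are paired,
    g = columns where either sequence has a gap character '-'.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    s = g = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            g += 1
        elif a != b:
            s += 1
    return s, g


def distance(s, g, w):
    """Alignment distance D = s + w * g."""
    return s + w * g


# The two alignments from the slide: P1 has s=0, g=2 and P2 has s=2, g=1.
for w in (1, 2, 3):
    print(w, distance(0, 2, w), distance(2, 1, w))
```

With w = 1 the gapped alignment P1 is cheaper, with w = 2 the two tie, and with w = 3 the substitution-rich alignment P2 is cheaper, which is the trade-off the slide describes.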

Blosum62mt

     A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A    8  -4   0  -4  -2  -4   0  -4  -2  -2  -2  -2  -4  -2  -2  -2   2   0   0  -6   0  -4  -2
B   -4   8  -6   8   2  -6  -2   0  -6   0  -8  -6   6  -4   0  -2   0  -2  -6  -8  -2  -6   2
C    0  -6  18  -6  -8  -4  -6  -6  -2  -6  -2  -2  -6  -6  -6  -6  -2  -2  -2  -4  -4  -4  -6
D   -4   8  -6  12   4  -6  -2  -2  -6  -2  -8  -6   2  -2   0  -4   0  -2  -6  -8  -2  -6   2
E   -2   2  -8   4  10  -6  -4   0  -6   2  -6  -4   0  -2   4   0   0  -2  -4  -6  -2  -4   8
F   -4  -6  -4  -6  -6  12  -6  -2   0  -6   0   0  -6  -8  -6  -6  -4  -4  -2   2  -2   6  -6
G    0  -2  -6  -2  -4  -6  12  -4  -8  -4  -8  -6   0  -4  -4  -4   0  -4  -6  -4  -2  -6  -4
H   -4   0  -6  -2   0  -2  -4  16  -6  -2  -6  -4   2  -4   0   0  -2  -4  -6  -4  -2   4   0
I   -2  -6  -2  -6  -6   0  -8  -6   8  -6   4   2  -6  -6  -6  -6  -4  -2   6  -6  -2  -2  -6
K   -2   0  -6  -2   2  -6  -4  -2  -6  10  -4  -2   0  -2   2   4   0  -2  -4  -6  -2  -4   2
L   -2  -8  -2  -8  -6   0  -8  -6   4  -4   8   4  -6  -6  -4  -4  -4  -2   2  -4  -2  -2  -6
M   -2  -6  -2  -6  -4   0  -6  -4   2  -2   4  10  -4  -4   0  -2  -2  -2   2  -2  -2  -2  -2
N   -4   6  -6   2   0  -6   0   2  -6   0  -6  -4  12  -4   0   0   2   0  -6  -8  -2  -4   0
P   -2  -4  -6  -2  -2  -8  -4  -4  -6  -2  -6  -4  -4  14  -2  -4  -2  -2  -4  -8  -4  -6  -2
Q   -2   0  -6   0   4  -6  -4   0  -6   2  -4   0   0  -2  10   2   0  -2  -4  -4  -2  -2   6
R   -2  -2  -6  -4   0  -6  -4   0  -6   4  -4  -2   0  -4   2  10  -2  -2  -6  -6  -2  -4   0
S    2   0  -2   0   0  -4   0  -2  -4   0  -4  -2   2  -2   0  -2   8   2  -4  -6   0  -4   0
T    0  -2  -2  -2  -2  -4  -4  -4  -2  -2  -2  -2   0  -2  -2  -2   2  10   0  -4   0  -4  -2
V    0  -6  -2  -6  -4  -2  -6  -6   6  -4   2   2  -6  -4  -4  -6  -4   0   8  -6  -2  -2  -4
W   -6  -8  -4  -8  -6   2  -4  -4  -6  -6  -4  -2  -8  -8  -4  -6  -6  -4  -6  22  -4   4  -6
X    0  -2  -4  -2  -2  -2  -2  -2  -2  -2  -2  -2  -2  -4  -2  -2   0   0  -2  -4  -2  -2  -2
Y   -4  -6  -4  -6  -4   6  -6   4  -2  -4  -2  -2  -4  -6  -2  -4  -4  -4  -2   4  -2  14  -4
Z   -2   2  -6   2   8  -6  -4   0  -6   2  -6  -2   0  -2   6   0   0  -2  -4  -6  -2  -4   8

PAM500

     A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A    1  -1   0   1  -2   0   1   1   0   0  -1   0  -1  -3   1   1   1  -6  -3   0   1   0   0  -9
R   -1   5   1   0  -4   2   0  -1   2  -2  -2   4   0  -4   0   0   0   4  -4  -2   0   1   0  -9
N    0   1   1   2  -3   1   1   1   1  -1  -2   1  -1  -4   0   1   0  -5  -3  -1   1   1   0  -9
D    1   0   2   3  -5   2   3   1   1  -2  -3   1  -2  -5   0   1   0  -7  -5  -1   2   2   0  -9
C   -2  -4  -3  -5  22  -5  -5  -3  -4  -2  -6  -5  -5  -3  -2   0  -2  -9   2  -2  -4  -5  -2  -9
Q    0   2   1   2  -5   2   2   0   2  -1  -2   1  -1  -4   1   0   0  -5  -4  -1   2   2   0  -9
E    1   0   1   3  -5   2   3   1   1  -2  -3   1  -1  -5   0   1   0  -7  -5  -1   2   2   0  -9
G    1  -1   1   1  -3   0   1   4  -1  -2  -3   0  -2  -5   1   1   1  -8  -5  -1   1   1   0  -9
H    0   2   1   1  -4   2   1  -1   4  -2  -2   1  -1  -2   0   0   0  -2   0  -2   1   2   0  -9
I    0  -2  -1  -2  -2  -1  -2  -2  -2   3   4  -2   3   2  -1  -1   0  -5   0   3  -2  -2   0  -9
L   -1  -2  -2  -3  -6  -2  -3  -3  -2   4   7  -2   4   4  -2  -2  -1  -1   1   3  -3  -2  -1  -9
K    0   4   1   1  -5   1   1   0   1  -2  -2   4   0  -5   0   0   0  -3  -5  -2   1   1   0  -9
M   -1   0  -1  -2  -5  -1  -1  -2  -1   3   4   0   4   1  -1  -1   0  -4  -1   2  -1  -1   0  -9
F   -3  -4  -4  -5  -3  -4  -5  -5  -2   2   4  -5   1  13  -4  -3  -3   3  13   0  -4  -5  -2  -9
P    1   0   0   0  -2   1   0   1   0  -1  -2   0  -1  -4   4   1   1  -6  -5  -1   0   1   0  -9
S    1   0   1   1   0   0   1   1   0  -1  -2   0  -1  -3   1   1   1  -3  -3  -1   1   0   0  -9
T    1   0   0   0  -2   0   0   1   0   0  -1   0   0  -3   1   1   1  -6  -3   0   0   0   0  -9
W   -6   4  -5  -7  -9  -5  -7  -8  -2  -5  -1  -3  -4   3  -6  -3  -6  34   2  -6  -6  -6  -4  -9
Y   -3  -4  -3  -5   2  -4  -5  -5   0   0   1  -5  -1  13  -5  -3  -3   2  15  -1  -4  -4  -2  -9
V    0  -2  -1  -1  -2  -1  -1  -1  -2   3   3  -2   2   0  -1  -1   0  -6  -1   3  -1  -1   0  -9
B    1   0   1   2  -4   2   2   1   1  -2  -3   1  -1  -4   0   1   0  -6  -4  -1   2   2   0  -9
Z    0   1   1   2  -5   2   2   1   2  -2  -2   1  -1  -5   1   0   0  -6  -4  -1   2   2   0  -9
X    0   0   0   0  -2   0   0   0   0   0  -1   0   0  -2   0   0   0  -4  -2   0   0   0   0  -9
*   -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9   1

• The cost for every pair of possible amino acid replacements defines a cost matrix that can be used to score the alignment. Protein sequence alignment programmes typically use matrices derived from empirical comparisons of protein sequences
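A minimal sketch of scoring an alignment with such a matrix. The six scores are entries for A, G and S taken from the Blosum62mt table; the gap penalty of -8 and the names `pair_score` and `score_alignment` are assumptions made only for this illustration.

```python
# A few entries of the Blosum62mt matrix (symmetric, so each pair is stored once).
SCORES = {
    ("A", "A"): 8, ("G", "G"): 12, ("S", "S"): 8,
    ("A", "S"): 2, ("A", "G"): 0, ("G", "S"): 0,
}


def pair_score(x, y):
    """Look up the score for aligning residue x with residue y."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]  # raises KeyError for residues outside the excerpt


def score_alignment(seq_a, seq_b, gap=-8):
    """Sum the matrix scores column by column; a gapped column costs `gap`."""
    total = 0
    for x, y in zip(seq_a, seq_b):
        total += gap if "-" in (x, y) else pair_score(x, y)
    return total
```

For instance, `score_alignment("AGS", "SGA")` adds 2 + 12 + 2, while an identical alignment of `AGS` with itself scores 8 + 12 + 8.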

• #z97619 AATCAA-TAG TTTTTTAATT GAAAACTGGA ATGAATGGTT TGACGAG-AA

• #z97620 AATCAA-TAG TTTTTTAATT GGAAACTGGG ATGAATGGTT TGACGAA-AA

• #u18065 TAATCATTAG TTTCTTAATT AGGGGCTTGA ATGAAGGGAT TGACGAGAAA

• #u18066 TAATCATTAG TTTCTTAATT AGGGGCTTGA ATGAATGGAT TGACGAGAAA

• #u18069 AATCA-TTAG TCTCTTAATT AGAGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18070 AATCA-TTAG TCTCTTAATT GGGGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18071 AATCA-TTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT T-ACGAG-AA

• #u18068 AATCAGTTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT TAACGAG-AA

• #u18073 AATCA-TTAG TTTCTTAATT AGGGGCTTGT ATGAATGGTT TGACGAG-AA

• #u18074 AATCA-TTAG TTTCTTAATT AGAGGCTTGA ATGAATGGTT TCACGAG-AA

• #u18072 AATCA-TTAG TTTCTTAATT AGAGGCTTGT ATGAATGGTT TGACGAG-AA

• #u18064 AATCA-TTAG TTTCTTAATT AGAGGCTGGA ATGAATGGTT TGACGAG-AA

• #u18067 AATCA-TTAG TTTCTTAATT AGAGGCTGGA ATGAATGGTT TGACGAG-AA

• #af514505 AATCA-TTAG TTTCTTAATT GGGGACTGGA ATGAATGGTT TGACGAG-AA

• #z97617 AATCA-TTAG TCTCTTAATT AGAGACTGGA ATGAAGGGTT TAACAAG-AA

• #z97621 AATCA-TTAG TCTTTTAATT GAAGGCTGGT ATGAATGGTT TGACGAG-GA

• #z97623 AATCA-TTAG TCTTTTAATT GAAGACTGGA ATGAATGGTT TGACGAG-GA


When aligning, how should w be selected? D = s + wg

• #z97619 TTATATAAAA TTTTATGTTT ACTTTATTTT TATAT---TT TATATATATT

• #z97620 ATAT---AAT TTTGTTTTTA CTTTTATTTT TATAT---TA AAAAAATATT

• #u18065 GATTTTATAT TATTTTAGTT TAGATTTTTA AATATAATTT TTATAATGTT

• #u18066 GATTTTATAT TATTTTAGTT TATATTTTTA AATATAATTT TTATAATGTT

• #u18069 ATTTTTATAT TATTTTGGTT T--ATTTTAA AATAAAATTT TTATAATGTT

• #u18070 ATTTTTATAT TATTTTGGTT T--ATTTTAA AATAAAATTT TTATAGTGTT

• #u18071 ATTTTTATAT TATTTTGGTT T--ATTTTTA AGTATAATTT TTATAATGTT

• #u18068 AATTTTATAT TATTTTGGTT T--ATTTTTA AATATAATTT TTACTATGTT

• #u18073 AAATTTATAT TATTTTAGTT T--ATTTTTA AGTATAAATT TTTAAATGTT

• #u18074 AGTTTTGTAT TATTTTAGCT T--ATCTTTT AATATAAGTT TTTTAATGTT

• #u18072 AATTTTTTAT TATTTTAGTT T--ATCTTTT AATATAGATT TTT-AATGTT

• #u18064 ATTTAATATT TCTTTTA--- -TTATCTTTT TATATTAAAT GT-TGATGTT

• #u18067 ATTTAATATT TTTTTTA--- -TTATCTTTT TATATTAATT GT-TGATGTT

• #af514505 AATTAATTTT TATTATATAG TTTATTTTTT AATGTTAATT TT-TATTGTT

• #z97617 -ATTTAATTT TGTTTTTTTG TAAATTTTGT TACTATTAAT TCAAAATATT

• #z97621 TGTAATGTAT TTTTGGATTG ----TTTTTT TACATGCATT A-GTTATATT

• #z97623 TTTATATTTG TATATGATAG ----TTTTGA AATATATTTT ATATTATATT

If indels were weighted 4, transversions 2, and transitions 1, the morphological character data were weighted 4. Leading and trailing gaps were weighted one-half of internal gaps. These parameters, the insertion:deletion cost (indel) and the transversion:transition ratio (Tv:Ti), were varied. In all cases where morphological data were included, character transformations for morphology were weighted as equal to the indel cost.

ATCGATATGCTT

CGA

GCT

CGA C

GCT

C TGA C

GCT

3 changes 3 differences



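What is at rest stays at rest; what is in motion stays in motion.

Figure: total transversions (Tv) and total transitions (Ts) plotted for three levels of comparison: within sibling species, among species in the same genus, and among genera.

Tv: transversion substitution

Ts: transition substitution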

JC69

Still A

Change to A

pA(0)=1

Transition

Transversion

Remain identical

Saturated effect in DNA mutation

1. GTTCTCAGAATC

2. GATCACAGAAAC

T A T C G A
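The JC69 correction for saturation can be sketched with the standard Jukes-Cantor formula d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The function names below are made up for this example.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring gapped columns."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor (JC69) distance in expected substitutions per site."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("GTTCTCAGAATC", "GATCACAGAAAC")  # the two 12-base sequences
d = jc69_distance(p)
```

The corrected distance d is larger than the observed proportion p because some sites have changed more than once, and the gap between the two widens as p approaches the saturation limit of 0.75.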

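Substitution trend of the coat protein (CP) gene:

Figure: transitions (CPTS) and transversions (CPTV) plotted against total substitutions, with linear fits y = 0.8027x (CPTS) and y = 0.1969x (CPTV).

The transition rate is about 4 times the transversion rate (0.8/0.2).

Ts: transition substitution

Tv: transversion substitution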
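Counting transitions and transversions between two aligned sequences takes only a few lines. A transition is a purine-purine (A, G) or pyrimidine-pyrimidine (C, T) change; any other change is a transversion. The function name `count_ts_tv` is chosen for this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
NUCLEOTIDES = PURINES | PYRIMIDINES


def count_ts_tv(seq_a, seq_b):
    """Return (transitions, transversions) between two aligned DNA sequences."""
    ts = tv = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y or x not in NUCLEOTIDES or y not in NUCLEOTIDES:
            continue  # identical sites, gaps and ambiguity codes are skipped
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_class:
            ts += 1
        else:
            tv += 1
    return ts, tv
```

Dividing the two counts gives an observed Ts/Tv ratio, which can be compared with the ratio of about 4 read off the fitted slopes.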

Felsenstein 81

Taxon               G%     A%     T%     C%
Palaeopteran
  Ephemerida        22.0   33.8   32.5   11.7
Orthoptroid
  Isoptera          20.1a  24.9   42.0   12.0
  Grylloblateria    21.3a
  Orthoptera-Loxo.  21.9a  30.4   35.9   11.8
  Blattaria         17.5b  33.1   39.0   10.5
  Phasimid          17.9b
Hemipteroid
  Homoptera         16.4   32.3   40.7   10.6
Holometabolous
  Diptera           15.0   35.4   40.5    9.1
  Coleoptera        16.4   35.6   39.4    8.6
  Lepidoptera       13.6   39.7   39.1    7.6
  Hymenoptera        9.6   44.5   39.3    6.7
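The Felsenstein 81 model differs from JC69 by allowing unequal base frequencies. A hedged sketch of how base composition enters the distance: it uses the equal-input formula d = -B ln(1 - p/B) with B = 1 - Σπ², which reduces to JC69 when all four frequencies are 1/4. The function names are assumptions for this example; the Hymenoptera frequencies are taken from the table.

```python
import math


def base_composition(seq):
    """Relative frequency of A, C, G and T in a sequence."""
    counts = {n: seq.count(n) for n in "ACGT"}
    total = sum(counts.values())
    return {n: counts[n] / total for n in "ACGT"}


def f81_distance(p, freqs):
    """Distance under unequal base frequencies (equal-input / F81 form).

    p     : observed proportion of differing sites
    freqs : mapping base -> frequency (any scale; normalised here)
    """
    total = sum(freqs.values())
    b = 1.0 - sum((f / total) ** 2 for f in freqs.values())
    if p >= b:
        raise ValueError("p is at or beyond the saturation level for these frequencies")
    return -b * math.log(1.0 - p / b)


equal = {"A": 25.0, "C": 25.0, "G": 25.0, "T": 25.0}
hymenoptera = {"G": 9.6, "A": 44.5, "T": 39.3, "C": 6.7}
```

For the same observed difference p, the strongly A+T-biased Hymenoptera composition gives a larger corrected distance than equal frequencies do, because fewer distinct states are effectively available and saturation is reached sooner.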

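Conclusion: RNA secondary structure constrains the variability of the 5’ UTR

• The results of both short-term (within-group) and long-term (between-group) evolution show that the RNA secondary structure is required.

• The RNA secondary structure has its own stability; an isolate must have a particular secondary structure to have a high chance of being retained (between groups).

• The secondary structures of the positive and negative strands of the 5’ UTR each have their own stability.

• The secondary structures of the positive and negative strands of the 5’ UTR have different functions; two different evolutionary forces make the promoter a hypervariable region.

Evolutionary trend of the 5’ UTR secondary structure of satBaMV associated with ma bamboo in the botanical garden

Figure labels: BSL6, IV, I-8, 6Vb, II, III, I, I-6, 6Va; DL-I, DL-II, DL-III, DL-IV, DL-6V; 1995, 1997, 1998; A1, A2, A3, A5.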

Evolutionary trend in base composition of mitochondrial 16S ribosomal DNA

The DNA sequences of the various insect groups were aligned jointly using information from the 16S rRNA secondary structure.

Gypsy moth; E. coli

Figure: content of each base (panels A, T, G, C) across the insect groups. X-axis labels: Thys, Emph, Odon, Plec, Isop, Orth, Blat, Corr, Thri, Hemi, Homo, Mega, Lepi, Cole, Dipt, Hyme.

Are the extra G residues all located in loop positions?

Figure: panels STEM-G, A, LOOP-A, and STEM-A across the same insect groups, followed by a plot of G against A.

The decline of base G across the insect groups, viewed from the standpoint of base T:

In the more primitive insects, base T pairs with G.

In the more derived insects, base T pairs with A.

Therefore the secondary structure of the RNA stems is not affected.

HKY85

C G

General reversible model

+ + + + + + + + + + + + + + + +I-1 (A) GAAAACTCACCGCAACGAAACGAAAACAATCGTTCAGAAATACTTGACCACGAGGGGTCCCCTATAGTCCGCTTTGGCGGTGCGGCAGCCCCCGTGCGATAGGCTAACTGCGGTATTCCCCGCACTCCGTCGAGCGGTTAATACGACGCTTACCAAGACGII-1 (A) .....................................................................T..........................................................................................II-9 (A) .........................................................C......CTA..T..G...................................C.............................A.....................III-1(A) ..........................T..........................................T..........................................................................................IV-1 (A) ........................................................-.T.............G..........T...A-.................................................A.....................BB21 (A) .......................................G................-.T..T..........G..........T...G-.................................................A.....................6V-1 (A) .........................................................C......CTA..T..G..........................C........C.............................A.....................6V-6 (A) .........................CA..............................C...........T..........................................................................................BSL3 (A) .........................CA..............................-T.............G.....-.................................................................................BSL6 (A) .........................CA..............................C...........T........................................A.................................................DL11 (A) T........................CA.............................C............T..................G.......................................................................DL12 (A) 
.........................CA..........................................T..................-.......................................................................DL15 (A) .........................CA.............................CC............-AG..-TT........-.G.......................................................................DL16 (A) .........................CA..........................................T..........................................................................................DL23 (A) .................G.......CA..........................................T..........................................................................................BB18 (B) .........................CA.............C...A...........CA......C.A-..-.---AAG..........G.....................T...G.............................................BB23 (B) .........................CA.G...........C...A...........CC............-TG...AA..........G.....................T...G.............................................BB25 (B) ..............-..........CA.............C...A...........CC............-.GC..AG..........GT....................T...G.............................................BB28 (B) .....................-----A.............C...................T.......A.-.G.-CAG.T..............................T...G.....................................TT......BO20 (B) .........................CA.............C...A.....T.....T.............-.GC..AG.......T-.G.....................T...G.............................................BO23 (B) .........................CAG............C...A.....T.....C.............-.GC..AG..........G.....................T...G..-..........................................BV17 (B) .........................CA.............C...A...........CC......C..-..-.GC..AG..........G.....................T...G.............................................DL17 (B) 
.........................CA..............................C...........T......AG....T..T..G.....................T...G.............................................DL19 (B) .........................CA.............C...A.........A.TC............-.G..-AG..........G.T...................T...G.............................................DL20 (B) -------------------------------------------.A...........CC............-.GC..AG..........GT........................G.............................................DL21 (B) .........................CA.............C...A.....T.....CC..........A.-.G...AG.T........G.....................T...G.............................................BSF4 (B) .........................CA.............C...A...........CC............-.GC..AG....T.....G.....................T...G.............................................BSL4 (B) .........................CA.............C...A.....T.....C..T..........-.GC..AG..........G.....................T...G.............................................USA1 (B) .........................CA.............C...A.....T.....CC............-.GC..A-.....T....G.....................T...G.............................................BSL2 (B) .........................CA.................A...........CC......C..C..-.GC..AG..........G.....................T...G.............................................BSL1 (B) .........................CA.................A...........CC..........A.-.G...AG.T........G.....................T.................................................

Secondary structure simulation by Mfold program

Figure: Mfold-predicted stem-loop structures for the individual isolates.

Group A: DL6V6, BSL6, DL12, DL11, DL16, DL23, III-1, DLI-1, DLII-1, BB21, DLIV-1, DL15, BSL3, DL6V1

Group B: DL20, BB25, BO20, BB23, BO23, BSF4, BV17, BSL1, DL21, BSL4, DL19, USA1, BB18, DL17, BB28, BSL2

RNA secondary structure simulation of satBaMV isolates: positive strand

Figure: representative positive-strand structures for Group A, Group B, and the IV-* isolates.

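RNA secondary structure constrains the variability of the 5’ UTR

The 5’ UTR, the 3’ UTR, and their negative strands are recognition regions for the replicase and usually have distinctive secondary structures.

Figure: genome diagram (5’ to 3’) showing p20 and the 3’ UTR of BaMV (Tsai et al.).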

Negative strand

Figure: predicted negative-strand structures (nucleotide positions 60 to 90). Group A variants A1 to A5 are drawn for isolates BSL6; 6V4,6,7,9; 6V1,8,10,12; I-6,8; II-5,9; I-1,2,3,4,5,7,9,10; III-5; III-1,4,6,7,8,9; II-1; DL23; DL16; II-3,6,7,8; III-2,3; BB21; and IV1~12. Group B variants B1 to B4 are drawn for isolates BSF4, BSL1, DL21, BB23, BB25, DL15, BO23, DL11, BSL4, BB18, BO20, and DL19.

Figure: negative-strand and positive-strand structures compared for Group A (BSL6), DLIV, and Group B (BSF4).

Small values of α result in an L-shaped distribution with extreme variation of rates; most sites are invariable but a few have very high rates of substitution.

Parameter α: the range of rate variation among sites

Rates are nearly equal among sites: conserved region

Rates differ greatly among sites: variable region
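The effect of α can be illustrated by drawing site rates from a gamma distribution with mean 1 (shape α, scale 1/α). This is only a simulation sketch using Python's standard `random.gammavariate`; the function names, the number of sites and the 0.1 threshold for a "nearly invariable" site are assumptions.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites sites (mean rate = 1)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


def fraction_below(rates, threshold=0.1):
    """Share of sites whose relative rate is below `threshold`."""
    return sum(r < threshold for r in rates) / len(rates)


low_alpha = sample_site_rates(0.2, 20000)   # strong rate variation (L-shaped)
high_alpha = sample_site_rates(5.0, 20000)  # rates close to equal
```

With α = 0.2 roughly half of the simulated sites are nearly invariable while a few carry very high rates; with α = 5 almost no site falls below the threshold and the rates cluster around 1.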

• This is primarily because the majority of substitutions happen at the same sites; that is , the variable positions.

• Obviously the more distantly related the sequences, the more pronounced this phenomenon becomes.

• Jin and Nei (1990) followed a similar approach, but assumed that substitution rates were Γ-distributed. Using this evolutionary model, which involves a parameter α that describes the extent of the rate variation, they derived several equations to compute the evolutionary distance from the observed sequence dissimilarities.

• Relative nucleotide substitution rates in the SRC method are estimated by observing the frequencies with which sequence pairs differ at homologous positions.

• For an alignment of n sequences, TREECON computes n(n-1)/2 pairwise evolutionary distances d according to the Jukes and Cantor equation.

• When all pairwise distances have been computed, they are classified in several distance intervals (e.g., four).

• For each distance interval, the fraction of sequence pairs possessing a different nucleotide is plotted and a curve obeying the following equation is fitted:

• This is accomplished for all alignment positions

• The probability pi that an alignment position i contains a different nucleotide in two sequences, as a function of the evolutionary distance d separating these sequences.

• The slope of the curve through the origin yields the specific nucleotide substitution rate vi for the position under consideration.

• After estimation of all vi values, alignment positions are grouped into sets of similar variability and form a spectrum of relative nucleotide substitution rates.

Inferring Molecular Phylogeny

Distance methods first convert aligned sequences into a pairwise distance matrix, then input that matrix into a tree-building method, whereas discrete methods consider each nucleotide site directly.

The parsimony tree gives us the additional information of which sites contribute to the length of each branch. Once we convert sequences into distances, we lose this information.

Clustering methods

• Tree-building methods in the second class use optimality criteria to choose among the set of all possible trees (Fig. 6.3). This criterion is used to assign to each tree a ‘score’ or rank which is a function of the relationship between tree and data (examples include maximum parsimony and maximum likelihood).

• What is the value of the optimality criterion for that tree?

• Which tree requires the fewest evolutionary events?

• While for small numbers of sequences (e.g. no more than 20) it is often possible to find the optimal tree (or trees), in many cases this is not feasible, in which case we have to rely on heuristic methods.

• A typical heuristic strategy is to start with a tree and rearrange it, keeping any rearrangement that produces a better tree. Such algorithms are often called ‘hill-climbing’.

Efficiency; Power; Consistency; Robustness; Falsifiability

Unweighted pair group method with arithmetic means (UPGMA)

• In an ultrametric tree all the tips are equidistant from the root of the tree, which is equivalent to assuming a molecular clock.

0.1715/2

0.2192/2

0.2795/2
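The UPGMA clustering steps, in which each merge is placed at half the average distance between the two clusters (as in 0.1715/2, 0.2192/2 and 0.2795/2), can be sketched as below. The function name `upgma`, the `frozenset`-keyed distance dictionary, and the four-taxon distances in the usage note are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names : list of taxon labels
    dist  : dict mapping frozenset({a, b}) -> distance between a and b
    Returns a list of (cluster_a, cluster_b, height) in merge order,
    where height is half the distance between the merged clusters.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of tips
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        # Distance to every other cluster is the size-weighted (arithmetic) mean.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((a, b, height))
    return merges
```

With hypothetical distances AB = 0.2, CD = 0.4 and 0.6 for every other pair, the merges occur at heights 0.1, 0.2 and 0.3, and every tip ends up equidistant from the root, which is the ultrametric property.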

• Distances are rarely exact tree metrics, and hence one class of ‘goodness of fit’ methods seeks the metric tree that best accounts for the ‘observed’ distances.

• The goodness of fit F between observed distances dij and tree distances pij for each pair of sequences i and j is given by.

• In the example just given we were fitting an additive tree with (2n-3) branches to C(n, 2) = n(n-1)/2 pairwise distances.

Distance methods

Minimum evolution

• Given an unrooted metric tree for n sequences there are (2n-3) branches, each with length ei. The sum of these branch lengths is the length L of the tree:

L = e1 + e2 + ... + e(2n-3)

The minimum evolution tree (ME) is the tree which minimizes L.

• More commonly, the branch lengths of the minimum evolution tree are estimated using least-squares methods. The branch lengths are estimated in the same way as for goodness of fit measures; however, rather than comparing the fit of the observed distances, the least-squares branch lengths are added together to give the length of the tree.

Neighbour joining clustering

• Neighbour joining (NJ) is a widely used method for tree building which combines computational speed with uniqueness of result: most implementations give a single tree.

• One strategy for finding the ME tree is to first compute the NJ tree, then see if any local rearrangement of the NJ tree produces a shorter tree.

Terminal node i to all other taxa

New node u to the terminal taxa i and j

0.2795/2 + (0.3959-0.4525)/2 = 0.1114

0.2795 - 0.1114 = 0.1682

● node 1

(0.2147+0.3091-0.2795)/2=0.1222
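One neighbour-joining step can be sketched as follows, using the standard NJ formulas: r_i is the summed distance from taxon i to all others, the pair minimising (n-2)·d_ij - r_i - r_j is joined, the branch from i to the new node u is d_ij/2 + (r_i - r_j)/(2(n-2)), and the distance from u to any other taxon k is (d_ik + d_jk - d_ij)/2. The function name `nj_step` and the five-taxon example distances are assumptions, not the data behind the numbers on the slide.

```python
from itertools import combinations


def nj_step(names, d):
    """Perform one neighbour-joining step (requires at least 3 taxa).

    names : list of taxon labels
    d     : dict mapping frozenset({a, b}) -> distance
    Returns ((i, j), (branch_i_to_u, branch_j_to_u), {k: distance_u_to_k}).
    """
    n = len(names)
    # Net divergence of each taxon: sum of its distances to all other taxa.
    r = {i: sum(d[frozenset((i, k))] for k in names if k != i) for i in names}
    # Join the pair with the smallest rate-corrected distance.
    i, j = min(
        combinations(names, 2),
        key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
    )
    d_ij = d[frozenset((i, j))]
    d_iu = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    d_ju = d_ij - d_iu
    d_uk = {
        k: (d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        for k in names if k not in (i, j)
    }
    return (i, j), (d_iu, d_ju), d_uk
```

The two branch-length expressions have the same shape as the worked numbers: half the pairwise distance plus half the difference of the averaged net divergences, and then the remainder of the pairwise distance for the other branch.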

Objections to distance methods

• Summarizing a set of sequences by a pairwise distance matrix loses information;

• Branch lengths estimated by some distance methods may not be evolutionarily interpretable.

Discrete methods operate directly on the sequences, rather than on pairwise distances.

• The two major discrete methods are maximum parsimony (MP) and maximum likelihood (ML).

• Maximum parsimony chooses the tree (or trees) that requires the fewest evolutionary changes.

• Maximum likelihood chooses the tree (or trees) that, of all trees, is the one most likely to have produced the observed data.

1 ATATT

2 ATCGT

3 GCAGT

4 GCCGT

The total number of evolutionary changes on a tree is simply the sum of the number of changes at each site.

1 ATATT

2 ATCGT

3 GCAGT

4 GCCGT

Phylogenetically uninformative; sites that are invariant or sites where only one sequence has a different nucleotide are examples of such sites.

This is equivalent to saying that transversions are rarer than transitions, and therefore may be more reliable indicators of phylogeny.
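Counting changes site by site on a tree can be sketched with Fitch's algorithm for unweighted parsimony, applied to the four example sequences. The tree encoding as nested tuples and the function names are assumptions for this illustration.

```python
from collections import Counter


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> nucleotide at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total number of changes: the sum of the changes at each site."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


def informative_sites(seqs):
    """Sites where at least two states each occur in two or more sequences."""
    sites = []
    for i, column in enumerate(zip(*seqs.values())):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(i)
    return sites


SEQS = {"1": "ATATT", "2": "ATCGT", "3": "GCAGT", "4": "GCCGT"}
TREES = [(("1", "2"), ("3", "4")), (("1", "3"), ("2", "4")), (("1", "4"), ("2", "3"))]
```

Of the three possible groupings of four sequences, the tree pairing 1 with 2 and 3 with 4 needs the fewest changes. The fourth site (only sequence 1 differs) and the fifth site (invariant) add the same cost to every tree, so they do not help to choose among them.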

Maximum likelihood requires three elements: a model of sequence evolution, a tree, and the observed data.

• For a given tree topology, what set of branch lengths makes the observed data most likely?

• Which tree of all the possible trees has the greatest likelihood?

The log likelihood of obtaining the observed sequences is the sum of the log likelihoods of each individual site:

ln L = ln L(1) + ln L(2) + ... + ln L(N)

The 16 possible combinations of ancestral sites for a tree for four sequences.
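The sum over the 16 ancestral combinations can be written out directly for a four-sequence tree with two internal nodes. This sketch assumes the JC69 model (equal base frequencies, one rate) and a tree ((1,2),(3,4)) with five branch lengths given in expected substitutions per site; the function names and the branch lengths used in any call are hypothetical.

```python
import math
from itertools import product


def jc69_prob(x, y, t):
    """JC69 probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips, branch_lengths):
    """Likelihood of one alignment column on the tree ((1,2),(3,4)).

    tips           : the four observed bases, in the order 1, 2, 3, 4
    branch_lengths : (t1, t2, t3, t4, t_internal)
    """
    a, b, c, d = tips
    t1, t2, t3, t4, t_int = branch_lengths
    total = 0.0
    # x is the ancestor of 1 and 2, y the ancestor of 3 and 4: 4 x 4 = 16 cases.
    for x, y in product("ACGT", repeat=2):
        total += (
            0.25  # equilibrium frequency of the state at node x
            * jc69_prob(x, a, t1) * jc69_prob(x, b, t2)
            * jc69_prob(x, y, t_int)
            * jc69_prob(y, c, t3) * jc69_prob(y, d, t4)
        )
    return total


def log_likelihood(seqs, branch_lengths):
    """Sum of the per-site log likelihoods over the whole alignment."""
    return sum(math.log(site_likelihood(col, branch_lengths)) for col in zip(*seqs))
```

Searching for the branch lengths that maximise `log_likelihood`, and repeating that search for every candidate topology, is what makes the method computationally expensive.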

• Obtaining the maximum likelihood estimate of branch lengths for a given tree is computationally time consuming, and in practice this has limited the application of the method to fairly small data sets.

• This model may include parameters for the transition/transversion ratio (TS/TV), base composition, and variation in rate among sites.

Objections to likelihood

• Which model to use, and what values of the parameters, such as transition/transversion ratio, should be employed.

• This is computationally time consuming, and more than one maximum likelihood value may exist for a given tree.

Putting confidence limits on phylogenies: bootstrap analysis

• Because we are sampling with replacement some sites may occur more than once in the pseudoreplicate, while others may not be represented at all.

• From this pseudoreplicate we would then build a tree using.

• We then repeat this two-step process a large number of times (anywhere from 100- to 1000-fold), resulting in a set of bootstrap trees.
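The resampling step of the bootstrap can be sketched as follows: alignment columns are drawn with replacement until the pseudoreplicate has as many sites as the original. The tree-building step is left out, and the function names and the default of 100 replicates are assumptions for this example.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Return one pseudoreplicate: columns sampled with replacement."""
    n_sites = len(seqs[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same sampled columns are used for every sequence, so sites stay intact.
    return ["".join(seq[c] for c in columns) for seq in seqs]


def bootstrap_replicates(seqs, n_reps=100, seed=0):
    """Generate n_reps pseudoreplicate alignments from aligned sequences."""
    rng = random.Random(seed)
    return [bootstrap_alignment(seqs, rng) for _ in range(n_reps)]
```

A tree would then be built from each pseudoreplicate with the chosen method, and the proportion of bootstrap trees containing a given group is reported as its support.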
