Stickleback Seg Dup Analysis
1. Genome
2. Parameters for Pipeline
3. Analysis
4. Files and images are at http://eichlerlab.gs.washington.edu/help/linchen/stickleback/sticklebackwgac.html
5. The data is in directory http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/
• The genome (v1.0) was downloaded from UCSC.
• Total length is 463,354,448 bp, which includes a chrUn of 62,550,211 bp.
• A total of 29,101 gene annotations from the Ensembl gene annotation were downloaded from UCSC.
Seg Dup detection pipelines
• WGAC detects Seg Dup in a genomic assembly by looking for homologous sequence pairs (>1 kb in length, >90% identity).
• WSSD detects Seg Dup in given sequences based on the depth of coverage of WGS (whole-genome shotgun) reads. A region is called when its depth of coverage > average + 3 SD.
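The WGAC criteria above can be written as a simple filter. This is a minimal sketch, not the pipeline's actual code: the `Pair` record, its field names, and the helper names are hypothetical, and identity is assumed to be stored as a fraction.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """One pairwise alignment (hypothetical record layout)."""
    chrom_a: str
    chrom_b: str
    length: int       # alignment length in bp
    identity: float   # fraction between 0 and 1


def is_wgac_pair(pair, min_len=1000, min_identity=0.90):
    """Apply the slide's thresholds: >1 kb in length and >90% identity."""
    return pair.length > min_len and pair.identity > min_identity


def classify(pair):
    """Label a pair as intra- or inter-chromosomal."""
    return "intra" if pair.chrom_a == pair.chrom_b else "inter"


# Toy input, invented for illustration only.
pairs = [
    Pair("chrI", "chrI", 2500, 0.96),
    Pair("chrI", "chrUn", 800, 0.99),   # too short
    Pair("chrII", "chrUn", 5000, 0.88), # identity too low
    Pair("chrUn", "chrIV", 1200, 0.93),
]
kept = [p for p in pairs if is_wgac_pair(p)]
labels = [classify(p) for p in kept]
```

Both comparisons are strict, matching the ">" signs on the slide.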
Parameters and Notes for WGAC pipeline
• Repeats
– Standard repeat coordinates were recovered from the soft-masked sequence data.
– Secondary repeat masking was done using two repeat libraries, ab_initio_lib.txt and supplemental_lib.txt.
– Repeat-mask results for all three libraries were combined and sorted, then used for both pipelines.
• BLAST parsing seeds in the WGAC pipeline
– The seed size is 500 bp.
Result from WGAC Pipeline
• Total pairs of SD detected (>1 kb and >90% identity): 152,272
• Inter-chromosomal pairs: 63,744
• Intra-chromosomal pairs: 88,528
• chrUn intra: 81,641
• chrUn inter and intra: 123,278
• Total NR: 40,573,574 bp
Notes:
• In general, the number of WGAC pairs is too high (10%) for a stickleback genome of only about 400 Mb.
• 92% of the intra-chromosomal WGAC pairs and 81% of all pairs have at least one sequence on chrUn. This result is expected, since chrUn contains a high percentage of redundant, poorly assembled sequences.
• Our analysis also suggests that potential repeats not covered by the repeat libraries may also be detected as WGAC pairs (see the next slide).
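The chrUn percentages in the notes follow directly from the pair counts on the result slide. A short check, using only those counts (the variable names are mine):

```python
# Pair counts copied from the WGAC result slide.
total_pairs = 152_272
inter_pairs = 63_744
intra_pairs = 88_528
chrun_intra = 81_641   # intra-chromosomal pairs on chrUn
chrun_any = 123_278    # inter + intra pairs touching chrUn


def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Inter and intra counts should add up to the total.
counts_consistent = inter_pairs + intra_pairs == total_pairs

intra_share = pct(chrun_intra, intra_pairs)   # share of intra pairs on chrUn
total_share = pct(chrun_any, total_pairs)     # share of all pairs touching chrUn

print(f"chrUn share of intra pairs: {intra_share:.1f}%")
print(f"chrUn share of all pairs:   {total_share:.1f}%")
```

Rounded to whole percentages, these give the 92% and 81% quoted above.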
Repeats?
• Since repeats might be an issue, I set up a filter to determine how many WGAC pairs may be affected. With >20 hits, 400 bp on the boundary, and hit length <10 kb, 30% of WGAC pairs are affected. With >10 hits, 400 bp boundary overlap, and hit length <10 kb, 60% of WGAC pairs are affected.
• I then generated the NR space of these hits. It totals 7,481,640 bp from 103,157 pairs, out of the total WGAC set (152,272 pairs, 40,473,574 bp). That is 2/3 of the hits but only 1/5 of the total NR space.
• I think this is very reasonable, because a high proportion of the WGAC pairs affects only a small proportion of the NR space.
• These sequence intervals should also be detected by WSSD if they are repeats.
• However, I have not yet taken them out of AllDup (the merge of WGAC and WSSD), because many of them have high-frequency hits on chrUn. At this stage we do not know whether they are redundant sequences or real seg dups, but we can pull them out at any time based on their coordinates.
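One way such a filter and the NR-space calculation could be sketched is shown below. This is an illustration under stated assumptions, not the code used for the slides: I read "400 bp on boundary" as "both ends of another hit fall within 400 bp of this hit's ends", intervals are half-open `(chrom, start, end)` tuples, and the function names are hypothetical.

```python
from collections import defaultdict


def flag_high_frequency(hits, min_hits=10, boundary=400, max_len=10_000):
    """Return indices of hits shorter than max_len whose boundaries are
    shared (within `boundary` bp) by more than `min_hits` hits."""
    by_chrom = defaultdict(list)
    for i, (chrom, start, end) in enumerate(hits):
        by_chrom[chrom].append((start, end, i))
    flagged = set()
    for items in by_chrom.values():
        for start, end, i in items:
            if end - start >= max_len:
                continue
            # Count hits (including this one) with both ends close by.
            n = sum(1 for s, e, _ in items
                    if abs(s - start) <= boundary and abs(e - end) <= boundary)
            if n > min_hits:
                flagged.add(i)
    return flagged


def nr_space(intervals):
    """Non-redundant bp covered by the union of the intervals."""
    total = 0
    cur_chrom, cur_start, cur_end = None, 0, 0
    for chrom, start, end in sorted(intervals):
        if chrom != cur_chrom or start > cur_end:
            total += cur_end - cur_start          # close the previous block
            cur_chrom, cur_start, cur_end = chrom, start, end
        else:
            cur_end = max(cur_end, end)           # extend the current block
    return total + (cur_end - cur_start)


# Toy data: 12 near-identical hits on chrUn plus one long hit on chrI.
hits = [("chrUn", 1000 + 10 * i, 3000 + 10 * i) for i in range(12)]
hits.append(("chrI", 0, 50_000))
flagged = flag_high_frequency(hits)
flagged_nr = nr_space([hits[i] for i in flagged])
all_nr = nr_space(hits)
```

In the toy data the twelve stacked hits collapse to a single short NR block, which mirrors the observation that many pairs can account for only a small share of the NR space.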
General analysis of WGAC length and identity distribution
[Figure: length distribution. x-axis: length (1 kb to 10 kb in 1 kb steps, then 20 kb to 50 kb); y-axis: total (bp), 0 to 120,000,000; series: inter, intra]
[Figure: identity distribution. x-axis: identity; y-axis: total (bp), 0 to 160,000,000; series: inter, intra]
1. Length distribution peaked at <3 kb, intra > inter, with 92% of intra pairs on chrUn.
2. Identity distribution peaked at 96%. Few pairs are higher than 99%.
General analysis: NR distribution on chromosomes. High SD in chrUn
[Figure: NR length on chromosome. x-axis: chromosome (chrI to chrXXI, plus chrUn); y-axis: total (bp), 0 to 25,000,000; series: inter, intra, both]
[Figure: Percentage of Dup NR relative to chromosome. x-axis: chromosome (chrI to chrXXI, plus chrUn); y-axis: percent, 0.00% to 35.00%; series: inter, intra, both]
General view showing all WGAC on all chromosomes
Concentration of SD on smaller supercontigs on chrUn
Global image showing the inter- and intra-chromosomal pairs of at least 5 kb and above 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.
Global image showing the inter- and intra-chromosomal pairs of 10 kb and 90% identity, without chrUn. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.
Global image showing the inter- and intra-chromosomal WGAC pairs of 10 kb and 90% identity, with chrUn included. Red indicates inter-chromosomal pairs and blue indicates intra-chromosomal pairs.
chrUn
WSSD analysis
• Downloaded the WGS reads, about 6 million.
• Downloaded the stickleback finished BACs. These BACs are used to determine the threshold for WGS depth of coverage. For a 5 kb window, the average number of reads is 78, with an SD of 27. The threshold is 125 for a 5 kb window and 25 for a 1 kb window (average + 3 SD).
• Repeat-masked the stickleback genome. I used the standard library, ab_initio_lib.txt, and supplemental_lib.txt. In addition, I added the potential repeats detected in the WGAC process, that is, regions with more than 20 hit pairs to the same region.
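The "average + 3 SD" rule can be sketched as follows. This is a minimal illustration, assuming per-window read counts from the control BACs are available as a plain list; the function names and toy counts are hypothetical, and the sample standard deviation is my choice. Taken literally, the rule with the quoted 5 kb figures (78 and 27) would give 159, so the quoted thresholds of 125 and 25 may have been derived from different summary values.

```python
from statistics import mean, stdev


def depth_threshold(control_counts, n_sd=3.0):
    """Threshold = mean + n_sd * SD of read counts in control windows."""
    return mean(control_counts) + n_sd * stdev(control_counts)


def call_windows(counts, threshold):
    """Indices of windows whose read count exceeds the threshold."""
    return [i for i, c in enumerate(counts) if c > threshold]


# Toy control windows (invented numbers, not the BAC data).
control = [70, 80, 90, 75, 85]
threshold = depth_threshold(control)

# Toy genome windows: only counts above the threshold are called.
genome_windows = [60, 104, 103, 200]
called = call_windows(genome_windows, threshold)
```

The same two functions apply to 1 kb and 5 kb windows; only the control counts change.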
total 463354448 40417128 18278097 22324144 10811957 7466140 11512187 2873518 13685475
Summary
• Stickleback Seg Dup has been detected using two independent pipelines, WGAC and WSSD. Since each pipeline is based on its own mechanism, we expect the majority of the intervals to be consistent, with some variation. From the results of the two pipelines, two sets of genomic intervals were generated for Seg Dup.
– The first set consists of the genomic intervals detected by both WGAC and WSSD, that is, the intersection of the WGAC and WSSD intervals. This set represents the most conservative estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/wssd_wgac_intersect
– The second set is the union of the WGAC and WSSD intervals (AllDup.tab), which represents the largest estimate of Seg Dups in the genome. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/allDup.tab
• A list of genes intersecting each set was also generated.
– With AllDup (the union of WGAC and WSSD), there are 3153 genes in total.
– With the WGAC and WSSD intersect, there are 1267 genes in total. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/gene_in_wssd_wgac_intersect
• A list of intervals with the potential to be repeats was also generated. These are regions with a high frequency of hits within a defined boundary (>10 hits, <400 bp at the boundary, <10 kb in length). They account for >60% of total WGAC pairs and 1/5 of the WGAC NR intervals. http://eichlerlab.gs.washington.edu/help/linchen/stickleback/data/repeathitMerge
• ChrUn contigs contribute a great deal to the total SD in both WGAC and WSSD. The identity distribution analysis shows that the identity of most pairs is less than 99%, suggesting they may contain true SDs that are hard to assemble. How many of them are true SDs remains to be determined.
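The intersect and union sets, and the gene lists derived from them, can be illustrated with plain interval arithmetic. This is a sketch only: it assumes half-open `(chrom, start, end)` tuples and genes as `(name, chrom, start, end)`, all function names are hypothetical, and it does not read the actual files linked above.

```python
def merge(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def union(a, b):
    """Largest estimate: everything covered by either set."""
    return merge(list(a) + list(b))


def intersect(a, b):
    """Most conservative estimate: only bases covered by both sets."""
    out = []
    for ca, sa, ea in merge(a):
        for cb, sb, eb in merge(b):
            if ca == cb and max(sa, sb) < min(ea, eb):
                out.append((ca, max(sa, sb), min(ea, eb)))
    return sorted(out)


def genes_hit(genes, intervals):
    """Names of genes overlapping at least one interval."""
    return sorted({name for name, chrom, gs, ge in genes
                   for c, s, e in intervals
                   if c == chrom and gs < e and s < ge})


# Toy intervals and genes, invented for illustration.
wgac = [("chrI", 100, 500), ("chrI", 400, 900), ("chrUn", 0, 1000)]
wssd = [("chrI", 300, 600), ("chrII", 0, 100)]
genes = [("g1", "chrI", 350, 450), ("g2", "chrII", 10, 50), ("g3", "chrIII", 0, 10)]

all_dup = union(wgac, wssd)
both = intersect(wgac, wssd)
genes_all_dup = genes_hit(genes, all_dup)
genes_both = genes_hit(genes, both)
```

The nested loop in `intersect` is quadratic and fine for a toy example; genome-scale interval sets would normally use a sorted sweep or a dedicated interval tool.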