Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors Population Structure Analysis using STRUCTURE software Chang Bum Hong kt Bioinformatics TF, [email protected], twitter @hongiiv, hongiiv.tistory.com Friday, August 12, 11
Population Structure Analysis using STRUCTURE software
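Normally, when you drink alcohol it is converted into acetaldehyde (a toxic substance that flushes the face, makes the heart pound, and causes vomiting), which ALDH then breaks down into acetic acid, which is harmless to the body. The ALDH2 gene is responsible for breaking down acetaldehyde as soon as any is produced, and people fall into three types depending on their ALDH2 genotype.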
23andMe
Northwestern Europe
Southeastern Europe
HGDP(Human Genome Diversity Project)
PASNP(Pan-Asian SNP Consortium)
Dataset            SNPs        Individuals  Populations
PASNP              54,794      1,928        75
HGDP               2,834~      1,056        52
HapMap             1,481,135   1,397        11
SGVP               268,667     292          3
Korean             58,625      159          10
China (Yanbian)    58,625      16           1
Japan (Kobe)       58,625      5            1
Korea-Japan        58,625      6            1
Vietnam            58,625      16           1
Korean-Vietnam     58,625      8            1
Cambodia           58,625      16           1
Mongol             58,625      16           1
East Asia - Public genotype data
a. Pan-Asian SNP Consortium(http://www4a.biotec.or.th/PASNP)
b. Singapore Genome Variation Project(http://www.nus-cme.org.sg/SGVP)
• The input is a matrix in which individuals are in rows and loci are in columns
• For an n-ploid species, n consecutive rows hold the data for each individual
• Integers should be used to code genotypes
• Missing data should be indicated by a number that does not occur elsewhere in the data (e.g., -1)
• The data file should be a plain text file (.txt), not an Excel file (.xls), for running STRUCTURE
Input data
Input format
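Information on user-defined populations
Label: a unique ID for each individual; any number or string is fine (e.g., CEPH1334.10).
PopID: a unique number for the population the individual belongs to, assigned by you (e.g., 5 for Chinese (CHB), 1 for European (CEU)).
Flag: whether STRUCTURE should use this individual's PopID information when it runs (1 = use, 0 = do not use).
Location: the individual's location information, assigned by you (e.g., 1 for East Asia (EAS), 2 for Europe (EURA)).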
Genotype coding (1, 2, 5):
AA = 11
AB = 12
BB = 22
missing = 55
MarkerName ...
Label PopID Flag Location Genotype ...
Each individual's alleles are stored in 1 row
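The layout above can be sketched in Python. This is only an illustration under stated assumptions: one row per individual, genotype calls supplied as two-letter strings ("AA", "AB", "BB"), "NN" as a made-up encoding for a missing call, and alleles written as separate whitespace-delimited integers. The function name, marker names, and the second sample are hypothetical.

```python
# Allele coding from the slide: A -> 1, B -> 2, missing -> 5.
ALLELE_CODE = {"A": "1", "B": "2", "N": "5"}  # "N" for a missing allele is an assumption

def format_structure_rows(markers, individuals):
    """Return STRUCTURE-style text: a marker header, then one row per individual.

    individuals: list of (label, pop_id, flag, location, genotypes) tuples,
    where genotypes is a list of two-letter calls such as "AA", "AB", "NN".
    """
    lines = [" ".join(markers)]
    for label, pop_id, flag, location, genotypes in individuals:
        if len(genotypes) != len(markers):
            raise ValueError(f"{label}: expected {len(markers)} genotypes")
        alleles = []
        for call in genotypes:
            # Each call contributes two allele codes, e.g. "AB" -> "1", "2".
            alleles.extend(ALLELE_CODE[a] for a in call)
        lines.append(" ".join([label, str(pop_id), str(flag), str(location)] + alleles))
    return "\n".join(lines) + "\n"

markers = ["rs0001", "rs0002", "rs0003"]  # hypothetical marker names
individuals = [
    ("CEPH1334.10", 1, 1, 2, ["AA", "AB", "BB"]),
    ("SAMPLE_CHB_01", 5, 1, 1, ["AB", "NN", "AA"]),  # hypothetical individual
]
text = format_structure_rows(markers, individuals)
print(text, end="")
```

Writing `text` to a .txt file would give something that can be imported into a project, provided the project settings (number of loci, missing-data value, extra columns) match what the file actually contains.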
Input format (cont.)
Running STRUCTURE from a graphical interface, Front End
Importing input data into a project
Importing input data into a project (cont.)
Configuring a parameter set
Configuring a parameter set (cont.)
Length of Burnin Period: how long to run the simulation before collecting data, to minimize the effect of the starting configuration (that is, the number of iterations allowed for the chain to converge)
Number of MCMC Reps after Burnin: how long to run the simulation after burn-in to get accurate parameter estimates
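To see why a burn-in period matters, here is a toy Metropolis sampler in Python. It is not STRUCTURE's model: the target is simply a standard normal distribution, and the function name, step size, and starting value are assumptions chosen for illustration. The chain is started far from the target on purpose, and the sample mean is compared with and without discarding the early iterations.

```python
import math
import random

def metropolis_chain(burnin, numreps, start, seed=0):
    """Toy Metropolis sampler targeting a standard normal (not STRUCTURE's model).

    Runs burnin + numreps iterations and returns only the post-burn-in samples.
    """
    rng = random.Random(seed)
    x = start
    kept = []
    for step in range(burnin + numreps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log acceptance ratio for a standard normal target.
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if step >= burnin:
            kept.append(x)
    return kept

# Deliberately bad starting point, far from the target distribution.
no_burnin = metropolis_chain(burnin=0, numreps=200, start=50.0)
with_burnin = metropolis_chain(burnin=2000, numreps=20000, start=50.0)

mean_no_burnin = sum(no_burnin) / len(no_burnin)
mean_with_burnin = sum(with_burnin) / len(with_burnin)
print(f"mean without burn-in: {mean_no_burnin:.2f}")
print(f"mean with burn-in:    {mean_with_burnin:.2f}")
```

The short run without burn-in is dominated by the walk from the starting point toward the target, so its mean is far from 0; the longer run with burn-in gives a mean close to 0. The same logic motivates choosing the burn-in length and the number of MCMC reps generously in STRUCTURE.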
Running STRUCTURE: a single run
Running STRUCTURE: a single run (cont.)
Running STRUCTURE: a batch run
Running STRUCTURE: a batch run (cont.)
Ln P(D): estimated log probability of the data for each K
• For very large data sets, the runtime of structure with default settings may become impractically long
• use a reduced data set (e.g., pruned markers)
• it may be possible to get accurate results using much shorter runs than the default (e.g., small values of NUMREPS)
• download the source code and compile it on your own machine (e.g., a 64-bit machine)
• use the command-line version of structure
Analysis of genome-wide SNP data
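One simple way to reduce a data set is to thin the markers. The sketch below keeps every n-th locus from a genotype matrix; it is plain thinning, not LD-based pruning, and the function name, marker names, and genotype values are hypothetical.

```python
def thin_markers(marker_names, genotype_rows, keep_every):
    """Keep every keep_every-th marker column (simple thinning, not LD-based pruning)."""
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    kept_idx = list(range(0, len(marker_names), keep_every))
    thinned_names = [marker_names[i] for i in kept_idx]
    thinned_rows = [[row[i] for i in kept_idx] for row in genotype_rows]
    return thinned_names, thinned_rows

# Hypothetical data: 10 markers, 2 individuals, genotypes coded as on the input-format slide.
names = [f"snp{i}" for i in range(10)]
rows = [
    [11, 12, 22, 11, 55, 12, 22, 11, 12, 22],
    [12, 12, 11, 22, 11, 55, 12, 22, 11, 11],
]
thinned_names, thinned_rows = thin_markers(names, rows, keep_every=3)
print(thinned_names)
print(thinned_rows)
```

With `keep_every=3` the ten markers are reduced to four, which shortens each MCMC iteration roughly in proportion. Whether the thinned set still captures the population structure has to be checked on the real data.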
An example of MCMC convergence
Inference of the true K (number of populations)
• The log likelihood for each K: Ln P(D) = L(K)
• Two approaches to determine the best K
• Use of L(K): as K approaches the true value, L(K) plateaus and its variance between runs increases
• Use of an ad hoc quantity (∆K): calculated from the second-order rate of change of the likelihood with respect to K. ∆K shows a clear peak at the true value of K
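The ∆K idea can be sketched in a few lines of Python. This assumes one common formulation: ∆K = |L(K+1) − 2 L(K) + L(K−1)| / sd(L(K)), with the second difference taken on the mean Ln P(D) across replicate runs and sd the sample standard deviation across runs at K. The function name and the Ln P(D) values are made up for illustration.

```python
from statistics import mean, stdev

def delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps K to the list of Ln P(D) values from replicate runs at that K.
    """
    avg = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in avg and k + 1 in avg:
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            spread = stdev(lnp_by_k[k])  # sample standard deviation across runs
            if spread > 0:  # skip K whose replicates are identical (avoids division by zero)
                result[k] = second_diff / spread
    return result

# Hypothetical Ln P(D) values from three replicate runs at each K.
lnp = {
    1: [-1000, -1002, -998],
    2: [-800, -801, -799],
    3: [-790, -795, -785],
    4: [-788, -780, -796],
}
dk = delta_k(lnp)
best_k = max(dk, key=dk.get)
print(dk)
print("K with the largest delta K:", best_k)
```

Note that ∆K is undefined for the smallest and largest K tested, so the range of K in a batch run should extend at least one step on either side of the values of interest.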
Q-matrix: the estimated membership of each individual in each subpopulation
Simulation Result
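A Q-matrix can be summarized by assigning each individual to the cluster with the highest membership coefficient. The sketch below is one possible way to do that; the 0.8 cutoff for calling an individual "admixed", the function name, and the example values are assumptions, not something taken from the slides.

```python
def assign_clusters(q_matrix, threshold=0.8):
    """Label each individual with its top cluster (1-based), or "admixed" if below threshold."""
    labels = []
    for row in q_matrix:
        if abs(sum(row) - 1.0) > 1e-6:
            raise ValueError("each row of Q should sum to 1")
        best = max(range(len(row)), key=lambda k: row[k])
        labels.append(best + 1 if row[best] >= threshold else "admixed")
    return labels

# Hypothetical Q-matrix for 3 individuals and K = 3 clusters.
q = [
    [0.97, 0.02, 0.01],
    [0.05, 0.90, 0.05],
    [0.45, 0.50, 0.05],
]
labels = assign_clusters(q)
print(labels)
```

Here the first two individuals are assigned to clusters 1 and 2, while the third, with membership split almost evenly between two clusters, is flagged as admixed.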
Simulation Result (cont.)
Enjoy running STRUCTURE
We may not always be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data