Stata commands for moving data between PHASE and HaploView Stata Conference DC ‘09 July 30-31, 2009 John Charles “Chuck” Huber Jr, PhD Assistant Professor of Biostatistics Department of Epidemiology and Biostatistics School of Rural Public Health Texas A&M Health Science Center [email protected]


Stata commands for moving data between PHASE and HaploView

Stata Conference DC ‘09
July 30-31, 2009

John Charles “Chuck” Huber Jr, PhD
Assistant Professor of Biostatistics

Department of Epidemiology and Biostatistics
School of Rural Public Health
Texas A&M Health Science Center
[email protected]


Motivation

Many rapidly growing areas of research utilize multiple specialty “boutique” computer programs to conduct highly specialized analyses.

The Stata user is faced with two choices:
1. Write new Stata commands that do the same analyses
2. Write Stata commands that efficiently export and import data for these “boutique” programs


Stata for Genetic Data Analysis

Outline

1. Genetic Data Analysis using Stata
2. Genetics Background
3. The “file” commands in Stata
4. The phasein and phaseout commands
5. The HaploView program
6. The haploviewout command
7. Summary


Stata for Genetic Data Analysis

2007 UK Stata Users Group meeting: http://www.stata.com/meeting/13uk/

A brief introduction to genetic epidemiology using Stata
Neil Shephard, University of Sheffield

An overview of using Stata to perform candidate gene association analysis will be presented. Areas covered will include data manipulation, Hardy–Weinberg equilibrium, calculating and plotting linkage disequilibrium, estimating haplotypes, and interfacing with external programs.


User Written Genetics Commands

Programs written by David Clayton
• ginsheet - Read genotype data from text files.
• gloci - Make a list of loci.
• greshape - Reshape a file containing genotypes to a file of alleles.
• gtab - Tabulate allele frequencies within genotypes and generate indicators (performs Hardy-Weinberg Equilibrium testing).
• gtype - Create a single genotype variable from two allele variables.
• htype - Create a haplotype variable from allele variables.
• mltdt - Multiple locus TDT for haplotype tagging SNPs (htSNPs).
• origin - Analysis of parental origin effect in TDT trios.
• pseudocc - Create a pseudo-case-control study from case-parent trios.
• pscc - Experimental version of pseudocc in which there may be several groups of linked loci.
• pwld - Pairwise linkage disequilibrium measures.
• rclogit - Conditional logistic regression with robust standard errors.
• snp2hap - Infer haplotypes of 2-locus SNP markers.
• tdt - Classical TDT test.
• trios - Tabulate genotypes of parent-offspring trios.


User Written Genetics Commands

Programs written by Adrian Mander
• gipf - Graphical representation of log-linear models.
• hapipf - Haplotype frequency estimation using an EM algorithm and log-linear modelling.
• pedread - Reads a pedigree data file (in pre-Makeped LINKAGE format), similar to ginsheet.
• pedsumm - Summarises a pre-Makeped LINKAGE file that is currently in Stata's memory.
• pedraw - Draws one pedigree in the graphics window.
• plotmatrix - Produces LD heatmaps displaying graphically the strength of LD between markers.
• profhap - Calculates profile likelihood confidence intervals for results from hapipf.
• swblock - A step-wise hapipf routine to identify the parsimonious model to describe the haplotype block pattern.
• qhapipf - Analysis of quantitative traits using regression and log-linear modelling when phase is unknown.
• hapblock - Attempts to find the edge of areas containing high LD within a set of loci.


User Written Genetics Commands

Programs written by Mario Cleves
• gencc - Genetic case-control tests.
• genhw - Hardy-Weinberg Equilibrium tests.
• qtlsnp - A program for testing associations between SNPs and a quantitative trait.

Programs written by Catherine Saunders
• co_power - Power calculations for Case-only study designs.
• gei_matching -
• geipower - Power calculations for Gene-Environment interactions.
• ggipower - Power calculations for Gene-Gene interactions.
• tdt_geipower - Power calculations for Gene-Environment interactions via TDT analysis.
• tdt_ggipower - Power calculations for Gene-Gene interactions via TDT analysis.

Programs written by Neil Shephard
• genass - Performs a number of statistical tests on your genotypic data and collates the results into a Stata formatted data set for browsing.


The Post-Genome Era

February 16, 2001

February 15, 2001


Scientific Method: Observe

Hartl & Jones (1998) pg 18, Figure 1.13


Scientific Method: Predict

Watson et al. (2004) pg 29, Box 2-2


Scientific Method: Manipulate


The Structure of DNA

Hartl & Jones (1998) pg 9, Figure 1.5


The Structure of DNA

Watson et al. (2004) pg 23, Figure 2.5


What is a SNP?

• A SNP is a single nucleotide polymorphism (the individual nucleotides are called alleles)

Person 1 – Chromosome 1  ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2  ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1  ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2  ataagtcgatactgatgcatagctagctgactgaagcgat
                                ^                          ^
                                SNP1                       SNP2
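As a small illustration of the idea, the two chromosomes of Person 1 can be compared position by position to find where they differ. This is only a sketch in plain Stata macro code; the macro names (chr1, chr2, sites) are made up for this example and are not part of any command described in the talk.

```stata
* Person 1's two chromosomes, copied from the slide
local chr1 "ataagtcgatactgatgcatagctagctgactgacgcgat"
local chr2 "ataagtccatactgatgcatagctagctgactgaagcgat"

* Walk along the sequence and record every position where the alleles differ
local len = strlen("`chr1'")
local sites ""
forvalues pos = 1/`len' {
    if substr("`chr1'", `pos', 1) != substr("`chr2'", `pos', 1) {
        local sites "`sites' `pos'"
    }
}
display "Polymorphic positions:`sites'"
```

For these two sequences the loop reports positions 8 and 35, which correspond to SNP1 and SNP2.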


Allelic Association

• Simple 2x2 table
• One table per SNP
• Compute a simple chi-squared statistic or odds ratio for each SNP

           SNP1 Allele
           g      c
Case       250    750
Control    650    350
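One way to run this test in Stata without building a dataset is with the immediate commands tabi and cci. The sketch below simply types in the counts from the table above; treating allele g as the "exposure" in cci is an arbitrary choice made for illustration.

```stata
* Rows: Case, Control; columns: allele g, allele c (counts from the slide)
tabi 250 750 \ 650 350, chi2
display "Pearson chi2 = " r(chi2) ", p = " r(p)

* Odds ratio for allele g versus allele c
* (argument order: cases g, cases c, controls g, controls c)
cci 250 750 650 350
display "Odds ratio = " r(or)
```

Note that counting alleles rather than people treats the two alleles carried by each person as independent observations.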


Genotypic Association

• Compute chi-squared tests
• Allows testing of various disease models (dominant, recessive, additive)

           SNP1 Genotype
           gg     gc     cc
Case       100    250    150
Control    300    150    50
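The disease models can be sketched by collapsing the genotype columns before testing. The example below uses tabi with the counts from the table; which allele is the risk allele is not stated on the slide, so treating c as the risk allele is an assumption made only for illustration, and the scalar names are hypothetical.

```stata
* General genotypic test: gg, gc, cc (2 degrees of freedom)
tabi 100 250 150 \ 300 150 50, chi2
scalar chi2_geno = r(chi2)

* Dominant model for c (assumed risk allele): gg versus gc + cc
tabi 100 400 \ 300 200, chi2
scalar chi2_dom = r(chi2)

* Recessive model for c: gg + gc versus cc
tabi 350 150 \ 450 50, chi2
scalar chi2_rec = r(chi2)

display "genotypic = " chi2_geno ", dominant = " chi2_dom ", recessive = " chi2_rec
```

An additive model is usually tested with a trend test instead of a collapsed table; that is left out of this sketch.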


What is a Haplotype?

• A haplotype is the combination of one or more alleles found on the same chromosome
  – Person 1 has a “gc” haplotype and a “ca” haplotype
  – Person 2 has a “cc” haplotype and a “ga” haplotype

Person 1 – Chromosome 1  ataagtcgatactgatgcatagctagctgactgacgcgat
Person 1 – Chromosome 2  ataagtccatactgatgcatagctagctgactgaagcgat
Person 2 – Chromosome 1  ataagtccatactgatgcatagctagctgactgacgcgat
Person 2 – Chromosome 2  ataagtcgatactgatgcatagctagctgactgaagcgat
                                ^                          ^
                                SNP1                       SNP2


Haplotypic Association

• Compute chi-squared tests
• Two SNPs with alleles a/g and c/t respectively

           SNP1:SNP2 Haplotype
           a:c    a:t    g:c    g:t
Case       100    250    75     75
Control    300    100    50     50
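Once haplotypes are stored in a Stata dataset, the same test is an ordinary two-way tabulation. The sketch below enters the table above as frequency-weighted rows; the variable names (status, haplotype, n) are hypothetical and chosen only for this example.

```stata
* One row per cell of the slide's table, with the count as a frequency weight
clear
input str7 status str3 haplotype n
"Case"    "a:c" 100
"Case"    "a:t" 250
"Case"    "g:c"  75
"Case"    "g:t"  75
"Control" "a:c" 300
"Control" "a:t" 100
"Control" "g:c"  50
"Control" "g:t"  50
end

* Chi-squared test of haplotype by case status (3 degrees of freedom)
tabulate status haplotype [fweight = n], chi2
```

With one row per observed chromosome instead of one row per cell, the fweight qualifier would simply be dropped.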


Why are haplotypes important?

http://www.theboatrace.org/gallery/2009?page=7#

2009 Oxford and Cambridge Boat Race


Why are haplotypes important?

Chromosome R

Chromosome D

President VP State Defense Treasury

SNP1 SNP2 SNP3 SNP4 SNP5


Rearranging the members of each “chromosome” could have a profound effect!


Why are haplotypes important?

Hartl & Jones (1998) pg 18, Figure 1.13



Why are haplotypes important?

Watson et al. (2004) pg 29, Box 2-2


The PHASE Program

• Unfortunately, haplotypes are not observed directly using modern, high-throughput lab techniques

• We observe genotypes and must infer the haplotype structure using algorithms

• PHASE is a very popular program for inferring haplotypes from many SNPs simultaneously (Stephens, Smith & Donnelly, 2001)


The phaseout Command

Raw Genotype Data in Stata


The phaseout Command

Input file format for PHASE


The phaseout Command

I need to get my data from here:

to here:


The “file” commands in Stata

Using “file open”, “file write” and “file close”

file open Example1 using "ExampleFile.txt", write replace

file write Example1 "Hello World" _newline(1)

file write Example1 "Why so blue?" _newline(1)

file close Example1
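The same three commands can write a structured file from data in memory by looping over observations. The sketch below is not the phaseout source code. It assumes the PHASE input layout described in the PHASE documentation (number of individuals, number of loci, a positions line starting with P, a locus-type line, then an ID line and two allele lines per person), and the toy data, variable names, and file name toy.inp are made up for illustration.

```stata
* Hypothetical toy data: one row per person, two allele variables per SNP
clear
input str4 id str1 s1a str1 s1b str1 s2a str1 s2b
"P001" "g" "c" "c" "a"
"P002" "c" "g" "c" "a"
end

tempname fh
local n = _N
file open `fh' using "toy.inp", write text replace
file write `fh' "`n'" _newline          // number of individuals
file write `fh' "2" _newline            // number of loci
file write `fh' "P 674 836" _newline    // marker positions (illustrative)
file write `fh' "SS" _newline           // locus types: S = biallelic SNP
forvalues i = 1/`n' {
    local pid  = id[`i']
    local row1 = s1a[`i'] + " " + s2a[`i']
    local row2 = s1b[`i'] + " " + s2b[`i']
    file write `fh' "`pid'" _newline "`row1'" _newline "`row2'" _newline
}
file close `fh'
type "toy.inp"
```

Using tempname for the file handle avoids clashing with a handle that is already open, which matters once the code is wrapped in a program.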


The “file” commands in Stata

Using “file open”, “file read” and “file close”

. file open Example2 using "ExampleFile.txt", read

. file read Example2 Line1

. file read Example2 Line2

. file close Example2

. disp "Line1: `Line1'"

Line1: Hello World

. disp "Line2: `Line2'"

Line2: Why so blue?
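When the number of lines is not known in advance, file read can be placed in a loop that stops at end of file. The sketch below relies on r(eof), which file read sets to 1 once there is nothing left to read; the handle and macro names are arbitrary, and the file is rewritten first so that the example is self-contained.

```stata
* Write a small file, then read it back line by line until end of file
tempname fout fin
file open `fout' using "ExampleFile.txt", write text replace
file write `fout' "Hello World" _newline
file write `fout' "Why so blue?" _newline
file close `fout'

local nlines = 0
file open `fin' using "ExampleFile.txt", read text
file read `fin' line
while r(eof) == 0 {
    local ++nlines
    display "Line `nlines': `line'"
    file read `fin' line
}
file close `fin'
display "Read `nlines' lines"
```

An import command for a program's output file would typically use this pattern, parsing each line inside the loop.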


The phaseout Command

Syntax for phaseout

phaseout SNPlist , idvariable(string) filename(string) [missing(string) separator(string) positions(string)]

Example

local SNPList "rs1413711 rs3024987 rs3024989"

local PositionsList "674 836 1955"

phaseout `SNPList' , idvariable("id") filename("VEGF.inp") missing("X/X 9/9") positions(`PositionsList') separator("/")



The phasein Command

Syntax for phasein

phasein PhaseOutputFile [, markers(string) positions(string)]

Example

phasein VEGF.out, markers("MarkerList.txt") positions("PositionList.txt")


The phasein Command

Output file format from PHASE



The HaploView Program

• Once we have inferred our haplotypes, we can conduct further association analyses using the full complement of Stata commands.

• We might also want to explore our data in the popular program HaploView (Barrett et al., 2005)


The haploviewout Command

Syntax for haploviewout

haploviewout SNPlist , idvariable(string) filename(string) [positions(string)] [familyid(string)] [poslabel]

Example

local MarkerList "rs1413711 rs3024987 rs3024989"

haploviewout `MarkerList', idvariable(id) filename("VEGF") poslabel
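For readers unfamiliar with what such an export involves, here is a rough sketch of writing the pair of files that HaploView reads in its linkage format: a pedigree file with one row per person and a marker information file with one row per SNP. This is not the haploviewout source code. The file names, toy data, and the placeholder values for parents, sex, and affection status are all assumptions made for illustration.

```stata
* Hypothetical toy data: alleles coded A/C/G/T, one row per person
clear
input str4 id str1 s1a str1 s1b str1 s2a str1 s2b
"P001" "G" "C" "C" "A"
"P002" "C" "G" "C" "A"
end

tempname ped info
file open `ped' using "toy.ped", write text replace
forvalues i = 1/`=_N' {
    local pid = id[`i']
    local g1  = s1a[`i'] + " " + s1b[`i']
    local g2  = s2a[`i'] + " " + s2b[`i']
    * Columns: family, individual, father, mother, sex, affection status, genotypes
    * (father/mother 0, sex 1, and status 0 are placeholders)
    file write `ped' "`pid' `pid' 0 0 1 0 `g1' `g2'" _newline
}
file close `ped'

* Marker information file: marker name and position, one marker per line
file open `info' using "toy.info", write text replace
file write `info' "rs1413711 674" _newline
file write `info' "rs3024987 836" _newline
file close `info'
type "toy.ped"
```

Unrelated individuals are written here as one-person families, which is why the family and individual columns repeat the same ID.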


Summary

Compared to recreating “boutique” programs in Stata, it is relatively easy to create programs for exporting and importing data.


Acknowledgements

• Grant 1-R01DK073618-02 from the National Institute of Diabetes and Digestive and Kidney Diseases

• Grant 2006-35205-16715 from the United States Department of Agriculture.

• Drs. Loren Skow, Krista Fritz, Candice Brinkmeyer-Langford of the Texas A&M College of Veterinary Medicine

• Roger Newson of Imperial College London


References

• Barrett, J., Fry, B., Maller, J., & Daly, M. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21, 263-265.

• Hartl, D.L., Jones, E.W. (1998) Genetics: Principles and Analysis, 4th Ed. Jones & Bartlett Publishers

• Stephens, M., & Donnelly, P. (2003). A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data. American Journal of Human Genetics, 73, 1162–1169.

• Stephens, M., Smith, N. J., & Donnelly, P. (2001). A New Statistical Method for Haplotype Reconstruction from Population Data. American Journal of Human Genetics, 68, 978–989.

• Watson, J.D., Baker, T.A., Bell, S.P., Gann, A., Levine, M., Losick, R. (2004) Molecular Biology of the Gene, 5th Ed. Benjamin Cummings