Categorical Data Frühling Rijsdijk & Kate Morley Twin Workshop, Boulder Tuesday March 7, 2006


Jan 19, 2018

Page 1: Frhling Rijsdijk  Kate Morley

Categorical Data
Frühling Rijsdijk & Kate Morley

Twin Workshop, Boulder Tuesday March 7, 2006

Page 2: Frhling Rijsdijk  Kate Morley

Aims
• Introduce categorical data
• Define liability and describe assumptions of the liability model
• Show how heritability of liability can be estimated from categorical twin data
• Practical exercises

Page 3: Frhling Rijsdijk  Kate Morley

Measurement Scales of Outcome Variables

Qualitative, Categorical, Discrete
• Dichotomous, Binary (2 categories): only two possible outcomes, e.g. yes / no response on an item; depression / no depression.
• Polychotomous (>2 categories)
• Nominal (no ranking): e.g. gender: 1 = male, 2 = female; marital status: 1 = married, 2 = divorced, 3 = separated.
• Ordinal (ranking): e.g. number of children in a family, income level.

Quantitative, Continuous
• Interval scale: e.g. IQ, temperature. Outcomes are mutually exclusive and logically ordered, and differences are meaningful, but the zero point is arbitrary (e.g. 0 °C is the freezing point of water). We cannot say that 80 °C is twice as warm as 40 °C, or that an IQ of 100 means being twice as smart as an IQ of 50.
• Ratio scale: highest level of measurement, e.g. height, weight. Outcomes are mutually exclusive, there is a logical order, differences are meaningful and zero means absence of the trait. We can say e.g. that a tree of 3 m is twice as high as one of 1.5 m.

Page 4: Frhling Rijsdijk  Kate Morley

Ordinal data
The measuring instrument can discriminate only between two or a few ordered categories, e.g. absence or presence of a disease. Data take the form of counts, i.e. the number of individuals within each category:

Of 100 individuals: 90 ‘no’, 10 ‘yes’

[2 x 2 table with ‘no’/‘yes’ rows and columns; cell counts 8, 19, 18, 55]

Page 5: Frhling Rijsdijk  Kate Morley

Univariate Normal Distribution of Liability

Assumptions: (1) Underlying normal distribution of liability

(2) The liability distribution has 1 or more thresholds (cut-offs)

Page 6: Frhling Rijsdijk  Kate Morley

The standard Normal distribution

Liability is a latent variable, so its scale is arbitrary; the distribution is therefore assumed to be a Standard Normal Distribution (SND) or z-distribution:

• mean (μ) = 0 and SD (σ) = 1
• z-values are the number of SDs away from the mean
• area under the curve translates directly to probabilities

→ Normal probability density function (Φ)

[Figure: standard normal curve, z from -3 to 3; 68% of the area lies within 1 SD of the mean]

Page 7: Frhling Rijsdijk  Kate Morley

Standard Normal Cumulative Probability in right-hand tail
(For negative z values, areas are found by symmetry)

Z0     Area
0      .50     50%
.2     .42     42%
.4     .35     35%
.6     .27     27%
.8     .21     21%
1      .16     16%
1.2    .12     12%
1.4    .08     8%
1.6    .06     6%
1.8    .036    3.6%
2      .023    2.3%
2.2    .014    1.4%
2.4    .008    .8%
2.6    .005    .5%
2.8    .003    .3%
2.9    .002    .2%

Area = P(z ≥ zT) = \int_{z_T}^{\infty} \Phi(L_1; \mu = 0, \sigma^2 = 1) \, dL_1

[Figure: standard normal curve from -3 to 3 with the right-hand tail beyond zT shaded]
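The tail areas in the table can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.NormalDist`; the function name `right_tail_area` is made up for illustration and is not part of the workshop material.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, SD 1 (the assumed liability scale)
std_normal = NormalDist(mu=0.0, sigma=1.0)

def right_tail_area(z):
    """Area under the standard normal curve to the right of z, i.e. P(Z >= z)."""
    return 1.0 - std_normal.cdf(z)

# Print a few rows to compare with the table
for z in (0.0, 1.0, 1.2, 2.0):
    print(f"z = {z:.1f}  area = {right_tail_area(z):.3f}")

# Area within one SD of the mean
central_area = std_normal.cdf(1.0) - std_normal.cdf(-1.0)
print(f"area within 1 SD of the mean = {central_area:.3f}")
```

For a negative z the same function applies directly, which is the symmetry rule of the table in another form: the area to the right of -z equals 1 minus the area to the right of z.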

Page 8: Frhling Rijsdijk  Kate Morley

Example: From counts find z-value in Table

Z0     Area
.6     .27     27%
.8     .21     21%
1      .16     16%
1.2    .12     12%
1.4    .08     8%
1.6    .055    6%
1.8    .036    3.6%
2      .023    2.3%
2.2    .014    1.4%
2.4    .008    .8%
2.6    .005    .5%
2.8    .003    .3%
2.9    .002    .2%

For one variable it is possible to find a z-value (threshold) on the SND such that the area in the right-hand tail exactly matches the observed proportion in the sample. E.g. if, in a sample of 1000 individuals, 120 have met the criteria for a disorder (12%), the z-value is 1.2.

[Figure: standard normal curve from -3 to 3 with the threshold at 1.2; unaffected to the left, affected to the right]

Counts: unaffected 880, affected 120
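Instead of reading the table, the threshold can be obtained from the inverse of the standard normal cumulative distribution. The sketch below uses Python's standard library; the function name `threshold_from_counts` is an illustrative assumption.

```python
from statistics import NormalDist

def threshold_from_counts(n_unaffected, n_affected):
    """z-value that cuts off the affected proportion in the right-hand tail."""
    prevalence = n_affected / (n_unaffected + n_affected)
    # The threshold leaves (1 - prevalence) of the area to its left
    return NormalDist().inv_cdf(1.0 - prevalence)

z_t = threshold_from_counts(880, 120)
print(f"threshold z = {z_t:.2f}")
```

The table lookup gives 1.2 because the table is tabulated in steps of 0.2; the inverse function returns the threshold without that rounding.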

Page 9: Frhling Rijsdijk  Kate Morley

Two categorical traits: Data from twins

In an unselected sample of twins → contingency table (CT) with 4 observed cells:

cell a: number of pairs concordant for unaffected
cell d: number of pairs concordant for affected
cell b/c: number of pairs discordant for the disorder

Twin1/Twin2    0            1
0              545 (.76)    75 (.11)
1              56 (.08)     40 (.05)

0 = unaffected
1 = affected

Page 10: Frhling Rijsdijk  Kate Morley

Joint Liability Model for twin pairs

• Assumed to follow a bivariate normal distribution, where both traits have a mean of 0 and standard deviation of 1, but the correlation between them is unknown.

• The shape of a bivariate normal distribution is determined by the correlation between the traits

Page 11: Frhling Rijsdijk  Kate Morley

Bivariate Normal

r =.00 r =.90

Page 12: Frhling Rijsdijk  Kate Morley

Bivariate Normal (R=0.6) partitioned at threshold 1.4 (z-value) on both liabilities

Page 13: Frhling Rijsdijk  Kate Morley

How are expected proportions calculated?

By numerical integration of the bivariate normal over two dimensions: the liabilities for twin1 and twin2, e.g. the probability that both twins are affected:

\int_{T_1}^{\infty} \int_{T_2}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2

Φ is the bivariate normal probability density function, L1 and L2 are the liabilities of twin1 and twin2, with means 0, and Σ is the correlation matrix of the two liabilities. T1 is the threshold (z-value) on L1, T2 is the threshold (z-value) on L2.

Page 14: Frhling Rijsdijk  Kate Morley

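(0 0): \int_{-\infty}^{T_1} \int_{-\infty}^{T_2} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2

(0 1): \int_{-\infty}^{T_1} \int_{T_2}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2

(1 0): \int_{T_1}^{\infty} \int_{-\infty}^{T_2} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2

(1 1): \int_{T_1}^{\infty} \int_{T_2}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma) \, dL_1 \, dL_2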

Page 15: Frhling Rijsdijk  Kate Morley

How is numerical integration performed?

There are programmed mathematical subroutines that can do these calculations.

Mx uses one of them.

Page 16: Frhling Rijsdijk  Kate Morley

Expected Proportions of the BN, for R=0.6, Th1=1.4, Th2=1.4

Liab1/Liab2    0      1
0              .87    .05
1              .05    .03
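The expected proportions can be approximated with a short numerical integration. The sketch below is not the subroutine Mx uses. It is a simplified stand-in that reduces the two-dimensional integral to one dimension by conditioning on the first liability (given L1 = x, L2 is normal with mean R·x and variance 1 − R²) and then applies Simpson's rule. All function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, SD 1

def both_affected(t1, t2, rho, steps=400):
    """P(L1 > t1, L2 > t2) for standard bivariate normal liabilities, |rho| < 1."""
    s = math.sqrt(1.0 - rho * rho)   # SD of L2 given L1
    upper = max(t1, 0.0) + 8.0       # the tail beyond this point is negligible
    h = (upper - t1) / steps         # steps must be even for Simpson's rule

    def integrand(x):
        # density of L1 at x times P(L2 > t2 | L1 = x)
        return STD.pdf(x) * (1.0 - STD.cdf((t2 - rho * x) / s))

    total = integrand(t1) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t1 + i * h)
    return total * h / 3.0

def expected_proportions(t1, t2, rho):
    """Expected proportions in the four cells, keyed by (liab1, liab2) category."""
    p11 = both_affected(t1, t2, rho)
    p10 = (1.0 - STD.cdf(t1)) - p11  # first above threshold, second below
    p01 = (1.0 - STD.cdf(t2)) - p11  # first below threshold, second above
    p00 = 1.0 - p11 - p10 - p01
    return {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): p11}

cells = expected_proportions(1.4, 1.4, 0.6)
for cell, p in sorted(cells.items()):
    print(cell, round(p, 2))
```

The printed values can be compared with the table above. Only the (1 1) cell needs an integral here, because the other three follow from the two marginal tail areas.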

Page 17: Frhling Rijsdijk  Kate Morley

How can we estimate correlations from CT?

The correlation (shape) of the BN and the two thresholds determine the relative proportions of observations in the 4 cells of the CT.

Conversely, the sample proportions in the 4 cells can be used to estimate the correlation and the thresholds.

[Figures: bivariate normal contour plots divided by the two thresholds into the four cells a, b, c and d]

Twin2/Twin1    0    1
0              a    b
1              c    d
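As a rough illustration of going from counts to a correlation, the sketch below uses the twin counts shown earlier (545, 75, 56, 40). It is a simplified two-step procedure, not what Mx does: the thresholds are taken from the marginal proportions, and the correlation is then found by a grid search that maximises the multinomial log-likelihood. Mx estimates thresholds and correlation jointly by maximum likelihood. All function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def both_affected(t1, t2, rho, steps=200):
    """P(L1 > t1, L2 > t2) under a standard bivariate normal (Simpson's rule)."""
    s = math.sqrt(1.0 - rho * rho)
    upper = max(t1, 0.0) + 8.0
    h = (upper - t1) / steps

    def integrand(x):
        return STD.pdf(x) * (1.0 - STD.cdf((t2 - rho * x) / s))

    total = integrand(t1) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(t1 + i * h)
    return total * h / 3.0

def cell_probs(t1, t2, rho):
    """Expected proportions for cells a, b, c, d (row twin first)."""
    d = both_affected(t1, t2, rho)
    c = (1.0 - STD.cdf(t1)) - d   # row twin affected, column twin unaffected
    b = (1.0 - STD.cdf(t2)) - d   # row twin unaffected, column twin affected
    a = 1.0 - b - c - d
    return a, b, c, d

def estimate_correlation(a, b, c, d):
    """Two-step estimate: thresholds from the margins, then grid search on rho."""
    n = a + b + c + d
    t1 = STD.inv_cdf((a + b) / n)  # threshold for the row twin
    t2 = STD.inv_cdf((a + c) / n)  # threshold for the column twin
    counts = (a, b, c, d)

    def log_lik(rho):
        probs = cell_probs(t1, t2, rho)
        # floor guards against log(0) at extreme correlations
        return sum(k * math.log(max(p, 1e-12)) for k, p in zip(counts, probs))

    grid = [i / 100 for i in range(-95, 96)]
    return t1, t2, max(grid, key=log_lik)

t1, t2, rho = estimate_correlation(545, 75, 56, 40)
print(f"thresholds: {t1:.2f}, {t2:.2f}  correlation: {rho:.2f}")
```

The grid search is crude (steps of 0.01) but keeps the example short; a proper optimiser would refine the estimate and could free the thresholds as well.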

Page 18: Frhling Rijsdijk  Kate Morley

Summary
It is possible to estimate a correlation between categorical traits from simple counts (CT), because of the assumptions we make about their joint distribution: the Bivariate Normal.

The relative sample proportions in the 4 cells are translated to proportions under the BN so that the most likely correlation and the thresholds are derived.

Page 19: Frhling Rijsdijk  Kate Morley

ACE Liability Model

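[Path diagram: for Twin 1 and Twin 2, latent factors A, C and E each load on the liability L, which a threshold divides into Unaffected and Affected. Across twins the C factors correlate 1 and the A factors correlate 1 (MZ) or .5 (DZ).]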

Page 20: Frhling Rijsdijk  Kate Morley

How can we fit ordinal data in Mx?

Summary statistics: CT
Mx has a built-in fit function for the maximum-likelihood analysis of 2-way contingency tables
→ analyses limited to only two variables

Raw data analyses
- multivariate
- handles missing data
- moderator variables

Page 21: Frhling Rijsdijk  Kate Morley

ML of RAW Ordinal data
The log-likelihood of the data is the sum of the log-likelihoods of all observations. The likelihood of an observation is the expected proportion in the corresponding cell of the multivariate normal (MN).

The sum of the log-likelihoods of all observations is a value that (as for continuous data) is not very interpretable on its own, unless we compare it with the LL of other models or of a saturated model to get a chi-square index.
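The comparison can be sketched numerically. The example below uses the twin counts from the contingency-table slide. It treats the rounded proportions printed there (.76, .11, .08, .05) as a stand-in "model", purely for illustration, and compares its -2LL with that of the saturated model, whose expected proportions equal the observed ones. The function name is an assumption.

```python
import math

def minus_two_log_likelihood(counts, cell_probs):
    """-2 times the summed log-likelihood of all observations.

    Every observation in a cell contributes log(expected proportion of that cell),
    so a cell with n observations contributes n * log(p).
    """
    return -2.0 * sum(n * math.log(p) for n, p in zip(counts, cell_probs))

counts = [545, 75, 56, 40]                   # cells a, b, c, d
n_pairs = sum(counts)

saturated = [n / n_pairs for n in counts]    # observed proportions
rounded_model = [0.76, 0.11, 0.08, 0.05]     # illustrative stand-in model

m2ll_saturated = minus_two_log_likelihood(counts, saturated)
m2ll_model = minus_two_log_likelihood(counts, rounded_model)
chi_square = m2ll_model - m2ll_saturated     # difference in -2LL

print(f"-2LL saturated = {m2ll_saturated:.3f}")
print(f"-2LL model     = {m2ll_model:.3f}")
print(f"difference     = {chi_square:.3f}")
```

Neither -2LL value means much by itself; only the difference, which can never be negative when the comparison model is the saturated one, is interpreted.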

Page 22: Frhling Rijsdijk  Kate Morley

Raw Ordinal Data

       ordinal     ordinal
Zyg    respons1    respons2
1      0           0
1      0           0
1      0           1
2      1           0
2      0           0
1      1           1
2      .           1
2      0           .
2      0           1

NOTE: smallest category should always be 0 !!

Page 23: Frhling Rijsdijk  Kate Morley

SORT !
We can speed up computation considerably when the data are sorted, since if case i+1 = case i, the likelihood is NOT recalculated. In e.g. the bivariate, 2-category case, there are only 4 possible vectors of observations: 1 1, 0 1, 1 0, 0 0 and, therefore, only 4 integrals for Mx to calculate if the data file is sorted.
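The saving can be illustrated by grouping identical response vectors before evaluating their likelihood. This is a sketch of the idea only, not of Mx internals. The observations are the complete pairs from the raw-data example, the cell proportions are hypothetical values, and `cell_probability` stands in for an expensive integration.

```python
import math
from collections import Counter

# Hypothetical expected proportions per response vector (illustration only)
CELL_PROBS = {(0, 0): 0.87, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.03}
evaluations = 0

def cell_probability(pattern):
    """Stand-in for an expensive numerical integration; counts how often it runs."""
    global evaluations
    evaluations += 1
    return CELL_PROBS[pattern]

# Complete pairs (respons1, respons2) from the raw-data example
observations = [(0, 0), (0, 0), (0, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

# Group identical vectors: each distinct pattern is evaluated once
pattern_counts = Counter(observations)
log_lik = sum(n * math.log(cell_probability(pattern))
              for pattern, n in sorted(pattern_counts.items()))

print(f"{len(observations)} observations, {evaluations} evaluations")
print(f"log-likelihood = {log_lik:.4f}")
```

Sorting the data file achieves the same effect in Mx, because identical cases then sit next to each other and the previous result can be reused.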

Page 24: Frhling Rijsdijk  Kate Morley

Practical

Page 25: Frhling Rijsdijk  Kate Morley

Sample and Measures
• Australian Twin Registry data (QIMR)
• Self-report questionnaire
  – Non-smoker, ex-smoker, current smoker
  – Age of smoking onset
• Large sample of adult twins + family members
  – Today using MZMs (785 pairs) and DZMs (536 pairs)

Page 26: Frhling Rijsdijk  Kate Morley

• Variable: age at smoking onset, including non-smokers
• Ordered as:
  – Non-smokers / late onset / early onset

[Bar chart, vertical axis 0 to 1400, with bars for: Non-smoker, Onset after 18, Onset before or by 18]

Page 27: Frhling Rijsdijk  Kate Morley

Practical Exercise
Analysis of age of onset data
- Estimate thresholds
- Estimate correlations
- Fit univariate model

Observed counts from ATR data:

       MZM                       DZM
       0      1      2           0      1      2
0      368    24     46          203    22     63
1      26     15     21          17     5      16
2      54     22     209         65     12     133

Page 28: Frhling Rijsdijk  Kate Morley

Threshold Specification in Mx

Threshold Model T /

2 Categories
Matrix T: 1 x 2
T(1,1) T(1,2): threshold 1 for twin1 & twin2

[Figure: standard normal curve from -3 to 3 with a single threshold, marked at -1]

Page 29: Frhling Rijsdijk  Kate Morley

Threshold Specification in Mx

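Threshold Model L*T /

L * T = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix} = \begin{pmatrix} t_{11} & t_{12} \\ t_{11} + t_{21} & t_{12} + t_{22} \end{pmatrix}

3 Categories
Matrix T: 2 x 2
T(1,1) T(1,2): threshold 1 for twin1 & twin2
T(2,1) T(2,2): increment

[Figure: standard normal curve from -3 to 3 with thresholds at -1 and 1.2; the increment between them is 2.2]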
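The effect of pre-multiplying by L can be checked by hand or with a few lines of Python. The sketch below uses the values from the figure (first threshold -1, increment 2.2) for both twins; the helper `mat_mul` is written out for illustration so that no external library is needed.

```python
def mat_mul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# L: lower-triangular matrix of ones, which sums the rows of T cumulatively
L = [[1, 0],
     [1, 1]]

# T: row 1 holds the first thresholds, row 2 the increments (twin1, twin2)
T = [[-1.0, -1.0],
     [2.2, 2.2]]

thresholds = mat_mul(L, T)
for row in thresholds:
    print([round(t, 2) for t in row])
```

Because the second row of T is an increment, bounding it to be positive is enough to keep the second threshold above the first, whatever value the first one takes.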

Page 33: Frhling Rijsdijk  Kate Morley

polycor_smk.mx

#define nvarx2 2                  ! Number of variables per pair
#define nthresh 2                 ! Number of thresholds = num of cat - 1
#ngroups 2

G1: Data and model for MZM correlation
DAta NInput_vars=3 Missing=.
Ordinal File=smk_prac.ord         ! Ordinal data file
Labels zyg ageon_t1 ageon_t2
SELECT IF zyg = 2
SELECT ageon_t1 ageon_t2 /

Begin Matrices;
 R STAN nvarx2 nvarx2 FREE        ! Correlation matrix
 T FULL nthresh nvarx2 FREE       ! thresh tw1, thresh tw2
 L Lower nthresh nthresh          ! Sums threshold displacements
End matrices;

Value 1 L 1 1 to L nthresh nthresh   ! initialize L

Page 36: Frhling Rijsdijk  Kate Morley

COV R /                           ! Predicted correlation matrix for MZ pairs

Thresholds L*T /                  ! Threshold model, to ensure t1 < t2 < t3 etc.

Bound 0.01 1 T 1 1 T 1 2
Bound 0.1 5 T 2 1 T 2 2           ! Ensures positive threshold displacement
Start 0.2 T 1 1 T 1 2             ! Starting values for the 1st thresholds
Start 0.2 T 2 1 T 2 2             ! Starting values for the 2nd thresholds
Start .6 R 2 1                    ! Starting value for the correlation
Option RS
Option func=1.E-10                ! Function precision is less than usual
END

Page 37: Frhling Rijsdijk  Kate Morley

! Test equality of thresholds between Tw1 and Tw2
EQ T 1 1 1 T 1 1 2                ! constrain TH1 to be equal across Tw1 and Tw2 MZM
EQ T 1 2 1 T 1 2 2                ! constrain TH2 to be equal across Tw1 and Tw2 MZM
EQ T 2 1 1 T 2 1 2                ! constrain TH1 to be equal across Tw1 and Tw2 DZM
EQ T 2 2 1 T 2 2 2                ! constrain TH2 to be equal across Tw1 and Tw2 DZM
End

Get cor.mxs

! Test equality of thresholds between MZM & DZM
EQ T 1 1 1 T 1 1 2 T 2 1 1 T 2 1 2   ! constrain TH1 to be equal across all males
EQ T 1 2 1 T 1 2 2 T 2 2 1 T 2 2 2   ! constrain TH2 to be equal across all males
End

Page 38: Frhling Rijsdijk  Kate Morley

Exercise I
• Fit saturated model
  – Estimates of thresholds
  – Estimates of polychoric correlations
• Test equality of thresholds
  – Examine differences in threshold and correlation estimates for saturated model and sub-models
• Examine correlations
  – What model should we fit?

Raw ORD File: smk_prac.dat
Script: polychor_smk.mx
Location: kate\Ordinal_Practical

Page 39: Frhling Rijsdijk  Kate Morley

Estimates: smoking age-at-onset

Saturated model: -2LL = 5128.185, df = 3055

       MZ                  DZ
       Twin 1    Twin 2    Twin 1    Twin 2
Th1    0.09      0.12      0.03      0.05
Th2    0.31      0.33      0.24      0.26
Cor    1                   1
       0.81      1         0.55      1

Page 40: Frhling Rijsdijk  Kate Morley

Estimates: smoking age-at-onset

Sub-model 1: Δχ² = ....  Δdf = ....  P = ....

       MZ                  DZ
       Twin 1    Twin 2    Twin 1    Twin 2
Th1
Th2
Cor

Sub-model 2: Δχ² = ....  Δdf = ....  P = ....

       MZ                  DZ
       Twin 1    Twin 2    Twin 1    Twin 2
Th1
Th2
Cor

Page 41: Frhling Rijsdijk  Kate Morley

Estimates: smoking age-at-onset

Sub-model 1: Δχ² = 0.77, Δdf = 4, P = 0.94

       MZ                  DZ
       Twin 1    Twin 2    Twin 1    Twin 2
Th1    0.10      0.10      0.04      0.04
Th2    0.32      0.32      0.25      0.25
Cor    1                   1
       0.81      1         0.55      1

Sub-model 2: Δχ² = 2.44, Δdf = 6, P = 0.88

       MZ                  DZ
       Twin 1    Twin 2    Twin 1    Twin 2
Th1    0.07      0.07      0.07      0.07
Th2    0.29      0.29      0.29      0.29
Cor    1                   1
       0.81      1         0.55      1
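The P values in the table follow from the chi-square distribution with Δdf degrees of freedom. The sketch below uses the closed-form tail probability that holds when the degrees of freedom are even, which covers both sub-models. The function name is an assumption, and a statistics package would normally be used instead.

```python
import math

def chi_square_p_value_even_df(chi_square, df):
    """Upper-tail probability of a chi-square variate; valid for even df only.

    For df = 2k: P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = chi_square / 2.0
    term = 1.0    # j = 0 term of the sum
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

p_sub1 = chi_square_p_value_even_df(0.77, 4)   # thresholds equal across twins
p_sub2 = chi_square_p_value_even_df(2.44, 6)   # thresholds equal across twins and zygosity

print(f"Sub-model 1: P = {p_sub1:.2f}")
print(f"Sub-model 2: P = {p_sub2:.2f}")
```

Both values are far above conventional significance levels, so neither set of equality constraints worsens the fit significantly.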

Page 42: Frhling Rijsdijk  Kate Morley

ACEcat_smk.mx

#define nvar 1                    ! number of variables per twin
#define nvarx2 2                  ! number of variables per pair
#define nthresh 1                 ! number of thresholds = num of cat - 1
#ngroups 4                        ! number of groups in script

G1: Parameters for the Genetic model
Calculation
Begin Matrices;
 X Low nvar nvar FREE             ! Additive genetic path coefficient
 Y Low nvar nvar FREE             ! Common environmental path coefficient
 Z Low nvar nvar FREE             ! Unique environmental path coefficient
End matrices;
Begin Algebra;
 A=X*X' ;                         ! Additive genetic variance (path X squared)
 C=Y*Y' ;                         ! Common Environm variance (path Y squared)
 E=Z*Z' ;                         ! Unique Environm variance (path Z squared)
End Algebra;
start .6 X 1 1 Y 1 1 Z 1 1        ! starting value for X, Y, Z
Interval @95 A 1 1 C 1 1 E 1 1    ! requests the 95% CI for h2, c2, e2
End

Page 43: Frhling Rijsdijk  Kate Morley

G2: Data and model for MZ pairs
DAta NInput_vars=3 Missing=.
Ordinal File=prac_smk.ord
Labels zyg ageon_t1 ageon_t2
SELECT IF zyg = 2
SELECT ageon_t1 ageon_t2 /
Matrices = group 1
 T FULL nthresh nvarx2 FREE       ! Thresh tw1, thresh tw2
 L Lower nthresh nthresh

COV                               ! Predicted covariance matrix for MZ pairs
 ( A + C + E | A + C _
   A + C | A + C + E ) /
Thresholds L*T /                  ! Threshold model

Bound 0.01 1 T 1 1 T 1 2
Bound 0.1 5 T 2 1 T 2 2           ! Ensures positive threshold displacement
Start 0.1 T 1 1 T 1 2             ! Starting values for the 1st thresholds
Start 0.2 T 2 1 T 2 2             ! Starting values for the 2nd thresholds
Option rs
End

Page 44: Frhling Rijsdijk  Kate Morley

G3: Data and model for DZ pairs
DAta NInput_vars=3 Missing=.
Ordinal File=prac_smk.ord
Labels zyg ageon_t1 ageon_t2
SELECT IF zyg = 4
SELECT ageon_t1 ageon_t2 /
Matrices = group 1
 T FULL nthresh nvarx2 FREE       ! Thresh tw1, thresh tw2
 L Lower nthresh nthresh
 H FULL 1 1                       ! .5

COVARIANCE                        ! Predicted covariance matrix for DZ pairs
 ( A + C + E | H@A + C _
   H@A + C | A + C + E ) /
Thresholds L*T /                  ! Threshold model

Bound 0.1 1 T 1 1 T 1 2
Bound 0.1 5 T 2 1 T 2 2           ! Ensures positive threshold displacement
Start 0.1 T 1 1 T 1 2             ! Starting values for the 1st thresholds
Start 0.2 T 2 1 T 2 2             ! Starting values for the 2nd thresholds
Option rs
End

Page 45: Frhling Rijsdijk  Kate Morley

G4: CONSTRAIN VARIANCES OF OBSERVED VARIABLES TO 1
CONSTRAINT
Matrices = Group 1
 I UNIT 1 1
CO A+C+E= I /                     ! constrains the total variance to equal 1
Option func=1.E-10
End

Constraint groups and degrees of freedom

As the total variance is constrained to unity, we can estimate one VC from the other two, giving us one less independent parameter:

A + C + E = 1 therefore E = 1 - A - C

So each constraint group adds a degree of freedom to the model.
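The unit-variance constraint can be illustrated with the polychoric correlations estimated in Exercise I (0.81 for MZ, 0.55 for DZ). The sketch below does not fit the model by maximum likelihood as the Mx script does. It only solves the expected correlations of the ACE model (MZ: A + C, DZ: .5 A + C, as in the covariance statements of groups 2 and 3) together with A + C + E = 1. The function name is an assumption.

```python
def ace_from_correlations(r_mz, r_dz):
    """Solve r_mz = A + C, r_dz = 0.5*A + C, A + C + E = 1 for A, C and E."""
    a = 2.0 * (r_mz - r_dz)   # subtracting the two equations gives 0.5*A
    c = r_mz - a              # what remains of the MZ correlation
    e = 1.0 - a - c           # fixed by the unit-variance constraint
    return a, c, e

a, c, e = ace_from_correlations(0.81, 0.55)
print(f"A = {a:.2f}, C = {c:.2f}, E = {e:.2f}, total = {a + c + e:.2f}")
```

Once A and C are known, E is no longer free, which is the sense in which the constraint removes one independent parameter.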

Page 46: Frhling Rijsdijk  Kate Morley

Exercise II
• Fit ACE model
  – What does the threshold model look like?
  – Change it to reflect the findings from exercise I

Raw ORD File: smk_prac.dat
Script: ACEcat_smk.mx
Location: kate\Ordinal_Practical