Marshall University School of Medicine, Department of Biochemistry and Microbiology
BMS 617, Lecture 11: Models
Marshall University Genomics Core Facility
Feb 23, 2016
Transcript
Page 1: BMS 617

Marshall University Genomics Core Facility

Marshall University School of Medicine, Department of Biochemistry and Microbiology

BMS 617

Lecture 11: Models

Page 2: BMS 617


What is a model?

• In general, a model is a (simpler) representation of something else
  – We use models to study complex phenomena
  – Easier to manipulate than the real thing of interest
  – Easier to focus on specific aspects
  – E.g., we use mouse models to study human disease
    • Easier to control the behavior of the mouse
    • Easier to control genetics…

Page 3: BMS 617


What is a mathematical model?

• A mathematical model is an equation (or set of equations) that describes a physical state or process
  – Describes how the values in the state or process are related to each other
• The aim is not to provide a perfect model
  – A good model is simple enough to be easy to understand
  – Yet complex enough to be useful

Page 4: BMS 617


Statistical Models

• Statistical models are mathematical models that describe both the ideal predictions and the random "scatter" or "noise"
  – They model both the population values and the "random" variation from the population values
  – "Random" variation is really just variation not explained or accounted for by the model

Page 5: BMS 617


Model terminology

• A model is an equation (or set of equations)
• The equation defines the outcome, or dependent variable, as a function of
  – one or more independent variables, and
  – one or more parameters
• Each data point has its own values for the independent and dependent variables
• The values of the parameters are properties of the population
  – They do not vary from data point to data point

Page 6: BMS 617


Fitting a model to data

• The parameters are properties of the population
  – They are unknown
• Typically, we collect a sample of data points
• Assuming the model is correct, we can use the sample to estimate the parameters of the model
  – This is called "fitting a model to the data"
  – It results in estimates and confidence intervals for each of the parameters

Page 7: BMS 617


Simplest possible model

• The simplest possible model for a data set involves no independent variable!
• We sample values from a population
• We assume the population values follow a Normal distribution
• Our model is Y = μ + ε

Page 8: BMS 617


Average as a model

• In the simple model Y = μ + ε:
  – Y is the dependent variable
    • A different value for each data point
  – μ is a parameter
    • The mean of the population
    • A single, unknown value we will estimate from our data
  – ε is the "random error"
    • Different for each data point; assumed normally distributed with mean zero
• We can make the roles of the variable types more explicit by writing Yi = μ + εi

Page 9: BMS 617


Why the mean is important

• If we assume the model is correct:
  – Our data are sampled from a population where the values are some fixed value, plus some scatter that is normally distributed with mean zero
• then we want to use our data to estimate μ
• It turns out that the value of μ that makes our observed data the most likely, out of all possible choices of μ, is the mean of our data
  – The mean is the maximum likelihood estimate of μ
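This can be checked numerically. The sketch below uses made-up sample values and an arbitrary σ (the value of μ that minimizes the negative log-likelihood does not depend on σ):

```python
import math

# Hypothetical sample values, for illustration only
data = [4.1, 5.0, 3.8, 4.6, 4.9]

def neg_log_likelihood(mu, sigma=1.0):
    """Negative log-likelihood of the data under Y_i = mu + eps_i,
    with eps_i ~ Normal(0, sigma^2)."""
    n = len(data)
    ss = sum((y - mu) ** 2 for y in data)
    return n * math.log(sigma * math.sqrt(2 * math.pi)) + ss / (2 * sigma ** 2)

mean = sum(data) / len(data)
# The sample mean beats every other candidate value of mu
for candidate in [mean - 0.5, mean + 0.5, 0.0, 10.0]:
    assert neg_log_likelihood(mean) < neg_log_likelihood(candidate)
```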

Page 10: BMS 617


A more sophisticated model: linear regression

• Remember the example from linear regression:
  – We measured insulin sensitivity and %C20-22 content in 13 healthy men
  – We hypothesized that an increase in %C20-22 content caused an increase in insulin sensitivity
  – We used linear regression to fit the model
      Y = intercept + slope × X + scatter
    to the data
• Y is the insulin sensitivity, X the %C20-22 content
• In more conventional notation:
    Y = β0 + β1 × X + ε, or
    Yi = β0 + β1 × Xi + εi

Page 11: BMS 617


Linear regression as a statistical model

• The linear regression model has two parameters:
  – β0, the intercept
  – β1, the slope
• These are both properties of the population
• We use the data to estimate them
  – This uses the method of "least squares"
  – It gives the maximum likelihood estimates for the two parameters
    • The values of the parameters that maximize the chances of our data being observed

Page 12: BMS 617


Recap of models

• The linear regression in this example gave an estimate of 37.2 for the slope and -486.5 for the intercept
• Our estimated model is
    Insulin sensitivity = 37.2 × %C20-22 - 486.5 + ε
• The model is not assumed to be perfect!
• It is simple, but powerful enough to draw some basic conclusions
  – Within the range of the data, an increase of one unit in %C20-22 results, on average, in an increase of 37.2 units in insulin sensitivity
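The least-squares estimates have a closed form. A minimal sketch with made-up (x, y) pairs (the slides do not list the actual insulin-sensitivity data, so these numbers are purely illustrative):

```python
# Hypothetical data standing in for (%C20-22, insulin sensitivity) pairs
xs = [18.0, 19.0, 20.0, 21.0, 22.0]
ys = [150.0, 210.0, 380.0, 400.0, 350.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope estimate: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
# Intercept estimate: the fitted line passes through (x_bar, y_bar)
beta0 = y_bar - beta1 * x_bar

# A property of the least-squares fit: the residuals sum to zero
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
assert abs(sum(residuals)) < 1e-6
```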

Page 13: BMS 617


Other types of model

• We will look at other types of model in upcoming lectures:
  – Multiple regression
    • More than one independent variable
  – Logistic regression
    • The outcome variable is binary, with one or more independent variables
  – Proportional hazards regression
    • The outcome variable is a survival time, with one or more independent variables

Page 14: BMS 617


Comparing Models

• In the linear regression example, we also computed a p-value
  – The null hypothesis was that the slope was zero
  – I.e., we compared the model
      Y = β0 + β1 × X + ε
    to
      Y = β0 + ε
• So we can think of this statistical test as a comparison between two models
  – In fact, we can think of most (perhaps all) statistical tests as comparisons between two models

Page 15: BMS 617


Hypothesis test of linear regression as a comparison of models

Page 16: BMS 617


Why model comparison is not straightforward

• It is not enough just to compare the "residuals" between the two models
  – Remember, the residuals are the error terms in the model
  – A model with more parameters will always come closer to the data
    • However, the confidence intervals will be wider
    • So the model will be less useful for predicting future values

Page 17: BMS 617


Comparing the models and R2

Hypothesis                 Distance measured from   Sum of squares   Percentage of variation
Null                       Mean                     155,642.3        100%
Linear relationship        Straight line             63,361.37        40.7%
Difference (improvement)                             92,280.93        59.3%

• The total sum of squares of the distances of the points from the mean (i.e., the total variation) is 155,642.3
• The total sum of squares of the residuals is 63,361.37
• The difference between these is 92,280.93, which is 59.3% of the total
• So the linear model results in an improvement which is 59.3% of the total variation: this is the definition of R²: R² = 0.593
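The R² computation from these sums of squares is simple arithmetic:

```python
# Sums of squares from the table above
ss_total = 155_642.3     # distances of the points from the mean (null model)
ss_residual = 63_361.37  # distances of the points from the regression line

# R^2 is the fraction of the total sum of squares "explained" by the regression
r_squared = (ss_total - ss_residual) / ss_total
print(round(r_squared, 3))  # 0.593
```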

Page 18: BMS 617


Interpreting the difference in variance

• With a little algebra, you can show that the difference between the total variance and the sum of the squares of the residuals is the sum of the squares of the distance between the regression line and the mean

• So the regression line “accounts for 59.3% of the variance”

Page 19: BMS 617


Computing a p-value for model comparison

• To compute a p-value for the comparison of models, we look at both the sum of squares for each model and the degrees of freedom for each model
  – The number of degrees of freedom is the number of data points minus the number of parameters in the model
  – We had 13 data points, so there are 12 degrees of freedom for the null hypothesis model, and 11 degrees of freedom for the linear model

Page 20: BMS 617


Mean squares and F-ratio

Source of variation   Sum of squares   Degrees of freedom   Mean squares   F-ratio
Regression             92,281           1                   92,281         16.0
Random                 63,361          11                    5,760
Total                 155,642          12

• This is the same data presented in the format of an ANOVA (we will see this later)
• "Total" represents the total variation in the data
• "Random" is the variation in the data around the regression line
• "Regression" is the difference between them: the sum of squares of the distances from the regression line to the mean
• The "mean squares" is the sum of squares divided by the degrees of freedom
• The F-ratio is the ratio of the mean squares
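The mean squares and F-ratio columns can be reproduced directly from the sums of squares and degrees of freedom:

```python
# Sums of squares and degrees of freedom from the table above
ss_regression, df_regression = 92_281, 1
ss_random, df_random = 63_361, 11

ms_regression = ss_regression / df_regression  # mean square = SS / df
ms_random = ss_random / df_random              # about 5,760
f_ratio = ms_regression / ms_random
print(round(f_ratio, 1))  # 16.0
```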

Page 21: BMS 617


Computing a p-value

• The null hypothesis is that the "horizontal line model" is the correct model
  – I.e., the slope in the regression model is zero
• If the null hypothesis were true, the F-ratio would be close to 1 (this is not obvious!)
• The distribution of the F-ratio, assuming the null hypothesis, is known
  – It is called the F-distribution
  – It depends on two different degrees of freedom
  – So a p-value can be computed
• The p-value in this example is p = 0.0021
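A sketch of this p-value computation using only the standard library: for an F-distribution with (1, 11) degrees of freedom, P(F > f) equals the two-sided tail of a t-distribution with 11 degrees of freedom at t = √f, which we can approximate by numerically integrating the t density (in practice one would use a statistics library instead):

```python
import math

def t_tail(t, df, upper=60.0, steps=20_000):
    """Approximate P(T > t) for Student's t with df degrees of freedom,
    by trapezoidal integration of the density over [t, upper]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    def pdf(x):
        return c * (1.0 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    total = 0.5 * (pdf(t) + pdf(upper))
    total += sum(pdf(t + i * h) for i in range(1, steps))
    return total * h

# F = 16.0 with (1, 11) df corresponds to t = sqrt(16.0) = 4.0 with 11 df
p_value = 2 * t_tail(math.sqrt(16.0), df=11)  # approximately 0.002
```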

Page 22: BMS 617


Recap

• We re-examined the linear regression example and re-cast it as a comparison of statistical models
• We can compute a p-value for the null hypothesis that the simpler model is "correct"
  – "As correct as the more complex model"
• This is the same p-value we computed before
• The R² value is the proportion of variance "explained by" the regression
• We can do the same for other statistical tests!

Page 23: BMS 617


A t-test considered as a comparison of models

• Recall the GRHL2 expression in Basal-A and Basal-B cancer cells
• We can re-cast this as a linear regression…
  – Let x = 0 for Basal-A cells and x = 1 for Basal-B cells
• Our linear model is
    Expression = β0 + β1 × x + ε
  with the null hypothesis
    Expression = β0 + ε
• What is β1?
  – Slope = increase in expression for an increase of one unit in x
  – = the difference in expression between Basal-A and Basal-B
  – = the difference in means…
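The claim that the slope equals the difference in group means can be verified with the least-squares slope formula. The expression values below are made up for illustration (not the actual GRHL2 data):

```python
# Hypothetical expression values for the two groups
basal_a = [2.1, 1.7, 2.4, 1.9, 2.0]   # coded x = 0
basal_b = [0.3, 0.1, -0.2, 0.5, 0.0]  # coded x = 1

xs = [0] * len(basal_a) + [1] * len(basal_b)
ys = basal_a + basal_b

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)

# With 0/1 coding, the least-squares slope is exactly the difference in group means
diff_in_means = sum(basal_b) / len(basal_b) - sum(basal_a) / len(basal_a)
assert abs(slope - diff_in_means) < 1e-9
```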

Page 24: BMS 617


t-test as a comparison of models

Page 25: BMS 617


Results of running the t-test as a comparison of models

• Running the linear regression gives an estimate of 1.933 for the intercept and -1.861 for the slope
• The table of variances is:

Model        Sum of squares   DF   Mean squares   F-ratio
Regression   23.335            1   23.335         55.993
Residual     10.419           25    0.417
Total        33.753           26

Page 26: BMS 617


Interpreting the table of variances

• The total sum of squares (33.753) is the sum of squares of the differences between each value and the overall mean
  – This, divided by the df (33.753/26 = 1.298), is the sample variance
• The residual sum of squares is the sum of the squares of each expression value minus its predicted value
  – The predicted value is just the mean for its basal type
  – This is the "within-group" variation
• The regression sum of squares is the sum of squares of the differences between the predicted values and the overall mean
  – This is the sum of squares of the differences between the group means and the overall mean
    • One squared difference for each data point
• These interpretations will be really useful when we study ANOVA
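The decomposition described above (total sum of squares = residual + regression) is an exact identity, which a short sketch with made-up two-group data confirms:

```python
# Hypothetical two-group data, for illustration only
group_a = [2.3, 1.8, 2.6, 2.1]
group_b = [0.4, 0.9, 0.2, 0.6]

values = group_a + group_b
grand_mean = sum(values) / len(values)
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

# Total: each value minus the overall mean
ss_total = sum((v - grand_mean) ** 2 for v in values)
# Residual: each value minus its own group mean (within-group variation)
ss_within = sum((v - mean_a) ** 2 for v in group_a) + \
            sum((v - mean_b) ** 2 for v in group_b)
# Regression: each point's predicted value (its group mean) minus the grand mean
ss_between = len(group_a) * (mean_a - grand_mean) ** 2 + \
             len(group_b) * (mean_b - grand_mean) ** 2

# The decomposition is exact: total = within + between
assert abs(ss_total - (ss_within + ss_between)) < 1e-9
```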