A primer in Biostatistics Christina M. Ramirez UCLA Department of Biostatistics.

Dec 22, 2015

Transcript
Page 1: A primer in Biostatistics Christina M. Ramirez UCLA Department of Biostatistics.

A primer in Biostatistics

Christina M. Ramirez

UCLA Department of Biostatistics

Page 2

Statistics

• Data Collection
• Summarizing Data
• Interpreting Data
• Drawing Conclusions from Data

Page 3

Population

The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought.

Example: Unemployment - Status of ALL employable people (employed, unemployed) in the country.

Page 4

Sample

A subset of the population data that are actually collected in the course of a study.

Example: Unemployment - Status of the 1000 employable people interviewed.

Page 5

Population vs. Sample

[Figure: diagram labeled Population and Sample]

In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.
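The idea of estimating a population quantity from a sample can be sketched in a few lines of Python. This is only an illustration: the population size and the 6% unemployment rate are made-up numbers, not figures from the slides.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 employable people, 6% unemployed (made-up figure).
population = ["unemployed"] * 6_000 + ["employed"] * 94_000

# Draw a simple random sample of 1000 people, without replacement.
sample = random.sample(population, 1000)

population_rate = population.count("unemployed") / len(population)
sample_rate = sample.count("unemployed") / len(sample)

print(f"population rate: {population_rate:.3f}")
print(f"sample estimate: {sample_rate:.3f}")
```

The sample estimate will usually land close to the population rate without matching it exactly, which is why inferences from samples carry some error.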

Page 6

Descriptive statistics

Describing data with numbers: measures of location

Page 7

What to describe?

What is the “location” or “center” of the data? (“measures of location”)
• Mean
• Median
• Mode

How do the data vary? (“measures of variability”)
• Range
• Interquartile Range
• Variance

Page 8

Mean

• Another name for average.
• Appropriate for describing measurement data.
• Seriously affected by unusual values called “outliers”.

Page 9

Calculating Sample Mean

Add up all of the data points and divide by the number of data points.

Example:

Number of drinks/day: 2 8 3 4 1

Sample Mean = (2+8+3+4+1)/5 = 3.6
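The same calculation can be checked in Python. This is a minimal sketch using the drinks/day data from the slide; the variable names are my own.

```python
import statistics

drinks_per_day = [2, 8, 3, 4, 1]  # data from the slide

# Add up all of the data points and divide by the number of data points.
sample_mean = sum(drinks_per_day) / len(drinks_per_day)

print(sample_mean)                      # 3.6
print(statistics.mean(drinks_per_day))  # same value from the standard library
```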

Page 10

Median

• Another name for 50th percentile.
• Appropriate for describing measurement data.
• “Robust to outliers,” that is, not affected much by unusual values.

Page 11

Calculating Sample Median

Order data from smallest to largest.

If odd number of data points, the median is the middle value.

If even number of data points, the median is the average of the two middle values.

Number of drinks/day: 2 8 3 4 1

Ordered Data: 1 2 3 4 8

Median = 3
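A short Python sketch can show both the median calculation and what "robust to outliers" means in practice. The second list is hypothetical: it pretends the value 8 was recorded as 80.

```python
import statistics

drinks_per_day = [2, 8, 3, 4, 1]   # data from the slide
with_outlier = [2, 80, 3, 4, 1]    # hypothetical: 8 recorded as 80

print(sorted(drinks_per_day))      # [1, 2, 3, 4, 8]

median_original = statistics.median(drinks_per_day)
median_outlier = statistics.median(with_outlier)
mean_original = statistics.mean(drinks_per_day)
mean_outlier = statistics.mean(with_outlier)

print(median_original, median_outlier)  # the median is 3 in both cases
print(mean_original, mean_outlier)      # the mean moves from 3.6 to 18
```

One extreme value leaves the median where it was, while the mean is pulled far away from most of the data.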

Page 12

Mode

• The value that occurs most frequently.
• One data set can have many modes.
• Appropriate for all types of data, but most useful for categorical data or discrete data with only a small number of possible values.

Example: Number of eyes affected with cataracts in 70 year olds: 0, 1, 2.
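As a rough illustration of a data set with more than one mode, the sketch below counts values in a small list. The records are invented for the example, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics
from collections import Counter

# Hypothetical records (made up for illustration): eyes affected per person.
eyes_affected = [0, 0, 1, 2, 2, 0, 1, 2]

counts = Counter(eyes_affected)               # how often each value occurs
modes = statistics.multimode(eyes_affected)   # every value tied for most frequent

print(counts[0], counts[1], counts[2])  # 3 2 3
print(modes)                            # [0, 2]
```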

Page 13

The most appropriate measure of location depends on … the shape of the data’s distribution.

Page 14

Most appropriate measure of location

• Depends on whether or not data are “symmetric” or “skewed”.
• Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Page 15

Symmetric and Unimodal

[Figure: histogram of GPAs (x-axis: GPAs, 2.0 to 4.0; y-axis: Percent)]

Page 16

Symmetric and Bimodal

Page 17

Symmetric and Bimodal

Page 18

Skewed Right

[Figure: histogram, Number of Music CDs of Spring 1998 Stat 250 Students (x-axis: Number of Music CDs, 0 to 400; y-axis: Frequency)]

Page 19

Choosing Appropriate Measure of Location

• If data are symmetric and unimodal, the mean, median, and mode will be approximately the same.
• If data are multimodal, report the mean, median and/or mode for each subgroup.
• If data are skewed, report the median.

Page 20

Descriptive statistics

Describing data with numbers: measures of variability

• Range
• Interquartile range
• Variance and standard deviation

Page 21

Range

• The difference between largest and smallest data point.
• Highly affected by outliers.
• Best for symmetric data with no outliers.

Page 22

What is the range?

[Figure: histogram, GPAs of Spring 1998 Stat 250 Students (x-axis: GPA, 2.0 to 4.0; y-axis: Frequency)]

Page 23

Interquartile range

• The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, it spans the “middle half” of the values.
• IQR = Q3 - Q1
• Robust to outliers or extreme observations.
• Works well for skewed data.

Page 24

Interquartile range

Descriptive Statistics

Variable  N   Mean    Median  TrMean  StDev   SE Mean
GPA       92  3.0698  3.1200  3.0766  0.4851  0.0506

Variable  Minimum  Maximum  Q1      Q3
GPA       2.0200   3.9800   2.6725  3.4675

IQR = Q3 - Q1 = 3.4675 - 2.6725 = 0.795
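The IQR arithmetic, and the contrast between the IQR and the range when an extreme value is present, can be sketched in Python. The helper `iqr` and the two small data lists are my own illustrations, not part of the slides. Software packages use slightly different quartile conventions, so quartiles computed this way may not match another program's output exactly.

```python
import statistics

# Quartiles reported for GPA in the slide's output.
q1, q3 = 2.6725, 3.4675
print(round(q3 - q1, 4))  # 0.795

def iqr(values):
    """Interquartile range, Q3 - Q1, using statistics.quantiles defaults."""
    first, _, third = statistics.quantiles(values, n=4)
    return third - first

# Hypothetical data (made up): the second list swaps the largest value
# for an extreme one.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

print(iqr(data), max(data) - min(data))           # 6.0 10
print(iqr(extreme), max(extreme) - min(extreme))  # 6.0 999
```

The extreme value changes the range from 10 to 999 but leaves the IQR untouched.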

Page 25

Variance

1. Find difference between each data point and mean.

2. Square the differences, and add them up.

3. Divide by one less than the number of data points.
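The three steps above can be followed literally in Python. This sketch reuses the drinks/day data from the earlier slide; the variable names are my own, and `statistics.variance` is included only as a cross-check.

```python
import statistics

data = [2, 8, 3, 4, 1]  # drinks/day data from the earlier slide
mean = sum(data) / len(data)

# 1. Find the difference between each data point and the mean.
differences = [x - mean for x in data]

# 2. Square the differences, and add them up.
sum_of_squares = sum(d ** 2 for d in differences)

# 3. Divide by one less than the number of data points.
sample_variance = sum_of_squares / (len(data) - 1)

print(round(sample_variance, 4))   # 7.3
print(statistics.variance(data))   # standard-library cross-check
```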

Page 26

Variance

• If measuring variance of population, denoted by σ² (“sigma-squared”).
• If measuring variance of sample, denoted by s² (“s-squared”).
• Measures average squared deviation of data points from their mean.
• Highly affected by outliers. Best for symmetric data.

Page 27

Standard deviation

• Sample standard deviation is square root of sample variance, and so is denoted by s.
• Units are the original units.
• Roughly, measures the average deviation of data points from their mean.
• Also, highly affected by outliers.

Page 28

What is the variance or standard deviation?

[Figure: Fastest Ever Driving Speed (x-axis: KPH, 120 to 270), plotted by Sex (female, male)]

Page 29

Variance or standard deviation

Sex     N    Mean    Median  TrMean  StDev  SE Mean
female  126  152.05  150.00  151.39  18.86  1.68
male    100  177.98  183.33  176.04  28.98  2.90

Sex     Minimum  Maximum  Q1      Q3
female  108.33   200.00   141.67  163.75
male    125.00   270.00   158.33  197.92

Females: s = 18.86 kph and s² = 18.86² = 355.7 kph²

Males: s = 28.98 kph and s² = 28.98² = 839.8 kph²
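The variances quoted above follow directly from the reported standard deviations, and the arithmetic is easy to verify in Python. The sketch also recomputes the "SE Mean" column under the assumption that it is s divided by the square root of N. The slides do not define that column, so treat this as my reading of the output.

```python
import math

# (N, StDev, SE Mean) as reported in the summary table.
reported = {
    "female": (126, 18.86, 1.68),
    "male": (100, 28.98, 2.90),
}

computed = {}
for group, (n, s, se_reported) in reported.items():
    variance = round(s ** 2, 1)            # s squared, in kph squared
    se_mean = round(s / math.sqrt(n), 2)   # assumed definition of "SE Mean"
    computed[group] = (variance, se_mean)
    print(group, variance, se_mean, se_reported)
```

Under that assumption the recomputed standard errors agree with the reported 1.68 and 2.90.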

Page 30

The most appropriate measure of variability depends on … the shape of the data’s distribution.

Page 31

Choosing Appropriate Measure of Variability

• If data are symmetric, with no serious outliers, use range and standard deviation.
• If data are skewed, and/or have serious outliers, use IQR.

Page 32

Probability

The “p” in p-value

Page 33

Examples: Coin Flips

Flipper    #(Flips)  #(Heads)  P(H)
Ben        4,040     2,048     0.5069
Christina  24,000    12,012    0.5005
Roger      10,000    5,067     0.5067
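The P(H) column is the observed proportion of heads, #(Heads) divided by #(Flips). A minimal Python check of the table (variable names are my own):

```python
# (number of flips, number of heads) from the table.
flips = {
    "Ben": (4_040, 2_048),
    "Christina": (24_000, 12_012),
    "Roger": (10_000, 5_067),
}

# Observed proportion of heads for each person.
proportions = {name: heads / n for name, (n, heads) in flips.items()}

for name, p in proportions.items():
    print(f"{name}: {p:.4f}")
```

All three proportions sit close to 0.5.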

Page 34

Probability Concepts

Randomness, Independence, Multiplication Rule

Page 35

Thought Question 1

What does it mean to say that a deck of cards is “randomly” shuffled?
• Every ordering of the cards is equally likely
• There are about 8 followed by 67 zeros (52! ≈ 8 × 10^67) possible orderings of a 52 card deck
• Every card has the same probability to end up in any specified location
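The count of orderings is 52 factorial, which Python can compute exactly. A quick sketch to confirm the "8 followed by 67 zeros" order of magnitude:

```python
import math

orderings = math.factorial(52)   # number of ways to order 52 distinct cards

print(orderings)
print(f"{orderings:.2e}")        # 8.07e+67
print(len(str(orderings)))       # 68 digits, i.e. on the order of 10^67
```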

Page 36

The question continued

A 52 card deck is randomly shuffled

How often will the tenth card down from the top be a Club?
• 1/4 of the time
• Every card has the same chance to end up 10th. There are 13 clubs and 13 / 52 = 1/4

Page 37

More of the question

A deck has three cards, labeled A, B, C

After a random shuffle, cards are turned over one at a time.

How often is the A card the second card that’s turned over?
• 1/3: each card has the same chance to end up in a specific position
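With only three cards, every possible shuffle can be listed, so the 1/3 answer can be confirmed by direct counting. A small sketch, assuming (as the slide does) that all orderings are equally likely:

```python
from fractions import Fraction
from itertools import permutations

orderings = list(permutations("ABC"))             # all 6 equally likely shuffles
a_second = [o for o in orderings if o[1] == "A"]  # shuffles with A turned over second

probability = Fraction(len(a_second), len(orderings))
print(len(orderings), len(a_second), probability)  # 6 2 1/3
```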

Page 38

Thought Question 2

A fair die is rolled many times.

How often will a “1” be the result?
• About 1/6 of the time, but there will be some sampling error

How does increasing the number of rolls affect the difference between the sample fraction of “1”s and 1/6?
• The difference is likely to get smaller as n increases, since the margin of error goes down
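A simulation gives a feel for how the sample fraction of 1s settles toward 1/6 as the number of rolls grows. This is a sketch only: the seed, the roll counts, and the helper name `fraction_of_ones` are arbitrary choices of mine. The difference is not guaranteed to shrink at every step, only to be small with high probability for large n.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def fraction_of_ones(n_rolls):
    """Roll a fair six-sided die n_rolls times; return the fraction of 1s."""
    ones = sum(1 for _ in range(n_rolls) if random.randint(1, 6) == 1)
    return ones / n_rolls

for n in (100, 10_000, 100_000):
    frac = fraction_of_ones(n)
    # Columns: number of rolls, sample fraction, distance from 1/6.
    print(n, round(frac, 4), round(abs(frac - 1 / 6), 4))
```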

Page 39

Does a prior event matter?

A fair coin is flipped four times.
First three flips are heads
What’s the probability that the fourth flip is heads?
• 1/2 assuming flips are independent
• Results of first three flips don’t matter

Page 40

Does prior event matter?

Ten cards are drawn without replacement from 52 card deck.
2 Aces are among these 10 cards
What’s the probability the eleventh card is an Ace?
• 2/42 = 1/21
• After ten draws, 42 cards remain, 2 of them are Aces
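The calculation here is a simple ratio of Aces remaining to cards remaining. The sketch below reproduces it with exact fractions and, for contrast, shows the probability when nothing is known about the first ten cards. That comparison is my addition, included to make the effect of the prior information visible.

```python
from fractions import Fraction

aces_left = 4 - 2      # 2 of the 4 Aces are among the first ten cards
cards_left = 52 - 10   # ten cards have been drawn without replacement

p_eleventh_is_ace = Fraction(aces_left, cards_left)
print(p_eleventh_is_ace)   # 1/21

# With no information about the first ten cards, every card is equally
# likely to be eleventh, so the chance of an Ace is 4/52.
p_no_information = Fraction(4, 52)
print(p_no_information)    # 1/13
```

Unlike the coin flips, draws without replacement are not independent, so knowing the earlier cards changes the answer.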