Biostatistics Introduction & Data Presentation (Short …gchang.people.ysu.edu/class/mph/note/07_1_Week2.pdf · Biostatistics Introduction & Data Presentation (Short Version) Intro
Biostatistics Introduction & Data Presentation (Short Version)
Intro - 1
1
Biostatistics
Andy Chang
Youngstown State University
2
Statistics in Broader Sense
Statistics is a field of study concerned with
1) data collection, [Producing Data]
2) organization, summarization, examination, and providing an overview of the general features of data, [Exploring Data]
3) and the drawing of inferences about a body of data (population) based on the properties of a part of the data (sample) observed. [Statistical Inference]
3
In health and medical (or clinical) studies, researchers investigate a sample of subjects to understand the effectiveness of a treatment or an intervention on a target population.
Therefore, statistical inference is very important in health and medical research.
4
Public health is fundamentally concerned with preventing disease, disability, and premature death in a human population or community.
Therefore, statistical inference is very important in public health research.
Goal 2 of the Healthy People 2010: Eliminate Health Disparities
5
Basic Terms in Statistics
Individuals (subjects, experimental units): the entities on which data are collected.
Variable: a characteristic of interest for the individual which takes on different values for different individuals.
Purpose of Statistics
Examine a variable
Examine correlation between two or more variables
6
Variable Types
Quantitative Variables (numeric) [height, number of subscriptions, ...]
– Continuous: a variable that has an uncountable number of possible values. (measurements)
– Discrete: a variable that has a countable number of possible values. (counts)
Qualitative (Categorical) Variables [hair color, gender, ...]
7
Measurement Scales
• Nominal: consists of labels, names or categories.
• Ordinal: data for which the order or rank is meaningful.
• Interval: numerical data for which arithmetic operations are meaningful.
• Ratio: data for which the ratio of two values is meaningful.
8
Producing Data
9
Data in Public Health:
Vital Statistics and the Census
Public Health Surveillance
Survey
Registries
Epidemic Investigations
Research
Program Evaluations
Data Presentation
What type of statistical technique is appropriate for data presentation?
Categorical variable?
Quantitative variable?
12
Data Sheet (Raw data)
ID   Height(in)   Weight(lb)   BirthMonth   Exp.   Gender
1    6            135          4            H      F
2    63           119          9            H      F
3    72           175          11           T      M
4    60           106          9            H      F
5    65           135          8            T      F
6    72           170          10           H      M
7    64           180          8            H      F
8    71           205          10           H      M
...
13
(A complete list)
ID   Height   Weight   BirthMonth   Exp.   Gender
1    6        135      4            H      F
2    63       119      9            H      F
3    72       175      11           T      M
4    60       106      9            H      F
5    65       135      8            T      F
6    72       170      10           H      M
7    64       180      8            H      F
8    71       205      10           H      M
9    75       195      6            T      M
10   71       185      8            H      M
11   71       182      6            T      M
12   65       108      8            T      F
13   73       150      4            H      M
14   67       128      6            T      F
15   74       175      6            H      M
16   66       160      9            H      F
17   65       143      9            T      F
18   72       190      11           T      M
19   64       180      2            H      M
20   61       195      5            T      M
21   72       220      7            H      M
22   69       285      7            H      M
14
Grouping and Displaying A Categorical Variable
15
Frequency Table and Charts(One Categorical Variable)
Class    Frequency   Relative Frequency
Female   9           9/22 = .409 = 40.9%
Male     13          13/22 = .591 = 59.1%
Total    22          100%
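The frequency table above can be reproduced with a few lines of Python. This is only a sketch: the `genders` string is transcribed from the Gender column of the complete data sheet (IDs 1 to 22), and the helper name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Gender column of the 22-student data sheet, in ID order (transcribed by hand).
genders = list("FFMFFMFMMMMFMFMFFMMMMM")

def frequency_table(values):
    """Return {class: (frequency, relative frequency)} for one categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {cls: (n, n / total) for cls, n in counts.items()}

table = frequency_table(genders)
for cls, (freq, rel) in sorted(table.items()):
    print(f"{cls}: frequency = {freq}, relative frequency = {rel:.3f}")
```

The relative frequencies always add up to 1 because every observation falls in exactly one class.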
Classes: Categories for grouping data.
Frequency (class frequency): The number of data values in a class.
Relative frequency: The ratio of the frequency of a class to the total number of pieces of data.
Frequency distribution: A listing of classes and their frequencies.
Relative Frequency distribution: A listing of classes and their relative frequencies.
Upper class limit: The largest value that can go in a class.
Lower class limit: The smallest value that can go in a class.
Class width: The difference between the lower class limit of the given class and the lower class limit of the next higher class.
Class midpoint (class mark): The midpoint of a class.
22
Guidelines for grouping data: (for quantitative variable)
There should be between five and twenty classes.
Each piece of data must belong to one, and only one, class. (Mutually Exclusive)
Whenever feasible, all classes should have the same width.
23
To build a Frequency Table:
Find the range of the data: Range = Largest value – smallest value
Use the range and try different class widths to determine how many classes you need to make a frequency table or histogram.
Student data example:
Range = 285 – 106 = 179
179/20 ≈ 9
If using a class width of 20, there’ll be about 9 classes, which is good.
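As a rough check of the student data example, the sketch below computes the range of the Weight column and groups the weights into classes of width 20. The lower limit of the first class (`start = 100`) is an assumption, chosen to match the 100 to 300 axis of the histogram slides; it is not stated in the notes.

```python
import math

# Weight(lb) column of the complete data sheet, in ID order.
weights = [135, 119, 175, 106, 135, 170, 180, 205, 195, 185, 182,
           108, 150, 128, 175, 160, 143, 190, 180, 195, 220, 285]

data_range = max(weights) - min(weights)          # largest value minus smallest value
width = 20                                        # class width tried in the example
approx_classes = math.ceil(data_range / width)    # rough number of classes needed

start = 100  # assumed lower class limit of the first class
counts = {}
for w in weights:
    lower = start + ((w - start) // width) * width   # lower limit of the class containing w
    counts[lower] = counts.get(lower, 0) + 1

print(f"range = {data_range}, about {approx_classes} classes")
for lower in range(start, max(weights) + 1, width):
    print(f"{lower}-{lower + width - 1}: {counts.get(lower, 0)}")
```

Each weight lands in exactly one class, so the classes are mutually exclusive as the guidelines require.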
25
Histogram (SPSS)
26
Polygon (SPSS)
27
Polygon (SPSS)
28
[Figure: Cumulative R. F. Histogram; horizontal axis from 100 to 300 in steps of 20, vertical axis marked at 50% and 100%]
29
[Figure: Cumulative R. F. Polygon (Ogive); horizontal axis from 90 to 290 in steps of 20, vertical axis marked at 50% and 100%]
30
What to observe in Histograms?
Outliers: observations that stand out from the rest for some reason.
Center: the “middle” of the data.
Spread: the range; the extent of the data; how far the values are from each other.
Shape: distribution pattern. [Skewness, symmetry, uniform, Normal, ...]
31
[Figure: example distribution shapes]
Symmetric (Bell) shape
Uniform
Skewed to the left, or negatively skewed
Bimodal
Skewed to the right, or positively skewed
32
Histogram & Density Curve
Density function, f(x): a smooth curve that describes the distribution.
Use a mathematical model to describe the variable.
[Figure: histogram (vertical axis: Percent) with a density curve drawn over it]
33
Stemplots (or Stem-and-leaf plots)
-- leading digits are called stems
-- final digits are called leaves
34
Example: (number of hysterectomies performed by 15 male doctors)
49
Incorrect and Misleading Chart
50
Type of Statistical Studies
Observational Study: conditions to which subjects are exposed are not controlled by the investigator. (no attempt is made to control or influence the variables of interest)
Experimental (Controlled) Study: conditions to which subjects are exposed are controlled by the investigator. (treatments are used in order to observe the response) (Randomization, Replications)
51
Results from observing behavior and outcomes from the use of medicine for 200 randomly selected patients. (Patients chose their medicine)
             Hypertension
Treatment    Yes    No     Total
Drug A       44     56     100
Drug B       29     71     100
Total        73     127    200
Drug A: 44/100 = 44%
Drug B: 29/100 = 29%
52
Simpson’s Paradox
Hypertension
             Below 65              65+
Treatment    Yes   No    Total     Yes   No    Total
Drug A       5     18    23        39    38    77
Drug B       17    60    77        12    11    23
Below 65: Drug A: 5/23 = 22%; Drug B: 17/77 = 22%
65+: Drug A: 39/77 = 51%; Drug B: 12/23 = 52%
* Older patients prefer Drug A
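The reversal can be checked numerically. The sketch below uses the (Yes, Total) hypertension counts from the two tables; the dictionary layout and the names `counts` and `rate` are illustrative only.

```python
# (hypertension "Yes" count, group total) by treatment and age group.
counts = {
    "Drug A": {"below 65": (5, 23), "65+": (39, 77)},
    "Drug B": {"below 65": (17, 77), "65+": (12, 23)},
}

def rate(yes, total):
    """Proportion of patients with hypertension in a group."""
    return yes / total

overall = {}
for drug, strata in counts.items():
    yes = sum(y for y, _ in strata.values())
    total = sum(t for _, t in strata.values())
    overall[drug] = rate(yes, total)
    print(f"{drug}: overall {overall[drug]:.0%}")
    for age, (y, t) in strata.items():
        print(f"  {age}: {rate(y, t):.0%}")
```

Overall, Drug A looks much worse than Drug B, yet within each age group the two rates are nearly the same. The gap in the combined table comes from most Drug A patients being in the 65+ group.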
53
Confounding variables
[Diagram: Treatment 1 & Treatment 2 -> (Cause?) -> Patient’s Survival, with Patient’s Age & Health Condition as confounding variables]
54
Confounding Effect
Variables, whether part of a study or not, are said to be confounded when their effects on the outcome cannot be distinguished from each other.
Age may affect the reaction to a drug and may also affect the choice of drug.
Descriptive Statistics (Short Version)
Descriptive Stat - 1
1
Example: Birth weights (in lb) of 5 babies born from two groups of women under different care programs.
Group 1: 7, 6, 8, 7, 7
Group 2: 3, 4, 8, 9, 11
[Figure: plot of the two groups on a common axis (ticks at 6, 7, 8)]
2
Describe Distribution with Numbers
Numerical Summary Measures
• Measure of Center
• Measure of Variation
• Measure of Position
3
Measure of Central Tendency
Mean: the average value of the data.
If n observations are denoted by x1, x2, ..., xn, their (sample) mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
4
Example: Birth weights (in lb) of 5 babies born from a group of women under certain diet.
7, 6, 8, 7, 7
Sol:
$$\text{mean} = \frac{7 + 6 + 8 + 7 + 7}{5} = \frac{35}{5} = 7$$
[near the center of the data set]
5
Median: of a data set is
the data value exactly in the middle of its ordered list if the number of pieces of data is odd,
the mean of the two middle data values in its ordered list if the number of pieces of data is even.
[median is not influenced by outliers and is best for non-symmetric distribution]
6
Example: (number of hysterectomies performed by 15 doctors)
27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20,
Does the mother’s diet program affect the birth weights of babies?
Diet I: mean = 7, s = 0.71
Diet II: mean = 7, s = 3.39
What is the standard deviation of the weights of babies from the sample of mothers who received diet program I?
$$s = \sqrt{\tfrac{1}{2}} = 0.71$$
17
If n observations are denoted by x1, x2, ..., xn, their variance and standard deviation are
Sample Variance: (unbiased estimator for variance of an infinite population.)
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Sample Standard Deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Sample Mean:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
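The sample formulas can be sketched in Python and checked against the birth-weight example (Group 1 and Group 2, five babies each). The function names are illustrative; the standard library's `statistics.stdev` uses the same n − 1 divisor.

```python
import math

def sample_mean(xs):
    """Sum of the observations divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

diet_1 = [7, 6, 8, 7, 7]     # Group 1 birth weights (lb)
diet_2 = [3, 4, 8, 9, 11]    # Group 2 birth weights (lb)

for name, data in (("Diet I", diet_1), ("Diet II", diet_2)):
    print(f"{name}: mean = {sample_mean(data):.0f}, s = {sample_sd(data):.2f}")
```

Both groups have mean 7, but the second group is far more spread out, which is what the larger standard deviation reports.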
18
Population Parameters
If x1, x2, ..., xN are all the N observations in a finite population, their mean μ, variance σ², and standard deviation σ are
Population Mean:
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
Population Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Population Standard Deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
19
About s (sample standard deviation) :
s measures the spread around the mean.
The larger s is, the more spread out the data are.
If s = 0, then all the observations must be equal.
s is strongly influenced by outliers.
20
The Use of Mean and Standard Deviation
Describe distribution
Understand the center and the spread of the distribution
21
Bone Density Data
Unit: mg/ml
          Mean, x̄    Standard Deviation, s
Female    102.2      25
Male      118.4      20
Many distributions can be described by a mathematical function with specific parameters, such as mean and standard deviation.
Example: Normal Distribution (Bell-shaped)
[Figure: bell-shaped curve centered at μ, with σ marking its spread]
23
Empirical Rule
Properties of a symmetric and bell-shaped (Normal) distribution:
The distribution is symmetric about its mean (μ),
68% of the area is between μ − σ and μ + σ,
95% of the area is between μ − 2σ and μ + 2σ,
99.7% of the area is between μ − 3σ and μ + 3σ.
24
Heart rates for a certain population at a certain condition follow a bell-shaped symmetric distribution with mean 70 and standard deviation 2.
What percentage of people in this population will have heart rates between 66 and 74?
95%, since 66 = 70 − 2 × 2 and 74 = 70 + 2 × 2 are two standard deviations from the mean.
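The percentages in the empirical rule are rounded areas under a Normal curve. As a hedged check, the sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_within` is made up for illustration.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under a Normal(mu, sigma) curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} s.d.: {area_within(k):.4f}")

# Heart-rate example: mean 70, standard deviation 2, interval 66 to 74.
heart = NormalDist(70, 2)
p_66_74 = heart.cdf(74) - heart.cdf(66)
print(f"P(66 < heart rate < 74) = {p_66_74:.4f}")
```

The exact area within two standard deviations is slightly above 95%, which is why the rule is described as empirical (approximate).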
25
Chebychev’s Rule (Chebychev’s inequality)
At least 1 − 1/k² of the data in any data set lie within k standard deviations of their mean (for k > 1).
26
Example: Heart rates for asthmatic patients in a state of respiratory arrest have a mean of 140 beats per minute and a standard deviation of 35.5 beats per minute. What percentage of this type of patient have heart rates within two standard deviations of the mean in a state of respiratory arrest?
It will be at least 75%, because k = 2, and 1 − 1/2² = 3/4 = 75%.
27
Heart rates example: mean = 140, s.d. = 35.5
k = 2
140 − 2 × 35.5 = 69; 140 + 2 × 35.5 = 211
At least 75% of the heart rates lie between 69 and 211, since 1 − 1/2² = 75%.
28
What about within three standard deviations?
Heart rates example: mean = 140, s.d. = 35.5
k = 3
140 − 3 × 35.5 = 33.5; 140 + 3 × 35.5 = 246.5
At least ?%
At least 89% of the heart rates lie between 33.5 and 246.5, since 1 − 1/3² = 8/9 ≈ 89%.
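Chebychev's rule is simple enough to tabulate directly. The sketch below evaluates the bound and the matching interval for the heart-rate example (mean 140, s.d. 35.5); both function names are illustrative.

```python
def chebyshev_bound(k):
    """Minimum fraction of any data set within k standard deviations of its mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

def interval(mean, sd, k):
    """Endpoints mean - k*sd and mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

for k in (2, 3):
    low, high = interval(140, 35.5, k)
    print(f"k = {k}: at least {chebyshev_bound(k):.0%} between {low} and {high}")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, so it is more conservative (75% rather than 95% for k = 2).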
29
Measure of Position
Standard Score, Percentile, Quartile
30
Z-score (Standard Score)
Population z-score
If x is an observation from a distribution that has mean μ and standard deviation σ, the standardized value of x is
z-score of x:
$$z = \frac{x - \mu}{\sigma} = \frac{x - \text{mean}}{\text{standard deviation}}$$
“μ + 3σ” has a z-score 3, since it is 3 s.d. from mean.
31
If a distribution has a mean 10 and a s.d. 2, the value 7 has a z-score –1.5.
z-score = (7 – 10)/2 = – 1.5.
[Figure: number line from 6 to 14 with mean 10; the value 7 lies 1.5 s.d. below the mean]
32
Sample z-score
$$z = \frac{x - \bar{x}}{s}$$
Example: If the mean of a random sample is 5 and the standard deviation is 2, what would be the sample z-score of the value 6?
$$\bar{x} = 5,\quad s = 2,\quad x = 6$$
$$z = \frac{6 - 5}{2} = \frac{1}{2} = 0.5$$
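Both z-score examples follow the same arithmetic, whether the mean and standard deviation come from a population or a sample. A minimal sketch (the function name `z_score` is illustrative):

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(z_score(7, 10, 2))   # population example: mean 10, s.d. 2, value 7
print(z_score(6, 5, 2))    # sample example: mean 5, s = 2, value 6
```

A negative z-score means the value is below the mean, a positive one that it is above.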
33
Example: Bone Mineral Density
The WHO Working Group defines osteoporosis according to measurements of bone mineral density (BMD) using dual-energy X-ray absorptiometry (DEXA).
Thus osteoporosis is defined as a bone density (T score) at or below 2.5 standard deviations below normal peak values for young adults.
34
These criteria were initially established for the assessment of osteoporosis in Caucasian women.
BMD reports may include a “Z score” which is the number of standard deviations by which the subject of interest differs from the mean for their age.
DEXA BMD Values                                           Definition
T score > −1.0 SD                                         Normal bone mineral density
T score between −1.0 and −2.5 SD                          Osteopaenia
T score < −2.5 SD                                         Osteoporosis
T score < −2.5 SD with 1 or more fragility fractures      Severe osteoporosis
35
Quartiles: (Measure of Position)
• The first quartile, Q1, or 25th percentile, is the median of the lower half of the list of ordered observations.
• The third quartile, Q3, or 75th percentile, is the median of the upper half of the list of ordered observations.
Measure of spread:
Interquartile range (IQR) = Q3 − Q1
IQR = 72 − 64 = 8
[Figure: axis marked at Q1 = 64, median = 68, and Q3 = 72]
38
The five-number summary
• Minimum value
• Q1
• Median
• Q3
• Maximum value
39
Example: (data sheet without outlier “6”)
60, 61, 63, 64, 64, 65, 65, 65, 66, 67, 69, 71, 71, 71, 72, 72, 72, 72, 73, 74, 75
Min = 60, Q1 = 64.5, Median = 69, Q3 = 72, Max = 75.
[Figure: boxplots of HEIGHT for N = 21 (without the outlier) and N = 22 (with the outlier, the point labeled 1)]
With 6 in the data:
6, 60, 61, 63, 64, 64, 65, 65, 65, 66, 67, 69, 71, 71, 71, 72, 72, 72, 72, 73, 74, 75
Q1 = 64, Median = 68, Q3 = 72
IQR = 72 − 64 = 8
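The quartiles in this example follow the "median of each half" definition given earlier. The sketch below implements that definition only; statistical packages may use other quartile conventions and give slightly different values. The function name `five_number_summary` is illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with Q1 and Q3 as medians of the lower and upper halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]               # values below the middle position
    upper = xs[half + n % 2:]       # values above it (the middle value is skipped when n is odd)
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

heights = [60, 61, 63, 64, 64, 65, 65, 65, 66, 67, 69,
           71, 71, 71, 72, 72, 72, 72, 73, 74, 75]

print(five_number_summary(heights))           # without the outlier
print(five_number_summary([6] + heights))     # with the outlier 6
```

Adding the outlier 6 changes the minimum drastically but moves the median and quartiles only slightly.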
41
Inner and outer fences for outliers
• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = Q1 − 1.5 × IQR) and at a distance of 1.5 IQR above Q3 (upper inner fence = Q3 + 1.5 × IQR).
• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = Q1 − 3 × IQR) and at a distance of 3 IQR above Q3 (upper outer fence = Q3 + 3 × IQR).
42
IQR = 72 − 64 = 8; Q1 = 64; Q3 = 72
• The inner fences are located at a distance of 1.5 IQR below Q1 (lower inner fence = 64 − 1.5 × 8 = 52) and at a distance of 1.5 IQR above Q3 (upper inner fence = 72 + 1.5 × 8 = 84).
• The outer fences are located at a distance of 3 IQR below Q1 (lower outer fence = 64 − 3 × 8 = 40) and at a distance of 3 IQR above Q3 (upper outer fence = 72 + 3 × 8 = 96).
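The fence calculations can be wrapped in a small helper and used to classify a value. This sketch follows the fence definitions above together with the mild and extreme outlier rule from the later slide; the function names and dictionary keys are illustrative.

```python
def fences(q1, q3):
    """Inner fences at 1.5 IQR and outer fences at 3 IQR beyond the quartiles."""
    iqr = q3 - q1
    return {
        "lower outer": q1 - 3 * iqr,
        "lower inner": q1 - 1.5 * iqr,
        "upper inner": q3 + 1.5 * iqr,
        "upper outer": q3 + 3 * iqr,
    }

def classify(x, q1, q3):
    """Label x as an extreme outlier, a mild outlier, or not an outlier."""
    f = fences(q1, q3)
    if x < f["lower outer"] or x > f["upper outer"]:
        return "extreme outlier"
    if x < f["lower inner"] or x > f["upper inner"]:
        return "mild outlier"
    return "not an outlier"

print(fences(64, 72))
print(classify(6, 64, 72))    # the suspicious height of 6 inches
print(classify(60, 64, 72))   # the smallest of the remaining heights
```

With Q1 = 64 and Q3 = 72, the height of 6 falls below the lower outer fence of 40.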
43
[Figure: boxplot of HEIGHT (N = 22) with the IQR and the two inner fences marked]
Q1 = 64; Q3 = 72; IQR = 72 − 64 = 8
LIF: 64 − 1.5 × 8 = 52
UIF: 72 + 1.5 × 8 = 84
44
[Figure: boxplot of HEIGHT (N = 22) with the IQR, the inner fences, and the outer fences marked]
LOF: 64 − 3 × 8 = 40
UOF: 72 + 3 × 8 = 96
45
Mild and Extreme outliers
Data values falling between the inner and outer fences are considered mild outliers.
Data values falling outside the outer fences are considered extreme outliers.
When outliers exist, the whiskers extend to the smallest and largest data values within the inner fences.
46
Side-by-side Box Plot
[Figure: side-by-side boxplots of HEIGHT by sex (Female, Male), with outlying cases labeled]
47
Remarks:
If the distribution of the data is symmetric, then the mean and median will be about the same.
The five-number summary is best for non-symmetric data.
The median, quartiles, and interquartile range are not influenced by outliers.
The mean and standard deviation are most appropriate to use only if the data are symmetric, because both of these measures are easily influenced by outliers.
48
Boxplot
For the following data:
13 72 78 40 50 56 50 52 57 69 130 142 51 52
Find the five-number summary & IQR
Make a boxplot
Find the 60th percentile.
Probability (Short Version)
Probability - 1
1
Probability and Counting Rules
A researcher claims that 10% of a large population have disease H.
A random sample of 100 people is taken from this population and examined.
If 20 people in this random sample have the disease, what does it mean? How likely would this happen if the researcher is right?
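One way to put a number on "how likely" is a binomial model. This is an assumption not stated on the slide: each of the 100 sampled people is treated as having the disease independently with probability 0.10, which is reasonable when the population is large. The sketch computes the chance of seeing 20 or more cases under the researcher's claim; the function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_tail(k, n, p):
    """P(at least k successes in n independent trials)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

# Assumed model: 100 independent subjects, each with a 10% chance of disease H.
p_at_least_20 = binom_tail(20, 100, 0.10)
print(f"P(20 or more of 100 have disease H) = {p_at_least_20:.4f}")
```

A very small probability here would suggest that either a rare event occurred or the 10% claim is too low.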
2
Sample Space and Probability
Random Experiment (Probability Experiment): an experiment whose outcomes depend on chance.
Sample Space (S): collection of all possible outcomes in a random experiment.
Event (E): a collection of outcomes of interest in a random experiment.
3
Sample Space and Event
Sample Space:
S = {Head, Tail}
S = {Life span of a human} = {x | x ≥ 0, x ∈ R}
Event:
E = {Head}
E = {Life span of a human is less than 3 years}
4
A Simple Example
What’s the probability of getting a head on the toss of a single fair coin? Use a scale from 0 (no way) to 1 (sure thing).
So toss a coin twice. Do it! Did you get one head & one tail? What’s it all mean?
5
Definition of Probability
A rough definition: (frequentist definition)
The probability of a certain outcome in a random experiment is the proportion of times that this outcome would occur in a very long series of repetitions of the random experiment.
[Figure: Total Heads / Number of Tosses plotted against Number of Tosses]
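The long-run behavior in the frequentist definition can be imitated with a simulation. The sketch below tosses a simulated fair coin and tracks Total Heads / Number of Tosses; the seed, the number of tosses, and the function name are arbitrary choices made for illustration.

```python
import random

def running_proportion(n_tosses, p_head=0.5, seed=0):
    """Proportion of heads after each toss of a simulated coin."""
    rng = random.Random(seed)            # seeded so the run is repeatable
    heads = 0
    proportions = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < p_head   # True counts as 1 head
        proportions.append(heads / toss)
    return proportions

props = running_proportion(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>6} tosses: proportion of heads = {props[n - 1]:.4f}")
```

The proportion can wander noticeably over the first few tosses and settles near 0.5 only as the number of tosses grows.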
When selecting one individual at random from a population, the probability distribution and the relative frequency distribution are the same.
16
Probability for the Discrete Case
If an individual is randomly selected from this group of 300, what is the probability that this person visited a doctor 3 times?
It would be an empirical probability distribution if the sample of 300 individuals is used to understand a large population.
18
Properties of Probability
• Probability is always a value between 0 and 1.
• Total probability (all outcomes together) equals 1.
• Probability of either one of the disjoint events A or B to occur is the sum of their individual probabilities. P(A or B) = P(A) + P(B)
19
Complementation Rule
For any event E,
P(E does not occur) = 1 – P(E)
Complement of E = Ē
* Some places use E^c or E′
[Venn diagram: event E with probability P(E) and its complement Ē with probability P(Ē)]
20
Complementation Rule
An unbalanced coin has a probability of 0.7 of turning up Head on each toss. What is the probability of not getting a Head on a random toss?
P(not getting Head) = 1 – 0.7= 0.3
21
Complementation Rule
If the chance that a randomly selected individual living in community A has disease H is .001, what is the probability that this person does not have disease H?
P(having disease H) = .001
P(not having disease H) = 1 – P(having disease H) = 1 – 0.001 = 0.999
22
Birthday Problem
In a group of 23 randomly selected people, what is the probability that at least two people have the same birth date? (Assume there are 365 days in a year.)
P(at least two people have the same birth date) [Too hard !!!]
= 1 – P(everybody has different birth date)
= 1 – [365 × 364 × … × (365 − 23 + 1)] / 365^23
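The complement calculation is easy to carry out numerically. The sketch below multiplies the 23 factors one at a time, which avoids computing the huge numbers 365 × 364 × … and 365^23 separately; the function name is illustrative.

```python
def p_shared_birthday(n, days=365):
    """P(at least two of n people share a birth date), assuming equally likely days."""
    p_all_different = 1.0
    for i in range(n):
        p_all_different *= (days - i) / days   # person i+1 avoids the i dates already taken
    return 1 - p_all_different

print(f"n = 23: {p_shared_birthday(23):.4f}")
```

With 23 people the probability is already slightly above one half, which is the usual surprise of the birthday problem.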
23
Intersection of events: A ∩ B <=> A and B
Union of events: A ∪ B <=> A or B
S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 6}
Example: A ∩ B = {3}
Example: A ∪ B = {1, 2, 3, 6}
[Venn Diagram (with elements listed): sample space S containing events A and B]
24
Venn Diagram (with counts)
Given total of 100 subjects
A = Smokers, n(A) = 50
B = Lung Cancer, n(B) = 25
Joint Event: A ∩ B, n(A ∩ B) = 20
[Venn diagram of A and B: 30 in A only, 20 in A ∩ B, 5 in B only, 45 in neither]
n(A ∪ B) = ?
n(A ∪ B) = 55
Probability (Short Version)
Probability - 5
25
Venn Diagram (with relative frequencies)
Given a sample space
A = Smokers, P(A) = .50
B = Lung Cancer, P(B) = .25
Joint Event: A ∩ B, P(A ∩ B) = .20
[Venn diagram of A and B: .30 in A only, .20 in A ∩ B, .05 in B only, .45 in neither]
P(A ∪ B) = ?
P(A ∪ B) = .55
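The value .55 can be obtained either by adding the three regions inside the circles (.30 + .20 + .05) or by the general addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The addition rule is a standard result that is not written on the slide; the sketch below checks that both routes agree for the smoker and lung cancer counts.

```python
# Counts from the Venn diagram with 100 subjects.
n_total = 100
n_a = 50     # smokers
n_b = 25     # lung cancer
n_ab = 20    # smokers with lung cancer

# General addition rule: subtract the overlap so it is not counted twice.
n_union = n_a + n_b - n_ab
p_union = n_union / n_total

# Same answer from the three disjoint regions: A only, both, B only.
n_union_by_regions = (n_a - n_ab) + n_ab + (n_b - n_ab)

print(n_union, p_union, n_union_by_regions)
```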
26
Contingency Table
                 Cancer, B    No Cancer, B^c    Total
Smoke, A         20           30                50
Not Smoke, A^c   5            45                50
Total            25           75                100
[Venn diagram of A and B]
27
Conditional Probability
The conditional probability of event A to occur given event B has occurred (or given the condition B) is denoted as P(A|B) and is, if P(B) is not zero,
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{or} \quad P(A|B) = \frac{n(A \cap B)}{n(B)}$$
where n(E) = # of equally likely outcomes in E.
28
Conditional Probability
                 Cancer, C    No Cancer, C'    Total
Smoke, S         20           30               50
Not Smoke, S'    5            45               50
Total            25           75               100
P(C|S) = 20/50 = .4
P(C|S') = 5/50 = .1
$$P(A|B) = \frac{n(A \cap B)}{n(B)}$$
29
Conditional Probability
                 Cancer, C          No Cancer, C'       Total
Smoke, S         20 (.2)            30 (.3)             50, P(S) = (.5)
Not Smoke, S'    5 (.05)            45 (.45)            50, P(S') = (.5)
Total            25, P(C) = (.25)   75, P(C') = (.75)   100 (1.0)
P(C|S) = .2/.5 = .4
P(C|S') = .05/.5 = .1
What is P(C|S) / P(C|S') = ?
P(C|S) / P(C|S') = 4 (Relative Risk)
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
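The conditional probabilities and the relative risk can be read straight off the contingency table. In the sketch below the table is stored as a dictionary keyed by (smoking status, cancer status); this layout and the function name `conditional` are illustrative choices.

```python
# Counts from the smoking and cancer contingency table (100 subjects).
table = {
    ("S", "C"): 20, ("S", "C'"): 30,
    ("S'", "C"): 5, ("S'", "C'"): 45,
}

def conditional(table, cancer, given_smoking):
    """P(cancer status | smoking status) = n(both) / n(smoking status)."""
    n_given = sum(n for (s, _), n in table.items() if s == given_smoking)
    return table[(given_smoking, cancer)] / n_given

p_c_given_s = conditional(table, "C", "S")
p_c_given_not_s = conditional(table, "C", "S'")
relative_risk = p_c_given_s / p_c_given_not_s

print(f"P(C|S) = {p_c_given_s}, P(C|S') = {p_c_given_not_s}")
print(f"relative risk = {relative_risk:.1f}")
```

A relative risk of 4 means that, in this table, cancer is four times as common among smokers as among non-smokers.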
30
Independent Events
Events A and B are independent if
P(A|B) = P(A)
or P(B|A) = P(B)
or P(A and B) = P(A) · P(B)
31
Example
If a balanced die is rolled twice, what is the probability of having two 6’s?
6₁ = the event of getting a 6 on the 1st trial
6₂ = the event of getting a 6 on the 2nd trial
P(6₁) = 1/6, P(6₂) = 1/6
6₁ and 6₂ are independent events
P(6₁ and 6₂) = P(6₁) P(6₂) = (1/6)(1/6) = 1/36
32
Independent Events
“10%” of the people in a large population have disease H. If a random sample of two subjects was selected from this population, what is the probability that both subjects have disease H?
Hi: Event that the i-th randomly selected subject has disease H.
P(H2|H1) = P(H2) [Events are almost independent]
P(H1 ∩ H2) = ?
P(H1 ∩ H2) = P(H1) P(H2) = .1 × .1 = .01
33
Independent Events
If events A1, A2, …, Ak are independent, then
P(A1 and A2 and … and Ak) = P(A1) · P(A2) · … · P(Ak)
What is the probability of getting all heads when tossing a balanced coin four times?
P(H1) · P(H2) · P(H3) · P(H4) = (.5)^4 = .0625
34
Binomial Probability
What is the probability of getting two 6’s when casting a balanced die 5 times?
P(S∩S∩S’∩S’∩S’) = (1/6)^2 × (5/6)^3 (S = a 6 is rolled; this is the probability of one particular order, two 6’s followed by three non-6’s)
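The product above covers a single ordering of the two 6's. To get the probability of exactly two 6's in any order, the standard binomial step (not shown on the slide) is to multiply by the number of ways to choose which 2 of the 5 casts show a 6. A minimal sketch:

```python
from math import comb

p_six = 1 / 6

# Probability of one specific order, e.g. 6, 6, not 6, not 6, not 6.
p_one_sequence = p_six ** 2 * (1 - p_six) ** 3

# Number of ways to choose which 2 of the 5 casts are the 6's.
n_arrangements = comb(5, 2)

# Every arrangement has the same probability, so multiply.
p_exactly_two = n_arrangements * p_one_sequence

print(f"one sequence: {p_one_sequence:.5f}")
print(f"arrangements: {n_arrangements}")
print(f"exactly two 6's in 5 casts: {p_exactly_two:.5f}")
```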
Multiplication Rule 2
If in the population, 50% of the people smoked, and 40% of the smokers have lung cancer, what percentage of the population are smokers and have lung cancer?
P(S) = .5 (50% of the subjects smoked)
P(C|S) = .4 (40% of the smokers have cancer)
P(C and S) = P(C|S) P(S) = .4 x .5 = .2