Jan 11, 2016
Simple statistics for clinicians on respiratory research
By Giovanni Sotgiu
Hygiene and Preventive Medicine Institute
University of Sassari Medical School
Italy
What are your expectations?
Too difficult to explain medical statistics in 30 min…..
What is medical statistics?
• “..Discipline concerned with the treatment of numerical data derived from groups of individuals..” P Armitage
• “..Art of dealing with variation in data through collection, classification and analysis in such a way as to obtain reliable results..” JM Last
What is medical statistics?
Collection of statistical procedures
well-suited to the analysis of healthcare-related data
Why we need to study statistics in the field of medicine…
1) Basic requirement of medical research
2) Update your medical knowledge
3) Data management and treatment
Road map
1) Basic concepts
2) Sample and population
3) Probability
4) Data description
5) Measures of disease
Basic concepts
1. Homogeneity
All individuals have similar values or belong to the same category
Ex.: all individuals are Chinese,
….women,
….middle-aged (30-40 years old),
….work in the same factory
Homogeneity in nationality, gender, age and occupation
Basic concepts
1. Variation
Differences in height, weight, treatment…
• Toss a coin: the marked face may be up or down
• Treat patients suffering from TB with the same antibiotics: some of them recover and others do not
No variation, no statistics
2. Population
What is the target of our studies?
Population: the whole collection of individuals that one intends to study
Limits: economic issues, short time
2. Sample
A representative part of the population
Sampling
By chance!
Random
• Random event: the event may or may not occur in one experiment
• Before one experiment, nobody is sure whether the event will occur or not
Please, give some examples of random events…
The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under inferential statistics (generalization)
3. Probability
Measures the possibility of occurrence of a random event
$$P(A) = \frac{\text{number of ways event } A \text{ can occur}}{\text{total number of possible outcomes}}$$
Estimation of probability: frequency (relative frequency theory)
Number of observations: n (large enough)
Number of occurrences of random event A: m
$$P(A) \approx \frac{m}{n}$$
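The relative frequency idea can be tried with a simulated coin. The sketch below is only an illustration: the fair coin, the fixed seed and the function name `relative_frequency` are assumptions, not part of the slides.

```python
import random


def relative_frequency(n, seed=0):
    """Estimate P(heads) for a fair coin as m/n from n simulated tosses."""
    rng = random.Random(seed)  # fixed seed (assumption) so runs are repeatable
    m = sum(rng.random() < 0.5 for _ in range(n))  # m = occurrences of event A
    return m / n


# The estimate m/n tends to settle near 0.5 as n grows.
for n in (10, 100, 10_000):
    print(n, relative_frequency(n))
```

With a small n the estimate can be far from 0.5, which is why the slide asks for n to be "large enough".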
3. Probability
A: a random event
P(A): probability of the random event A
P(A) = 1 if an event always occurs
P(A) = 0 if an event never occurs
Please, give some examples of the probability of a random event and of the frequency of that random event
Parameters and statistics
4. Parameter
A measurement describing some characteristic of a population
or
A measurement of the distribution of a characteristic of a population
Greek letter (μ, π, etc.)
Usually unknown
To estimate the parameter of a population, we need a sample
4. Statistic
A measurement describing some characteristic of a sample
or
A measurement of the distribution of a characteristic of a sample
Latin letter (s, p, etc.)
Please give an example of a parameter and of a statistic
Does a parameter vary?
Does a statistic vary?
5. Sampling Error
Difference between observed value and true value
1) Systematic error (fixed)
2) Measurement error (random)
3) Sampling error (random)
Sampling error
• The statistic differs from the parameter!
• The statistics of different samples from the same population differ from each other!
• Sampling error exists in any sampling research
• It cannot be avoided but may be estimated
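Both points about sampling error can be seen in a small simulation. This is a sketch under stated assumptions: the population (10,000 simulated blood pressure values), the sample size of 50 and the seed are all hypothetical choices made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed (assumption) for repeatable output

# Hypothetical population: 10,000 systolic blood pressure values (mmHg).
population = [rng.gauss(120, 15) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter (known here only because we simulated it)

# Draw 5 random samples of 50 individuals and compute the statistic for each.
sample_means = [statistics.mean(rng.sample(population, 50)) for _ in range(5)]

for x_bar in sample_means:
    print(f"sample mean = {x_bar:.1f}   sampling error = {x_bar - mu:+.1f}")
```

Each sample mean differs from the population mean, and the sample means differ from each other, although all samples come from the same population.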
Nature of data
Variables and data
• Variables are labels whose value can literally vary
• Data are the values you get from observing, measuring, counting, assessing, etc.
Data
• Categorical data
  - Nominal data
  - Ordinal data
• Metric data
  - Discrete data
  - Continuous data
Nominal categorical data
• It can be allocated into one of a number of categories
• Blood type, sex, Linezolid treatment (y/n)
• Data cannot be arranged in an ordering scheme
Ordinal categorical data
• It can be allocated to one of a number of categories, and the categories can be put in a meaningful order
• Differences between categories cannot be determined or are meaningless
• Very satisfied, satisfied, neutral, unsatisfied, very unsatisfied (new treatment)
Discrete metric data
• Countable variables: the number of possible values is finite
• Number of days of hospitalization
• Number of men treated with isoniazid
Continuous metric data
• Measurable variables
• Infinitely many possible values: a continuous scale covering a range of values without gaps
• Kg, m, mmHg, years
Describing data with tables
1) actual frequency
2) relative and cumulative frequency
3) grouped frequency
4) open-ended groups
5) cross-tabulation
1) Frequency table
Frequency distribution

TB mortality (%)   Tally        No. of wards
11.2-15.1          IIIIIIIII    9
15.2-20.1          IIIIIIII     8
20.2-25.1          IIIII        5
25.2-30.1          III          3
30.2-35.1          II           2

First column: the variable; last column: the frequency
2) Relative frequency, cumulative frequency
Relative frequency: proportion of the total

No. of resistances   No. of patients   Relative frequency (%)   Cumulative frequency (%)
0                    5                 12.5                     12.5
1                    6                 15                       27.5
2                    14                35                       62.5
3                    10                25                       87.5
4                    3                 7.5                      95
7                    1                 2.5                      97.5
8                    1                 2.5                      100
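The relative and cumulative columns follow directly from the patient counts. A minimal sketch that recomputes them from the counts in the table (the variable names are illustrative only):

```python
# Number of resistances -> number of patients, as in the table.
counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}
total = sum(counts.values())  # 40 patients

rows = []
cumulative = 0.0
for k in sorted(counts):
    relative = 100 * counts[k] / total  # relative frequency (%)
    cumulative += relative              # running total (%)
    rows.append((k, counts[k], relative, cumulative))

for k, n, rel, cum in rows:
    print(f"{k:>2}  {n:>3}  {rel:>5.1f}  {cum:>5.1f}")
```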
3) Grouped frequency
Grouped frequency works for continuous metric data

Birth weight (g)   No. of infants born from mothers with TB
2700-2999          2
3000-3299          3
3300-3599          9
3600-3899          9
3900-4199          4
4200-4499          3

A group width of 300 g
Each group has a class lower limit (e.g. 2700) and a class upper limit (e.g. 2999)
General rules
• Frequency table: nominal, ordinal and discrete metric data
• Grouped frequency table: continuous metric data
4) Open-ended group
• One or more values, called outliers, lie far away from the general mass of the data
• Use ≤ or ≥
5) Cross-tabulation
• Two variables within a single group of individuals

                 TB/HIV+
Pulmonary mass   Yes   No   Totals
Benign           21    11   32
Malignant        4     4    8
Totals           25    15   40
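The row and column totals of a cross-tabulation are simple sums over the cells. A small sketch using the four cell counts from the table above (the dictionary layout is just one possible representation):

```python
# Cell counts: pulmonary mass (rows) by TB/HIV+ status (columns).
table = {
    "Benign": {"Yes": 21, "No": 11},
    "Malignant": {"Yes": 4, "No": 4},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {col: sum(table[row][col] for row in table) for col in ("Yes", "No")}
grand_total = sum(row_totals.values())

print(row_totals, col_totals, grand_total)
```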
Describing data with charts
1) Charting nominal data
   a) pie chart
   b) simple bar chart
   c) clustered bar chart
   d) stacked bar chart
2) Charting ordinal data
   1) pie chart
   2) bar chart
   3) dotplot
3) Charting discrete metric data
4) Charting continuous metric data
   histogram
5) Charting cumulative ordinal or discrete metric data
   step chart
6) Charting cumulative metric continuous data
   cumulative frequency curve or ogive
7) Charting time-based data
   time-series chart
1-a) Pie chart
• 4-5 categories
• One variable
• Start at 0° in the same order as the table
[Figure: Adverse events of ethionamide]
1-b) Simple bar chart
• Same widths, equal spaces between bars
1-c) Clustered bar chart
1-d) Stacked bar chart
2-3) Dot-plot
Useful with ordinal variables if the number of categories is too large for a bar chart
4) Histogram
[Figure: Percentage of age distribution of pregnant TB women. x-axis: age group (<19, 20-24, 25-29, 30-34, >35); y-axis: TB cases (%), 0 to 40]
6) Cumulative frequency curve
[Figure: Percentage of cumulative frequency curves of age for males and females who develop TB. x-axis: age group (15-24 to >85); y-axis: cumulative frequency (%), 0 to 100]
Describing data from its distributional shape
Symmetric mound-shaped distributions
Skewed distributions
[Figure: Age distribution for migrants who develop TB. x-axis: age group (15-24 to >85); y-axis: 0 to 160]
Bimodal distributions
A bimodal distribution is one with two distinct humps
Normal-ness
• Symmetric
• Same mean, median, mode
Describing data with numeric summary value
• 1. numbers, proportions (percentages)
• 2. summary measures of location
• 3. summary measures of spread
Numbers and proportions
• Numbers: actual frequencies
• Percentage: a proportion multiplied by 100
1) Prevalence
2) Incidence
Prevalence
- Nature: relative frequency
- Number of existing cases in some population at a given time
[Figure: population at t0, divided into disease and health]
$$\text{Prevalence} = \frac{\text{No. of existing cases of a disease at } t_0}{\text{total population}}$$
Range: 0 to 1
Population               A (N=6)    B (N=4)
Absolute frequency       fa=1       fa=1       No comparison
Relative frequency       fr=0.17    fr=0.25    Comparison
Prevalence
[Figure: groups of diseased and healthy individuals, with P = 0, P = 0.25 and P = 1]
Prevalence data:
- Highlight the time of the evaluation
Example:
P (2010)= 0.17
P (2010)= 17 per 100 individuals
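As a quick check of the prevalence formula, the sketch below recomputes the relative frequencies of the two populations compared earlier (A: 1 case among 6 people; B: 1 case among 4). The function name `prevalence` is only illustrative.

```python
def prevalence(existing_cases, total_population):
    """Existing cases at t0 divided by the total population (value between 0 and 1)."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return existing_cases / total_population


p_a = prevalence(1, 6)
p_b = prevalence(1, 4)
print(f"A: {p_a:.2f}  ({100 * p_a:.0f} per 100 individuals)")
print(f"B: {p_b:.2f}  ({100 * p_b:.0f} per 100 individuals)")
```

The absolute frequencies are equal (1 case each), but only the relative frequencies allow the two populations to be compared.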
Incidence
Estimates the risk of developing disease
[Figure: people at risk (healthy) at t0, followed to t1; outcome: disease or health]
$$\text{Incidence} = \frac{\text{No. of new cases during } t_0\text{-}t_1}{\text{total population at risk}}$$
- Measures the probability or risk of developing disease during a given time period
- Absolute risk: probability of developing an adverse event
Incidence
- Assess the health status at baseline: exclude prevalent cases at t0
- Define a follow-up for the cohort: healthy people followed up for a given time period
Cohort
Closed population: adds no new members over time, and loses members only to disease/death
Open population: may gain members over time, through immigration or birth, or lose members through emigration
Cumulative incidence
- Closed population
- Individual time period at risk: same period for all the members
[Figure: individuals A to E followed from t0 (0) to t1 (3). x-axis: time; y-axis: people]
$$\text{Cumulative incidence} = \frac{\text{No. of new cases during } t_0\text{-}t_1}{\text{total population at risk}}$$
Cumulative incidence
Example: population at risk at t0 = 24; new cases = 3; follow-up = 3 years
CI in 3 years = 3/24 = 0.125 new cases per 1 individual at risk enrolled at t0
= 12.5 new cases in 100 individuals at risk enrolled at t0
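The example can be reproduced in a few lines. This is a minimal sketch; the function name `cumulative_incidence` is an illustrative choice.

```python
def cumulative_incidence(new_cases, population_at_risk):
    """New cases during follow-up divided by the people at risk at t0 (closed population)."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk


ci = cumulative_incidence(new_cases=3, population_at_risk=24)
print(ci)        # per 1 individual at risk over the 3-year follow-up
print(100 * ci)  # per 100 individuals at risk
```

The follow-up length does not enter the calculation, so it has to be reported alongside the result ("CI in 3 years").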
[Figure: follow-up from t0 (0) to t1 (3). x-axis: time; y-axis: people]
Cumulative incidence…critical features
- Closed population: rare
- Short follow-up and enrollment of a few individuals
- Open population
Open population
- Non-cases (drop-outs) and cases during the follow-up
- Enrollment of new individuals during the follow-up
- Length of follow-up not uniform
[Figure: open population; individuals A to I followed between t0 and t1 for different lengths of time, with drop-outs and cases marked. x-axis: time; y-axis: people]
Open population
Dynamic cohort
Individual time period at risk: not uniform
Estimate the population at risk:
- Total person-time
- Estimate of the total person-time
Dynamic cohort
Total person-time: sum of the individual time periods at risk
Person-time: person-days, person-months, person-years
Density of incidence
$$\text{Density of incidence} = \frac{\text{No. of new cases during } t_0\text{-}t_1}{\text{total person-time}}$$

N             Individual time period at risk (years)   Person-years
1 (A)         5                                        1 person x 5 years = 5 person-years
3 (B, C, D)   2                                        3 persons x 2 years = 6 person-years
2 (E, F)      2.5                                      2 persons x 2.5 years = 5 person-years
2 (G, H)      1.5                                      2 persons x 1.5 years = 3 person-years
1 (I)         3                                        1 person x 3 years = 3 person-years

Total person-time: 22 person-years
Density of incidence
$$\frac{1 \text{ new case}}{22 \text{ person-years}} = \frac{0.045 \text{ new cases}}{1 \text{ person-year}} = 0.045$$
= 45 per 1000 person-years
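The person-time total and the resulting density of incidence can be recomputed from the table of individual follow-up times. A minimal sketch, with the groups written as (number of people, years at risk per person):

```python
# (number of people, individual time at risk in years), as in the person-years table.
groups = [
    (1, 5.0),   # A
    (3, 2.0),   # B, C, D
    (2, 2.5),   # E, F
    (2, 1.5),   # G, H
    (1, 3.0),   # I
]
new_cases = 1

person_years = sum(n * years for n, years in groups)
density = new_cases / person_years  # new cases per person-year
per_1000 = 1000 * density           # new cases per 1000 person-years

print(person_years, round(density, 3), round(per_1000))
```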
Density of incidence
Open population
Estimate of the total person-time
Individual time period at risk not known for all
- Migration
Assumption: movement of the cohort occurs in the middle of the follow-up
Estimate of the total person-time
(P0 + Pt)/2 x follow-up
At t0: 100 people
Follow-up: 3 years
New cases: 3
Drop-out: 17
Enrollment during the follow-up: 16
P0 = 100; Pt = 100 - 3 - 17 + 16 = 96
(P0 + Pt)/2 x follow-up = (100 + 96)/2 x 3 = 294 person-years
Estimate of the total person-time
Test the estimate:
People followed for the whole period: 100 - 3 - 17 = 80
80 people x 3 years = 240 person-years
Movement of the cohort (each person counted for half of the follow-up, 1.5 years):
(17 x 1.5) + (3 x 1.5) + (16 x 1.5) = 54 person-years
240 + 54 = 294 person-years
Incidence rate
$$\text{Incidence rate} = \frac{\text{No. of new cases during } t_0\text{-}t_1}{\text{estimate of total person-time}}$$
3 new cases / 294 person-years x 1000 = 10.2 per 1000 person-years
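The estimate and its check can be reproduced with the example figures (100 people at t0, 3 new cases, 17 drop-outs, 16 people enrolled during a 3-year follow-up). This is a sketch; the function name `estimated_person_time` is an illustrative choice.

```python
def estimated_person_time(p0, pt, follow_up):
    """Mid-period approximation: average population size times length of follow-up."""
    return (p0 + pt) / 2 * follow_up


p0, follow_up = 100, 3
new_cases, drop_outs, enrolled = 3, 17, 16

pt = p0 - new_cases - drop_outs + enrolled  # population at the end of follow-up
person_years = estimated_person_time(p0, pt, follow_up)

# Check: people followed for the whole period, plus everyone who moved in or out,
# each counted for half of the follow-up.
stayed = p0 - new_cases - drop_outs
check = stayed * follow_up + (drop_outs + new_cases + enrolled) * follow_up / 2

rate_per_1000 = 1000 * new_cases / person_years
print(pt, person_years, check, round(rate_per_1000, 1))
```

Both routes give the same total because counting every mover for half of the follow-up is exactly what averaging P0 and Pt assumes.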
Summary measures of location
1) mode: the category or value that occurs most often (typical-ness). Categorical and discrete metric data
2) median: the middle value when the values are in ascending order (central-ness). Ordinal and metric data
3) mean (average): the sum of the values divided by the number of values
4) percentile: the values that divide the ordered data into 100 equal-sized groups
Choosing the most appropriate measure

                    Mode   Median                      Mean
Nominal             yes    no                          no
Ordinal             yes    yes                         no
Metric discrete     yes    yes, when markedly skewed   yes
Metric continuous   yes    yes, when markedly skewed   yes
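The measures of location can be computed with Python's standard `statistics` module. As an illustration, the sketch below reuses the number of drug resistances per patient from the relative frequency table (40 patients); using that data set here is a choice made for this example.

```python
import statistics

# Number of resistances -> number of patients (from the relative frequency table).
counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}
values = [k for k, n in counts.items() for _ in range(n)]  # one entry per patient

mode = statistics.mode(values)
median = statistics.median(values)
mean = statistics.mean(values)
quartiles = statistics.quantiles(values, n=4)  # 25th, 50th and 75th percentiles

print(f"mode = {mode}, median = {median}, mean = {mean:.3f}")
print("quartiles:", quartiles)
```

The two patients with 7 and 8 resistances pull the mean above the median, which is the kind of skew for which the table suggests reporting the median.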
Summary measure of spread
• Range: distance from the smallest value to the largest
• IQR (interquartile range): spread of the middle half of the values
• Boxplot: graphical summary of the three quartile values, the minimum and maximum values, and outliers
Standard deviation
• Average distance of all the data values from the mean value
• The smaller the average distance is, the narrower the spread, and vice versa
• Used with metric data only
1. Subtract the mean from each of the n values in the sample, to give the difference values
2. Square each of these differences
3. Add these squared values together (sum of squares)
4. Divide the sum of squares by 1 less than the sample size. (n-1)
5. Take the square-root
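The five steps can be followed one by one in code. The data values below are hypothetical, chosen only to keep the arithmetic easy to follow; the result is compared with `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample (assumption for illustration)
n = len(values)
mean = sum(values) / n

differences = [x - mean for x in values]   # step 1: subtract the mean
squared = [d ** 2 for d in differences]    # step 2: square each difference
sum_of_squares = sum(squared)              # step 3: add the squared values
variance = sum_of_squares / (n - 1)        # step 4: divide by n - 1
sd = math.sqrt(variance)                   # step 5: take the square root

print(f"mean = {mean}, sum of squares = {sum_of_squares}, SD = {sd:.3f}")
print("matches statistics.stdev:", math.isclose(sd, statistics.stdev(values)))
```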
Standard deviation and the normal distribution
The Basic Steps of Statistical Work
1. Design of study
• Professional design: research aim, subjects, measures, etc.
• Statistical design: sampling or allocation method, sample size, randomization, data processing, etc.
2. Collection of data
• Source of data: government report system, registration system, routine records, ad hoc survey
• Data collection: accurate, complete, in time
Protocol: place, subjects, timing; training; pilot; questionnaire; instruments; sampling method and sample size; budget
Procedure: observation, interview, filling in forms, letter, telephone, web
3. Data Sorting
• Checking: by hand, computer software
• Amend
• Missing data?
• Grouping: according to categorical variables (sex, occupation, disease…) or according to numerical variables (age, income, blood pressure…)
4. Data Analysis
• Descriptive statistics (show the sample): mean, incidence rate…; table and plot
• Inferential statistics (towards the population): estimation, hypothesis test (comparison)
Definition of Selection Bias
Selection bias: Selection biases are distortions that result from procedures used to select subjects and from factors that influence study participation. The common element of such biases is that the association between exposure and disease is different for those who participate and those who should be theoretically eligible for study, including those who do not participate.
It is sometimes (but not always) possible to disentangle the effects of participation from those of disease determinants using standard methods for the control of confounding. One example is the bias introduced by matching in case-control studies.
Definition of Confounding
Confounding: bias in estimating an epidemiologic measure of effect resulting from an imbalance of other causes of disease in the compared groups (mixing of effects).
Characteristics of a Confounder
• associated with disease (in non-exposed)
• associated with exposure (in source population)
• not an intermediate cause