Simple statistics for clinicians on respiratory research By Giovanni Sotgiu Hygiene and Preventive Medicine Institute University of Sassari Medical School.

Simple statistics for clinicians on respiratory research

By Giovanni Sotgiu

Hygiene and Preventive Medicine Institute

University of Sassari Medical School

Italy

What are your expectations?

Too difficult to explain medical statistics in 30 min…..

What is medical statistics?

• “..Discipline concerned with the treatment of numerical data derived from groups of individuals..” P Armitage

• “..Art of dealing with variation in data through collection, classification and analysis in such a way as to obtain reliable results..” JM Last



Collection of statistical procedures

well-suited to the analysis of healthcare-related data

Why we need to study statistics in the field of medicine…

1) Basic requirement of medical research

2) Update your medical knowledge

3) Data management and treatment

Road map

1) Basic concepts

2) Sample and population

3) Probability

4) Data description

5) Measures of disease

Basic concepts

1. Homogeneity

All individuals have similar values or belong to the same category

Example: all individuals are Chinese, women, middle-aged (30-40 years old), and work in the same factory: homogeneity in nationality, gender, age and occupation

1. Variation

Differences in height, weight, treatment…

• Toss a coin: the marked face may land up or down

• Treat patients suffering from TB with the same antibiotics: some of them recover and others do not

No variation, no statistics

2. Population and sample

What is the target of our studies?

Population: the whole collection of individuals that one intends to study

Limits to studying the whole population: economic issues, short time

Sample: a representative part of the population

Sampling: by chance!

Random

• Random event: the event may or may not occur in one experiment

• Before the experiment, nobody is sure whether the event will occur or not

Please give some examples of random events…

The mathematical procedures whereby we convert information about the sample into intelligent guesses about the population fall under the section of inferential statistics (generalization)

3. Probability

Measures the possibility of occurrence of a random event

$$P(A) = \frac{\text{number of ways event } A \text{ can occur}}{\text{total number of possible outcomes}}$$

Estimation of probability: frequency

Number of observations: n (large enough)

Number of occurrences of random event A: m

$$P(A) \approx \frac{m}{n} \quad \text{(relative frequency theory)}$$

A: random event

P(A): probability of the random event A

P(A) = 1 if an event always occurs

P(A) = 0 if an event never occurs

Please give some examples of the probability of a random event and the frequency of that random event
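As a rough illustration of the relative frequency idea above, the sketch below simulates tosses of a fair coin and estimates P(A) as m/n. The function name `relative_frequency`, the fixed seed and the choice of a fair coin are assumptions made for this example only.

```python
import random


def relative_frequency(n_tosses, seed=0):
    """Estimate P(heads) as m/n from n simulated tosses of a fair coin."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    # m = number of occurrences of the random event A (heads)
    m = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return m / n_tosses


# The estimate tends to settle near 0.5 as n grows ("n large enough").
for n in (10, 100, 10_000):
    print(n, relative_frequency(n))
```

With few tosses the frequency can be far from 0.5; this is why the slide asks for n to be large enough.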

Parameters and statistics

4. Parameter

A measurement describing some characteristic of a population

or

A measurement of the distribution of a characteristic of a population

Greek letter (μ, π, etc.)

Usually unknown

To know the parameter of a population, we need a sample

4. Statistic

A measurement describing some characteristic of a sample

or

A measurement of the distribution of a characteristic of a sample

Latin letter (s, p, etc.)

Please give an example of a parameter and of a statistic

Does a parameter vary?

Does a statistic vary?

5. Sampling error

Error: difference between the observed value and the true value

1) Systematic error (fixed)

2) Measurement error (random)

3) Sampling error (random)

Sampling error

• The statistic differs from the parameter!

• The statistics of different samples from the same population differ from each other!

Sampling error exists in any sampling research

It cannot be avoided but may be estimated
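The two points about sampling error can be seen in a small simulation. This is only a sketch: the population of 10,000 heights (mean 170 cm, standard deviation 10 cm), the sample size of 50 and the helper `sample_mean` are all hypothetical choices made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so that the run is repeatable

# Hypothetical population: 10,000 heights in cm (an assumption for this sketch).
population = [rng.gauss(170, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # the parameter (normally unknown)


def sample_mean(n):
    """Mean of a simple random sample of size n drawn from the population."""
    return statistics.mean(rng.sample(population, n))


# Five samples from the same population give five different statistics.
sample_means = [sample_mean(50) for _ in range(5)]
errors = [m - mu for m in sample_means]  # sampling error of each sample

for m, e in zip(sample_means, errors):
    print(f"sample mean = {m:.2f}, sampling error = {e:+.2f}")
```

Each sample mean differs from the population mean and from the other sample means, even though every sample was drawn by chance from the same population.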

Nature of data

Variables and data

• Variables are labels whose value can literally vary

• Data are the values you get from observing, measuring, counting, assessing, etc.

Data

• Categorical data: nominal data, ordinal data

• Metric data: discrete data, continuous data

Nominal categorical data

• It can be allocated into one of a number of categories

• Blood type, sex, Linezolid treatment (y/n)

• Data cannot be arranged in an ordering scheme

Ordinal categorical data

• It can be allocated to one of a number of categories, and the categories have a meaningful order

• Differences cannot be determined or are meaningless

• Very satisfied, satisfied, neutral, unsatisfied, very unsatisfied (new treatment)

Discrete metric data

• Countable variables: the number of possible values is finite

• Number of days of hospitalization

• Number of men treated with isoniazid

Continuous metric data

• Measurable variables

• Infinitely many possible values: a continuous scale covering a range of values without gaps

• Kg, m, mmHg, years

Describing data with tables

1) actual frequency

2) relative and cumulative frequency

3) grouped frequency

4) open-ended groups

5) cross-tabulation

1) Frequency table

Frequency distribution

TB mortality (%)    Tally                         No. of wards
11.2-15.1           1, 1, 1, 1, 1, 1, 1, 1, 1     9
15.2-20.1           1, 1, 1, 1, 1, 1, 1, 1        8
20.2-25.1           1, 1, 1, 1, 1                 5
25.2-30.1           1, 1, 1                       3
30.2-35.1           1, 1                          2

The first column lists the values of the variable; the last column gives the frequency

2) Relative frequency, cumulative frequency

Relative frequency: proportion of the total

No. of resistances    No. of patients    Relative frequency (%)    Cumulative frequency (%)
0                     5                  12.5                      12.5
1                     6                  15                        27.5
2                     14                 35                        62.5
3                     10                 25                        87.5
4                     3                  7.5                       95
7                     1                  2.5                       97.5
8                     1                  2.5                       100
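The relative and cumulative columns of the table above can be recomputed from the patient counts alone. The following is a minimal sketch; the variable names are chosen for this example.

```python
# No. of resistances -> No. of patients, copied from the table above.
counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}

total = sum(counts.values())  # total number of patients

rows = []
cumulative = 0.0
for value in sorted(counts):
    relative = 100 * counts[value] / total  # relative frequency (%)
    cumulative += relative                  # running total (%)
    rows.append((value, counts[value], relative, cumulative))

for value, n, relative, cum in rows:
    print(f"{value:>2} {n:>3} {relative:>6.1f} {cum:>6.1f}")
```

The last cumulative value must reach 100%, which is a quick check that no category has been left out.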

3) Grouped frequency

Grouped frequency works for continuous metric data

Birth weight (g)    No. of infants born from mothers with TB
2700-2999           2
3000-3299           3
3300-3599           9
3600-3899           9
3900-4199           4
4200-4499           3

Each group has a width of 300 g; the first value of each group is the class lower limit and the second is the class upper limit

General rules

• Frequency table: nominal, ordinal and discrete metric data

• Grouped frequency table: continuous metric data

4) Open-ended group

• One or more values, called outliers, that lie far away from the general mass of the data

• Use ≤ or ≥

5) Cross-tabulation

• Two variables within a single group of individuals

Pulmonary mass    TB/HIV+ Yes    TB/HIV+ No    Totals
Benign            21             11            32
Malignant         4              4             8
Totals            25             15            40
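The row and column totals of the cross-tabulation above follow directly from the four cell counts. A small sketch, with a nested dictionary chosen here as one possible way to hold the cells:

```python
# Cell counts from the cross-tabulation above:
# rows = pulmonary mass, columns = TB/HIV+ (Yes / No).
table = {
    "Benign": {"Yes": 21, "No": 11},
    "Malignant": {"Yes": 4, "No": 4},
}

# Row totals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: sum down the rows of each column.
col_totals = {col: sum(table[row][col] for row in table) for col in ("Yes", "No")}

grand_total = sum(row_totals.values())

print(row_totals, col_totals, grand_total)
```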

Describing data with charts

1) Charting nominal data: a) pie chart, b) simple bar chart, c) clustered bar chart, d) stacked bar chart

2) Charting ordinal data: 1) pie chart, 2) bar chart, 3) dotplot

3) Charting discrete metric data

4) Charting continuous metric data: histogram

5) Charting cumulative ordinal or discrete metric data: step chart

6) Charting cumulative metric continuous data: cumulative frequency curve or ogive

7) Charting time-based data: time-series chart

1-a) Pie chart

• 4-5 categories

• One variable

• Start at 0° in the same order as the table

[Figure: adverse events of ethionamide]

1-b) Simple bar chart

• Same widths, equal spaces between bars

1-c) Clustered bar chart

1-d) Stacked bar chart

2-3) Dot-plot

Useful with ordinal variables if the number of categories is too large for a bar chart

4) Histogram

[Figure: percentage age distribution of pregnant women with TB; x-axis: age group (<19, 20-24, 25-29, 30-34, >35); y-axis: TB cases (%), 0 to 40]

6) Cumulative frequency curve

[Figure: cumulative percentage frequency curves of age for males and females who develop TB; x-axis: age group (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85); y-axis: 0 to 100]

Describing data from its distributional shape

Symmetric mound-shaped distributions

Skewed distributions

[Figure: age distribution for migrants who develop TB; x-axis: age group (15-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75-84, >85); y-axis: 0 to 160]

Bimodal distributions

A bimodal distribution is one with two distinct humps

Normal-ness

• Symmetric

• Same mean, median, mode

Describing data with numeric summary values

• 1. numbers, proportions (percentages)

• 2. summary measures of location

• 3. summary measures of spread

Numbers and proportions

• Numbers: actual frequencies

• Percentage: a proportion multiplied by 100

1) Prevalence

2) Incidence

Prevalence

- By nature a relative frequency: number of existing cases in some population at a given time t0

$$\text{Prevalence} = \frac{\text{No. of existing cases of a disease at } t_0}{\text{total population}}, \qquad 0 \le P \le 1$$

Two populations: A (N=6) and B (N=4)

Absolute frequency: fa=1 in A, fa=1 in B (no comparison)

Relative frequency: fr=0.17 in A, fr=0.25 in B (comparison)

[Figure: three example populations of diseased and healthy individuals with P = 0, P = 0.25 and P = 1]

Prevalence data:

- Highlight the time of the evaluation

Example:

P (2010) = 0.17

P (2010) = 17 per 100 individuals
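The comparison between populations A and B can be reproduced in a few lines. This is a minimal sketch; the function name `prevalence` is an assumption made for this example.

```python
def prevalence(existing_cases, total_population):
    """Existing cases at t0 divided by the total population (a value from 0 to 1)."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return existing_cases / total_population


# Populations A (N=6) and B (N=4), each with one existing case.
p_a = prevalence(1, 6)
p_b = prevalence(1, 4)

print(f"A: {p_a:.2f} = {100 * p_a:.0f} per 100 individuals")
print(f"B: {p_b:.2f} = {100 * p_b:.0f} per 100 individuals")
```

The absolute frequencies are identical (one case each), but the relative frequencies differ, which is why only the latter allow a comparison.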

Incidence

Estimates the risk of developing disease

[Figure: people at risk (healthy) followed from t0 to t1; legend: disease, health]

$$\text{Incidence} = \frac{\text{No. of new cases during the period } t_0\text{-}t_1}{\text{total population at risk}}$$

- Measures the probability or risk of developing disease during a given time period

- Absolute risk: probability of developing an adverse event

- Assess the health status at baseline: exclude prevalent cases at t0

- Define a follow-up for the cohort

Cohort: healthy people followed up for a given time period

Closed population: adds no new members over time, and loses members only to disease/death

Open population: may gain members over time, through immigration or birth, or lose members through emigration

Cumulative incidence

- Closed population

- Individual time period at risk: same period for all the members

[Figure: follow-up lines for people A to E over time, from t0 (0) to t1 (3)]

$$CI = \frac{\text{No. of new cases during the period } t_0\text{-}t_1}{\text{total population at risk at } t_0}$$

Example: population at risk at t0 = 24; new cases = 3; follow-up = 3 years

CI in 3 years = 3/24 = 0.125 new cases per individual at risk enrolled at t0

12.5 new cases per 100 individuals at risk enrolled at t0
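The worked example (24 people at risk at t0, 3 new cases over 3 years) can be checked with a short sketch. The function name `cumulative_incidence` is an assumption made for this example.

```python
def cumulative_incidence(new_cases, at_risk_at_t0):
    """New cases during the follow-up divided by the population at risk at t0."""
    if at_risk_at_t0 <= 0:
        raise ValueError("at_risk_at_t0 must be positive")
    return new_cases / at_risk_at_t0


ci = cumulative_incidence(new_cases=3, at_risk_at_t0=24)

print(f"CI in 3 years = {ci} per individual at risk")
print(f"CI in 3 years = {100 * ci} per 100 individuals at risk")
```

The follow-up length does not enter the calculation, so it should always be reported next to the result.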

Cumulative incidence: critical features

- Closed populations are rare

- Short follow-up and enrollment of a few individuals

- Open population

Open population

- Non-cases (drop-outs) and cases during the follow-up

- Enrollment of new individuals during the follow-up

- Length of follow-up not uniform

[Figure: follow-up lines for people A to I over time, from t0 to t1; legend: drop-out, case]

Open population: dynamic cohort

Individual time period at risk not uniform

Estimate the population at risk:

- Total person-time

- Estimate of the total person-time

Total person-time: sum of the individual time periods at risk

Person-time: person-days, person-months, person-years

Density of incidence

$$\text{Incidence density} = \frac{\text{No. of new cases}}{\text{total person-time}}$$

N               Individual time period at risk (years)    Calculation              Person-years
1 (A)           5                                         1 person x 5 years       5
3 (B, C, D)     2                                         3 persons x 2 years      6
2 (E, F)        2.5                                       2 persons x 2.5 years    5
2 (G, H)        1.5                                       2 persons x 1.5 years    3
1 (I)           3                                         1 person x 3 years       3

Total person-time: 22 person-years

1 new case / 22 person-years = 0.045 new cases per person-year

45 per 1000 person-years
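The person-years table above can be recomputed step by step. A minimal sketch, with the groups entered as (number of people, years at risk) pairs:

```python
# (number of people, individual time at risk in years), from the table above.
groups = [
    (1, 5.0),   # A
    (3, 2.0),   # B, C, D
    (2, 2.5),   # E, F
    (2, 1.5),   # G, H
    (1, 3.0),   # I
]

# Total person-time: each person contributes their own time at risk.
total_person_years = sum(n * years for n, years in groups)

new_cases = 1
incidence_density = new_cases / total_person_years  # new cases per person-year

print(f"total person-time = {total_person_years} person-years")
print(f"incidence density = {incidence_density:.3f} per person-year")
print(f"                  = {1000 * incidence_density:.0f} per 1000 person-years")
```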


Open population: estimate of the total person-time

Individual time period at risk not known for all

- Migration

Movement of the cohort is placed in the middle of the follow-up

$$\text{Estimated total person-time} = \frac{P_0 + P_t}{2} \times \text{follow-up}$$

Example:

At t0: 100 people

Follow-up: 3 years

New cases: 3

Drop-out: 17

Enrollment during the follow-up: 16

P0 = 100; Pt = 100 - 3 - 17 + 16 = 96

(100 + 96)/2 x 3 = 294 person-years

Test the estimate:

People followed for the whole period: 80 people x 3 years = 240 person-years

Movement of the cohort: (17 x 1.5) + (3 x 1.5) + (16 x 1.5) = 54 person-years

240 + 54 = 294 person-years

Incidence rate

$$\text{Incidence rate} = \frac{\text{No. of new cases}}{\text{estimate of total person-time}}$$

3 new cases / 294 person-years x 1000 = 10.2 per 1000 person-years
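The estimate and its check can be reproduced with the numbers of the example (100 people at t0, 3 new cases, 17 drop-outs, 16 enrolled, 3 years of follow-up). This is only a sketch; the function name `estimated_person_time` is an assumption made for this example.

```python
def estimated_person_time(p0, pt, follow_up):
    """Midpoint estimate: average of the start and end populations times the follow-up."""
    return (p0 + pt) / 2 * follow_up


p0 = 100          # people at t0
new_cases = 3
drop_outs = 17
enrolled = 16     # enrolled during the follow-up
follow_up = 3     # years

pt = p0 - new_cases - drop_outs + enrolled            # population at the end
person_years = estimated_person_time(p0, pt, follow_up)

# Check: people present for the whole period, plus half of the follow-up
# for everyone who entered or left (movement placed at the midpoint).
stayed = p0 - new_cases - drop_outs
check = stayed * follow_up + (new_cases + drop_outs + enrolled) * follow_up / 2

rate_per_1000 = new_cases / person_years * 1000

print(pt, person_years, check, round(rate_per_1000, 1))
```

Both routes give the same total because each person who enters or leaves is counted for half of the follow-up.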

Summary measures of location

1) Mode: the category or value that occurs most often (typical-ness). Categorical and metric discrete data

2) Median: the middle value when the values are in ascending order (central-ness). Ordinal and metric data

3) Mean (average): the sum of the values divided by the number of values

4) Percentiles: the values that divide the ordered data into 100 equal-sized groups

Choosing the most appropriate measure

                     Mode    Median                       Mean
Nominal              yes     no                           no
Ordinal              yes     yes                          no
Metric discrete      yes     yes, when markedly skewed    yes
Metric continuous    yes     yes, when markedly skewed    yes
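As an illustration, the three measures of location can be computed for the discrete metric data from the "No. of resistances" table earlier in the talk (40 patients), using Python's standard `statistics` module. Expanding the frequency table into one value per patient is a choice made for this sketch.

```python
import statistics

# No. of resistances -> No. of patients, from the relative frequency table.
counts = {0: 5, 1: 6, 2: 14, 3: 10, 4: 3, 7: 1, 8: 1}

# Expand the table into one value per patient (40 values in total).
values = [value for value, n in counts.items() for _ in range(n)]

mode = statistics.mode(values)      # most frequent value
median = statistics.median(values)  # middle value of the ordered data
mean = statistics.mean(values)      # sum of the values / number of values

print(mode, median, mean)
```

The mean is pulled above the median by the two patients with 7 and 8 resistances, which is the kind of skew for which the table suggests reporting the median.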

Summary measures of spread

• Range: distance from the smallest value to the largest

• IQR (interquartile range): spread of the middle half of the values

• Boxplot: graphical summary of the three quartile values, the minimum and maximum values, and outliers

Standard deviation

• Average distance of all the data values from the mean value

• The smaller the average distance is, the narrower the spread, and vice versa

• Used with metric data only

1. Subtract the mean from each of the n values in the sample, to give the difference values

2. Square each of these differences

3. Add these squared values together (sum of squares)

4. Divide the sum of squares by 1 less than the sample size. (n-1)

5. Take the square-root
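The five steps above can be followed literally in code. The sketch below uses a small hypothetical sample (eight made-up values, for instance days of hospitalization) and compares the result with `statistics.stdev` from the Python standard library, which also divides by n-1.

```python
import math
import statistics

# Hypothetical sample, invented for this illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Step 1: subtract the mean from each value.
differences = [x - mean for x in values]

# Step 2: square each difference.
squared = [d ** 2 for d in differences]

# Step 3: add the squared differences (sum of squares).
sum_of_squares = sum(squared)

# Step 4: divide by one less than the sample size (n - 1).
variance = sum_of_squares / (len(values) - 1)

# Step 5: take the square root.
sd = math.sqrt(variance)

print(f"mean = {mean}, SD = {sd:.3f}, library SD = {statistics.stdev(values):.3f}")
```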

Standard deviation and the normal distribution

The Basic Steps of Statistical Work

1. Design of study

• Professional design: research aim, subjects, measures, etc.

• Statistical design: sampling or allocation method, sample size, randomization, data processing, etc.

2. Collection of data

• Source of data: government report system, registration system, routine records, ad hoc survey

• Data collection: accurate, complete, timely

Protocol: place, subjects, timing; training; pilot; questionnaire; instruments; sampling method and sample size; budget

Procedure: observation, interview, filling in forms, letter, telephone, web

3. Data Sorting

• Checking: by hand, by computer software

• Amend

• Missing data?

• Grouping: according to categorical variables (sex, occupation, disease…) or according to numerical variables (age, income, blood pressure…)

4. Data Analysis

• Descriptive statistics (show the sample): mean, incidence rate…; table and plot

• Inferential statistics (towards the population): estimation, hypothesis test (comparison)

Definition of Selection Bias

Selection bias: Selection biases are distortions that result from procedures used to select subjects and from factors that influence study participation. The common element of such biases is that the association between exposure and disease is different for those who participate and those who should be theoretically eligible for study, including those who do not participate.


It is sometimes (but not always) possible to disentangle the effects of participation from those of disease determinants using standard methods for the control of confounding. One example is the bias introduced by matching in case-control studies.

Definition of Confounding

Confounding: bias in estimating an epidemiologic measure of effect resulting from an imbalance of other causes of disease in the compared groups (mixing of effects).

Characteristics of a Confounder

• associated with disease (in non-exposed)

• associated with exposure (in source population)

• not an intermediate cause
