Introduction and descriptive statistics 29th August 2007 Tron Anders Moger.

New England Journal of Medicine, Editorial, Jan. 6, 2000, p. 42-49

• The eleven most important developments in medicine in the past millennium

– Elucidation of human anatomy and physiology

– Discovery of cells and their substructures

– Elucidation of the chemistry of life

– Application of statistics to medicine

– Development of anesthesia

– Discovery of the relation of microbes to disease

– Elucidation of inheritance and genetics

– Knowledge of the immune system

– Development of body imaging

– Discovery of antimicrobial agents

– Development of molecular pharmacotherapy

Introduction

• Much knowledge reaches us through numbers and quantitative data.

• Problems in interpreting statistical results are often underestimated.

• Important to learn “numerical literacy” – the ability to understand numbers and quantitative relationships.

Number of births in former East Germany

[Figure: number of births per year (y-axis, 0 to 240,000) against year (x-axis, 1984 to 1998)]

Mortality in Tanzania and Norway

[Figure: mortality per 1000 women per year (y-axis, 0 to 25) by age group in years (x-axis, 15-19 to 55-59), for Tanzania and Norway]

Research and numbers

• Numbers often appear in medical research.

• The numbers are often uncertain: they have variability

• They must be organized before they can be interpreted

• We wish to generalize the results to the general population

Statistical data

Appear from:

• Numerical measurements with an instrument on a continuous scale (continuous data). Examples:

– Fever: 39.6 (Unproblematic)

– IQ: 116 (Problematic)

• Categorization (categorical data). Examples:

– Man / woman (Unproblematic)

– Depressed / Not depressed (Problematic)

• Reliability: Precision of data? How much will they differ if the measurements are repeated?

• Validity: Do we capture what we are really interested in? Is the measurement relevant?

Variability in the data

Reliability of lung function measurements

6 repeated measurements on 12 students.

[Figure: PEF (l per min, y-axis, 0 to 800) by student number (x-axis)]

Reliability of questionnaire/interview

• Alcohol use (men 31-50 years):

– Mean number of times alcohol users say that they have felt intoxicated:

• 1993 (questionnaire): 14.1 times per year

• 1994 (interview): 7.3 times per year

In 1994 they used the word drunk.

Reliability of clinical study

• Sackett et al: Clinical Epidemiology (Little, Brown and Company, 1985). Pictures of the eye of 100 patients are studied by two clinicians to see if there is evidence of retinopathy

                        Second clinician
                        No      Yes
First clinician   No    46      10
                  Yes   12      32

Observed agreement: (46+32)/100 = 78%
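The observed agreement is the share of patients on whom both clinicians gave the same answer. A minimal Python sketch of that arithmetic, using the four counts from the table (the names `table` and `observed_agreement` are illustrative and not from the lecture):

```python
# Counts from the retinopathy example.
# Key = (first clinician's answer, second clinician's answer).
table = {
    ("no", "no"): 46,
    ("no", "yes"): 10,
    ("yes", "no"): 12,
    ("yes", "yes"): 32,
}


def observed_agreement(counts):
    """Share of patients on whom both clinicians gave the same answer."""
    total = sum(counts.values())
    agree = sum(n for (first, second), n in counts.items() if first == second)
    return agree / total


agreement = observed_agreement(table)
print(f"Observed agreement: {agreement:.0%}")
```

The two disagreement cells (10 and 12) only enter through the total of 100 patients.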

Sources of variation in data

• Laboratory variation

• Observer variation

• Instrument variation

• Measurement variation

• Biological variation between individuals

• Day to day variation within the same individual/hospital

Generalization

• Sample: The units, experiments, individuals etc. that are in the study. E.g.:

– 15 patients with migraine

– Neurophysiological study on rats

• Population: The collection of units etc. one wishes the results to apply to. E.g.:

– All patients with migraine

– All repetitions of the neurophysiological experiment

Pairs of terms

• Sample / Population

– Histogram / Probability distribution

– Mean / Expectation

– Proportion / Risk

– Measurements of cholesterol level / Cholesterol level in the population

– Weather / Climate

Types of data:

• Continuous data. Data measured on a continuous scale, e.g. height, weight, age. Can be truly continuous (with decimals) or discrete (integers)

• Categorical data. Data in categories, e.g. gender, education level, grouped age, hospital department. Can be nominal or ordinal.

Data in SPSS (and other statistical software):

• IMPORTANT: One line in the data file always corresponds to one observation!

• Common to have an id variable for each observation

• If a measurement is missing, leave the cell empty

• To create a new variable in SPSS, choose Data->Insert variable in the Data View window, or write the variable name under Name in the Variable View window
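The same layout rules apply outside SPSS. As a rough sketch in Python, assuming a small made-up comma-separated file (the column names and values are hypothetical), each line is one observation, there is an id column, and an empty cell is read as a missing value:

```python
import csv
import io

# Hypothetical data file: one line per observation, an id column,
# and an empty cell where a measurement is missing.
raw = """id,age,weight
1,24,62.5
2,21,
3,22,70.0
"""


def to_number(cell):
    """Convert a cell to float; an empty cell stays missing (None)."""
    return float(cell) if cell.strip() else None


rows = [
    {"id": int(r["id"]), "age": to_number(r["age"]), "weight": to_number(r["weight"])}
    for r in csv.DictReader(io.StringIO(raw))
]

n_missing_weight = sum(1 for r in rows if r["weight"] is None)
print(len(rows), "observations,", n_missing_weight, "with missing weight")
```

Keeping missing values as `None` rather than 0 prevents them from silently entering means and other summaries.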

Data coding:

• For continuous data, enter the measured value of the variable

• For categorical data, define a suitable coding, e.g. 0=male and 1=female, or 0=grammar school, 1=high school and 2=college/university degree

• In Variable View, the coding can be defined under Values

• In Label you can write further information about the variable
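A coding scheme is just a lookup from stored numbers to labels. The sketch below mirrors the two example codings in Python; the dictionary and function names are illustrative assumptions, not SPSS features:

```python
# Coding schemes taken from the examples in the list above.
GENDER_LABELS = {0: "male", 1: "female"}
EDUCATION_LABELS = {
    0: "grammar school",
    1: "high school",
    2: "college/university degree",
}


def label(code, labels):
    """Translate a stored code into its label; a missing code (None) stays None."""
    if code is None:
        return None
    return labels[code]


coded_gender = [1, 0, None, 1]  # hypothetical column with one missing value
labelled = [label(c, GENDER_LABELS) for c in coded_gender]
print(labelled)
```

Storing the numbers and keeping the labels in one place means the coding can be documented once and reused in every table.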

Descriptive statistics

• Tables

• Graphs, plots

• Measures of central tendency

• Measures of variability

Types of graphs

• Histogram

• Box-plot

• Scatter plot

• Line plot

• Bar plot

The age of 100 medical students

24 21 22 26 26
22 21 19 23 21
20 24 27 19 30
24 22 21 22 20
19 23 20 20 23
21 22 22 21 20
24 22 22 22 23
21 23 19 20 23
20 25 26 22 21
22 20 22 21 20
20 19 19 23 23
22 20 21 22 19
21 22 20 23 22
22 21 20 19 24
26 22 19 21 24
22 23 22 19 21
21 24 21 19 39
31 21 18 24 21
22 23 19 26 32
22 21 23 19 28

How can you get an overview of these data in SPSS? Explore!

• Choose Analyze - Descriptive Statistics - Explore. Select the relevant variables by clicking them, and transferring them to Dependent List. Choose Plots, remove the check on “Stem and leaf” and check “Histogram” instead. Click Continue and OK.

Histogram: The distribution of age among the students (n=100)

[Figure: histogram of number of students (y-axis, 0 to 40) by age of the medical students (x-axis, 20 to 40 years). Students from the Faculty of Medicine, cohort H98.]

Box-plot: The distribution of age among the students

[Figure: box plot of the age of the medical students (axis 15 to 40 years); labelled points: 18, 83, 97, 99, 100]

Measures of central tendency

• Mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$

The students: 22.2 years

• Median

The middle observation when the observations are arranged in increasing order

• The mean is influenced by extreme observations. The median is robust.
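To see how a single extreme observation moves the mean but not the median, here is a small Python sketch. It uses five ages from the student table, including the 39-year-old; choosing these five values is purely for illustration:

```python
from statistics import mean, median

ages = [21, 24, 21, 19, 39]  # five ages from the student table, one extreme
without_extreme = [a for a in ages if a != 39]

# The mean drops once the extreme value is removed ...
print("mean:  ", mean(ages), "->", mean(without_extreme))
# ... while the median does not move.
print("median:", median(ages), "->", median(without_extreme))
```

Removing the one extreme age lowers the mean by several years, while the median stays at 21.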

Measures of variability

• Standard deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

• Coefficient of variation: $s / \bar{x} \cdot 100\%$

The students: 13.8%

• Quartiles: Arrange the data in increasing order. The 25% quartile is at the observation where 25% of the observations have lower values, and 75% of the observations have higher values. (In SPSS: Check Percentiles in the Statistics menu in Explore)

The students:

– 25% quartile: 20.0 years

– 75% quartile: 23.0 years
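The summary measures for the students can be checked outside SPSS. The sketch below uses Python's standard `statistics` module on the 100 ages from the table. One assumption to note: `quantiles` is used with its default "exclusive" method, and SPSS may interpolate quartiles differently on other data sets.

```python
from statistics import mean, median, quantiles, stdev

# The 100 student ages from the table above.
ages = [
    24, 21, 22, 26, 26, 22, 21, 19, 23, 21,
    20, 24, 27, 19, 30, 24, 22, 21, 22, 20,
    19, 23, 20, 20, 23, 21, 22, 22, 21, 20,
    24, 22, 22, 22, 23, 21, 23, 19, 20, 23,
    20, 25, 26, 22, 21, 22, 20, 22, 21, 20,
    20, 19, 19, 23, 23, 22, 20, 21, 22, 19,
    21, 22, 20, 23, 22, 22, 21, 20, 19, 24,
    26, 22, 19, 21, 24, 22, 23, 22, 19, 21,
    21, 24, 21, 19, 39, 31, 21, 18, 24, 21,
    22, 23, 19, 26, 32, 22, 21, 23, 19, 28,
]

x_bar = mean(ages)
s = stdev(ages)                    # sample standard deviation (divisor n - 1)
cv = s / x_bar * 100               # coefficient of variation in percent
q1, q2, q3 = quantiles(ages, n=4)  # 25%, 50% and 75% quartiles

print(f"n = {len(ages)}")
print(f"mean = {x_bar:.1f}, median = {median(ages)}")
print(f"s = {s:.2f}, CV = {cv:.1f}%")
print(f"25% quartile = {q1}, 75% quartile = {q3}")
```

On these data the sketch gives the values quoted above: a mean of 22.2 years, a coefficient of variation of 13.8%, and quartiles of 20.0 and 23.0 years.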

How to get separate plots for each category of a categorical variable, e.g. gender

• Click Analyze - Descriptive Statistics - Explore. Move the continuous variable to Dependent List.

• Move gender to Factor List

• That’s it!

Separate boxplots for each gender

[Figure: box plots of the age of the medical students (y-axis, 15 to 40 years) by gender (x-axis: female, male); labelled points: 18, 83, 97, 99, 100]

What if you only want to study women? Select Cases!

• Choose Data->Select cases. Check If condition is satisfied, and click the If-button

• In the new window, write gender=1 (if women are coded as 1, and if gender is the name of the variable)

• Click Continue
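Selecting cases amounts to keeping only the rows that satisfy a condition. A minimal Python sketch of the `gender=1` condition, with hypothetical records and the assumed coding 0 = male, 1 = female:

```python
# Hypothetical records; gender is coded 0 = male, 1 = female (assumed coding).
students = [
    {"id": 1, "gender": 1, "age": 24},
    {"id": 2, "gender": 0, "age": 21},
    {"id": 3, "gender": 1, "age": 22},
]

# Keep only the cases where the condition gender = 1 is satisfied.
women = [s for s in students if s["gender"] == 1]

print([s["id"] for s in women])
```

The original list is left untouched, so the full data set remains available for later analyses.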

What if you have a continuous variable, and want to transform it to a categorical variable? Recode!

• Sometimes you have measurements on a continuous scale, but really want to use a categorical scale during analysis (e.g. measurements from 0-20, where 0-10 implies low risk, 10-15 medium risk, 15-20 high risk)

• Choose Transform->Recode->Into different variables

• Type in name of new variable under Output variable.

• Click Old and New Values, and specify the ranges for the new variable
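The recoding in the risk example can be sketched as a small Python function. The slide's ranges share the boundary values 10 and 15, so the sketch has to pick a side: it assumes that a boundary value falls in the lower category. That choice, and the function name, are assumptions and not part of the lecture.

```python
def risk_category(score):
    """Recode a 0-20 score into low / medium / high risk.

    Assumption: the boundary values 10 and 15 go to the lower category.
    A missing score (None) stays missing.
    """
    if score is None:
        return None
    if not 0 <= score <= 20:
        raise ValueError(f"score outside 0-20: {score}")
    if score <= 10:
        return "low"
    if score <= 15:
        return "medium"
    return "high"


scores = [3, 10, 12.5, 15, 18, None]  # hypothetical measurements
print([risk_category(s) for s in scores])
```

Writing the result to a new variable, as Into different variables does, keeps the original measurements available.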

Relationship between two continuous variables: Scatter plot!

• Choose Graphs - Scatter - Define. Choose a variable for the Y-axis and one for the X-axis

• Separate markers for separate groups are obtained by transferring the categorical variable to Set Markers by

• Can also include regression lines by choosing “Fit line at total”, or a line for each category by choosing “Fit line at subgroups”.

• Scatter plot, weight versus height for the students

[Figure: weight (kg, y-axis, 40 to 100) against height (cm, x-axis, 150 to 200)]

• Scatter plot, weight versus height, with regression lines

• Will talk much more about regression later

[Figure: weight (kg, y-axis, 40 to 100) against height (cm, x-axis, 150 to 200), with separate markers and regression lines by gender (male, female)]

Correlation coefficient

• A numerical measure of the relationship between two continuous variables X and Y

• Range between -1 and 1

• Values close to 0: No linear relationship

• Values close to 1 or -1: Almost linear relationship

$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$
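The correlation coefficient can be computed term by term: the sample covariance divided by the product of the two sample standard deviations. A minimal Python sketch, where the function name and the height and weight values are hypothetical:

```python
from math import sqrt


def pearson_r(x, y):
    """Correlation coefficient: sample covariance divided by s_x * s_y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


heights = [160, 170, 180, 190]  # hypothetical heights in cm
weights = [55, 65, 72, 88]      # hypothetical weights in kg
r = pearson_r(heights, weights)
print(f"r = {r:.3f}")
```

Points lying exactly on a rising straight line give r = 1, and points on a falling line give r = -1.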

What if you want to construct a new variable, which is a function of the old ones?

• Have height and weight, want a variable for BMI

• Choose Transform->Compute. Write the name of the new variable under Target variable.

• Under Numeric expression, write (weight)/(height/100)**2

• (Height was measured in cm in the example, hence divide by 100)
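The BMI computation is weight in kilograms divided by the square of height in metres. A short Python sketch of the same expression, with an illustrative function name and example values:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100  # height was recorded in cm, so convert to metres
    return weight_kg / height_m ** 2


# Hypothetical student: 70 kg and 175 cm.
print(f"BMI = {bmi(70, 175):.1f}")
```

Forgetting the division by 100 would give values ten thousand times too small, which is easy to spot in a histogram of the new variable.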

Descriptive statistics for categorical variables

• Not very useful to calculate the mean for e.g. educational level

• Would like to find the percentages within each category in the study

• Analyze->Descriptive Statistics ->Frequencies

• Move the variable to Variable(s)

Frequency table: Educational Level (years)

Years   Frequency   Percent   Valid Percent   Cumulative Percent
8       53          11.2      11.2            11.2
12      190         40.1      40.1            51.3
14      6           1.3       1.3             52.5
15      116         24.5      24.5            77.0
16      59          12.4      12.4            89.5
17      11          2.3       2.3             91.8
18      9           1.9       1.9             93.7
19      27          5.7       5.7             99.4
20      2           0.4       0.4             99.8
21      1           0.2       0.2             100.0
Total   474         100.0     100.0

The last column shows the cumulative distribution; it always ends at 100%
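The percent and cumulative percent columns follow directly from the frequencies. A minimal Python sketch using the counts from the frequency table (variable names are illustrative):

```python
# Frequencies from the table: years of education -> number of persons.
counts = {8: 53, 12: 190, 14: 6, 15: 116, 16: 59,
          17: 11, 18: 9, 19: 27, 20: 2, 21: 1}

total = sum(counts.values())
running = 0
table = []  # rows of (years, frequency, percent, cumulative percent)
for years in sorted(counts):
    running += counts[years]
    table.append((years, counts[years],
                  100 * counts[years] / total,
                  100 * running / total))

for years, freq, pct, cum in table:
    print(f"{years:>5} {freq:>5} {pct:>6.1f} {cum:>6.1f}")
print(f"Total {total:>5}")
```

The cumulative percent is computed from the running count rather than by adding rounded percentages, which avoids accumulating rounding error.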

Simple bar plot

Relationships between categorical variables

• Choose Analyze->Descriptive Statistics ->Crosstabs

• Move one variable to Rows, and another to Columns

• Click Cells, and check relevant percentages (Rows, Columns or Total)

Crosstable: Relationship between race and smoking

race * smoking status Crosstabulation

Race    Statistic        Non-smoking   Smoking   Total
white   Count            44            52        96
        % within race    45.8%         54.2%     100.0%
black   Count            16            10        26
        % within race    61.5%         38.5%     100.0%
other   Count            55            12        67
        % within race    82.1%         17.9%     100.0%
Total   Count            115           74        189
        % within race    60.8%         39.2%     100.0%
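The "% within race" rows are row percentages: each count divided by its row total. A Python sketch using the counts from the crosstable (names are illustrative):

```python
# Counts from the crosstable: race -> smoking status -> count.
crosstab = {
    "white": {"non-smoking": 44, "smoking": 52},
    "black": {"non-smoking": 16, "smoking": 10},
    "other": {"non-smoking": 55, "smoking": 12},
}


def row_percentages(row):
    """Percentage of each category within one row of the table."""
    total = sum(row.values())
    return {status: 100 * n / total for status, n in row.items()}


row_pct = {race: row_percentages(row) for race, row in crosstab.items()}

# The Total row: sum the counts over races, then take percentages again.
column_totals = {
    status: sum(row[status] for row in crosstab.values())
    for status in ("non-smoking", "smoking")
}
total_pct = row_percentages(column_totals)

for race, pct in row_pct.items():
    print(race, {status: round(p, 1) for status, p in pct.items()})
print("Total", {status: round(p, 1) for status, p in total_pct.items()})
```

Column percentages would use the same function on the transposed table, which corresponds to checking Columns instead of Rows under Cells.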

Bar plot: Relationship between race and smoking

Line plot for ordinal categorical variables (time-series plot)

Conclusion

• There are many different ways to present results

• You will (hopefully) learn to understand which option is most relevant for each problem during this course
