Introduction and descriptive statistics 29th August 2007 Tron Anders Moger
Jan 02, 2016
New England Journal of Medicine, Editorial, Jan. 6, 2000, p. 42-49
• The eleven most important developments in medicine in the past millennium
– Elucidation of human anatomy and physiology
– Discovery of cells and their substructures
– Elucidation of the chemistry of life
– Application of statistics to medicine
– Development of anesthesia
– Discovery of the relation of microbes to disease
– Elucidation of inheritance and genetics
– Knowledge of the immune system
– Development of body imaging
– Discovery of antimicrobial agents
– Development of molecular pharmacotherapy
Introduction
• A lot of knowledge appears through numbers and quantitative data.
• Problems in interpreting statistical results are often underestimated.
• Important to learn “numerical literacy” – the ability to understand numbers and quantitative relationships.
Number of births in former East Germany
[Figure: number of births per year (0 to 240,000) by year, 1984 to 1998]
Mortality in Tanzania and Norway
[Figure: mortality per 1000 women per year (0 to 25) by age group in years (15-19 to 55-59), for Tanzania and Norway]
Research and numbers
• Numbers often appear in medical research.
• The numbers are often uncertain; they have variability.
• They must be organized before they can be interpreted.
• We wish to generalize the results to the general population.
Statistical data
Appear from:
• Numerical measurements with an instrument on a continuous scale (continuous data). Examples:
– Fever: 39.6 (Unproblematic)
– IQ: 116 (Problematic)
• Categorization (categorical data). Examples:
– Man / woman (Unproblematic)
– Depressed / Not depressed (Problematic)
• Reliability: Precision of data? How much will they differ if the measurements are repeated?
• Validity: Do we capture what we are really interested in? Is the measurement relevant?
Variability in the data
Reliability of lung function measurements
6 repeated measurements on 12 students.
[Figure: PEF (l per min, 0 to 800) for each student number, showing the repeated measurements per student]
Reliability of questionnaire/interview
• Alcohol use (men 31-50 years):
– Mean number of times alcohol users say that they have felt intoxicated:
• 1993 (questionnaire): 14.1 times per year
• 1994 (interview): 7.3 times per year
In 1994 they used the word drunk.
Reliability of clinical study
• Sackett et al: Clinical Epidemiology (Little, Brown and Company, 1985). Pictures of the eye of 100 patients are studied by two clinicians to see if there is evidence of retinopathy
                        Second clinician
                        No     Yes
First clinician   No    46     10
                  Yes   12     32
Observed agreement: (46+32)/100 = 78%
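The observed agreement can be checked with a few lines of code. This is a minimal sketch using the counts from the table; the variable names are chosen for illustration only.

```python
# Counts from the 2x2 table: key = (first clinician, second clinician).
table = {
    ("no", "no"): 46,
    ("no", "yes"): 10,
    ("yes", "no"): 12,
    ("yes", "yes"): 32,
}

total = sum(table.values())

# The clinicians agree when both give the same answer (the diagonal cells).
agree = sum(count for (first, second), count in table.items() if first == second)
observed_agreement = agree / total

print(f"Observed agreement: {agree}/{total} = {observed_agreement:.0%}")
```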
Sources of variation in data
• Laboratory variation
• Observer variation
• Instrument variation
• Measurement variation
• Biological variation between individuals
• Day to day variation within the same individual/hospital
Generalization
• Sample: The units, experiments, individuals etc. that are in the study. E.g.:
– 15 patients with migraine
– Neurophysiological study on rats
• Population: The collection of units etc. one wishes the results to apply to
– All patients with migraine
– All repetitions of the neurophysiological experiment
Pairs of terms
• Sample
– Histogram
– Mean
– Proportion
– Measurements of cholesterol level
– Weather
• Population
– Probability distribution
– Expectation
– Risk
– Cholesterol level in the population
– Climate
Types of data:
• Continuous data. Data measured on a continuous scale, e.g. height, weight, age. Can be truly continuous (with decimals) or discrete (integers)
• Categorical data. Data in categories, e.g. gender, education level, grouped age, hospital department. Can be nominal or ordinal.
Data in SPSS (and other statistical software):
• IMPORTANT: One line in the data file always corresponds to one observation!
• Common to have an id variable for each observation
• If a measurement is missing, leave the cell empty
• To create a new variable in SPSS, choose Data->Insert variable in the Data View window, or write the variable name in Name in the Variable View window
Data coding:
• For continuous data, enter the value of the variable
• For categorical data, define a suitable coding, e.g. 0=male and 1=female, or 0=grammar school, 1=high school and 2=college/university degree
• In Variable View, the coding can be defined in Values
• In Label you can write further information about the variable
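Outside SPSS, the same layout and coding ideas can be sketched in Python. The records below are hypothetical and only illustrate the rules: one row per observation, an id variable, an empty value (None) for a missing measurement, and numeric codes with labels for categorical data.

```python
# Hypothetical records: one dict per observation, each with an id variable.
# None marks a missing measurement (the empty cell in the data file).
records = [
    {"id": 1, "gender": 0, "education": 2, "age": 24},
    {"id": 2, "gender": 1, "education": 1, "age": None},
    {"id": 3, "gender": 1, "education": 0, "age": 21},
]

# Value labels, playing the role of Values in Variable View.
value_labels = {
    "gender": {0: "male", 1: "female"},
    "education": {
        0: "grammar school",
        1: "high school",
        2: "college/university degree",
    },
}


def label(variable, code):
    """Return the text label for a coded value, or None if the value is missing."""
    if code is None:
        return None
    return value_labels[variable][code]


for row in records:
    print(row["id"], label("gender", row["gender"]),
          label("education", row["education"]), row["age"])
```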
Descriptive statistics
• Tables
• Graphs, plots
• Measures of central tendency
• Measures of variability
Types of graphs
• Histogram
• Box-plot
• Scatter plot
• Line plot
• Bar plot
The age of 100 medical students
24 21 22 26 26 22 21 19 23 21
20 24 27 19 30 24 22 21 22 20
19 23 20 20 23 21 22 22 21 20
24 22 22 22 23 21 23 19 20 23
20 25 26 22 21 22 20 22 21 20
20 19 19 23 23 22 20 21 22 19
21 22 20 23 22 22 21 20 19 24
26 22 19 21 24 22 23 22 19 21
21 24 21 19 39 31 21 18 24 21
22 23 19 26 32 22 21 23 19 28
How can you get an overview of these data in SPSS? Explore!
• Choose Analyze - Descriptive Statistics - Explore. Select the relevant variables by clicking them and transferring them to Dependent List. Choose Plots, remove the check on “Stem and leaf” and check “Histogram” instead. Click Continue and OK.
Histogram: The distribution of age among the students (n=100)
[Figure: histogram of the age of the medical students (20 to 40 years) against number of students (0 to 40)]
Students from the Faculty of Medicine, class of autumn 1998.
Box-plot: The distribution of age among the students
[Figure: box plot of the age of the medical students (15 to 40 years), with outlying observations marked by case number]
Measures of central tendency
• Mean
The students: 22.2 years
• Median
The middle observation when the observations are arranged in increasing order
The students: 22.0 years
• The mean is influenced by extreme observations. The median is robust
Mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
Measures of variability
• Standard deviation
The students: 3.06 years
• Coefficient of variation: $s/\bar{x} \cdot 100\%$
The students: 13.8%
• Quartiles: Arrange the data in increasing order. The 25% quartile is at the observation where 25% of the observations have lower values, and 75% of the observations have higher values. (In SPSS: Check Percentiles in the Statistics menu in Explore)
The students:
25% quartile: 20.0 years
75% quartile: 23.0 years
Standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
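The summary measures for the students can be reproduced outside SPSS. The sketch below uses Python's standard `statistics` module on the 100 ages listed earlier. Note that quartile definitions differ slightly between programs, so other data sets may give small differences from SPSS.

```python
import statistics

# The ages of the 100 medical students, as listed on the earlier slide.
ages = [
    24, 21, 22, 26, 26, 22, 21, 19, 23, 21,
    20, 24, 27, 19, 30, 24, 22, 21, 22, 20,
    19, 23, 20, 20, 23, 21, 22, 22, 21, 20,
    24, 22, 22, 22, 23, 21, 23, 19, 20, 23,
    20, 25, 26, 22, 21, 22, 20, 22, 21, 20,
    20, 19, 19, 23, 23, 22, 20, 21, 22, 19,
    21, 22, 20, 23, 22, 22, 21, 20, 19, 24,
    26, 22, 19, 21, 24, 22, 23, 22, 19, 21,
    21, 24, 21, 19, 39, 31, 21, 18, 24, 21,
    22, 23, 19, 26, 32, 22, 21, 23, 19, 28,
]

n = len(ages)
mean = statistics.mean(ages)
median = statistics.median(ages)
sd = statistics.stdev(ages)        # sample standard deviation (divides by n - 1)
cv = sd / mean * 100               # coefficient of variation in percent
q1, q2, q3 = statistics.quantiles(ages, n=4)  # 25%, 50% and 75% cut points

print(f"n = {n}")
print(f"mean = {mean:.1f}, median = {median:.1f}")
print(f"sd = {sd:.2f}, cv = {cv:.1f}%")
print(f"25% quartile = {q1:.1f}, 75% quartile = {q3:.1f}")
```

The single 39-year-old pulls the mean slightly above the median, which illustrates why the median is the more robust measure.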
How to get separate plots for each category of a categorical variable, e.g. gender
• Click Analyze - Descriptive Statistics - Explore. Move the continuous variable to Dependent List.
• Move gender to Factor List
• That’s it!
Separate boxplots for each gender
[Figure: box plots of the age of the medical students (15 to 40 years) by gender, one for women and one for men]
What if you only want to study women? Select Cases!
• Choose Data->Select cases. Check If condition is satisfied, and click the If-button
• In the new window, write gender=1 (if women are coded as 1, and if gender is the name of the variable)
• Click Continue
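The same selection can be sketched in code. The data below are hypothetical, and the coding 1 = women is an assumption matching the example condition gender=1.

```python
# Hypothetical data; gender is assumed to be coded 0 = male, 1 = female.
students = [
    {"id": 1, "gender": 1, "age": 24},
    {"id": 2, "gender": 0, "age": 21},
    {"id": 3, "gender": 1, "age": 22},
    {"id": 4, "gender": 0, "age": 26},
]

# Keep only the cases that satisfy the condition gender == 1.
women = [row for row in students if row["gender"] == 1]

print([row["id"] for row in women])
```

The original list is left untouched, so the full data set is still available for later analyses.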
What if you have a continuous variable, and want to transform it to a categorical variable? Recode!
• Sometimes you have measurements on a continuous scale, but really want to use a categorical scale during analysis (e.g. measurements from 0-20, where 0-10 implies low risk, 10-15 medium risk, 15-20 high risk)
• Choose Transform->Recode->Into different variables
• Type in name of new variable under Output variable.
• Click Old and New Values, and specify the ranges for the new variable
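A recode like the risk example can be sketched as a small function. The ranges on the slide share their end points (10 and 15), so the sketch has to pick a side. Here boundary values go to the lower category, which is an assumption and not something the slide specifies.

```python
def recode_risk(score):
    """Map a score on the 0-20 scale to a risk category.

    Assumption: boundary values go to the lower category
    (10 -> low, 15 -> medium); the slide does not say which side they belong to.
    """
    if score is None:
        return None  # keep missing values missing
    if not 0 <= score <= 20:
        raise ValueError(f"score outside the 0-20 scale: {score}")
    if score <= 10:
        return "low"
    if score <= 15:
        return "medium"
    return "high"


scores = [3, 10, 12.5, 15, 18, None]  # hypothetical measurements
risk = [recode_risk(s) for s in scores]
print(risk)
```

Storing the result in a new list mirrors recoding "into different variables": the original measurements are kept.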
Relationship between two continuous variables: Scatter plot!
• Choose Graphs - Scatter - Define. Choose a variable for the Y-axis and one for the X-axis
• Separate markers for separate groups are obtained by transferring the categorical variable to Set Markers by
• Can also include regression lines by choosing “Fit line at total”, or a line for each category by choosing “Fit line at subgroups”.
• Scatter plot, weight versus height for the students
[Figure: scatter plot of weight (kg, 40 to 100) against height (cm, 150 to 200)]
• Scatter plot, weight versus height, with regression lines
• Will talk much more about regression later
[Figure: scatter plot of weight (kg, 40 to 100) against height (cm, 150 to 200), with separate markers and regression lines by gender (men, women)]
Correlation coefficient
• A numerical measure of the relationship between two continuous variables X and Y
• Ranges between -1 and 1
• Values close to 0: No linear relationship
• Values close to 1 or -1: Almost linear relationship
$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$
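The formula can be followed step by step in code. This is a minimal sketch; the heights and weights are hypothetical values, not the students' data.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient r = Cov(x, y) / (s_x * s_y), with n - 1 in the denominators."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical heights (cm) and weights (kg) for six people.
height = [160, 165, 170, 175, 180, 185]
weight = [55, 62, 63, 70, 76, 80]

r = pearson_r(height, weight)
print(f"r = {r:.2f}")
```

Points that lie exactly on a rising straight line give r = 1, and points on a falling line give r = -1.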
What if you want to construct a new variable, which is a function of the old ones?
• Have height and weight, want a variable for BMI
• Choose Transform->Compute. Write the name of the new variable under Target variable.
• Under Numeric expression, write weight/(height/100)**2
• (Height was measured in cm in the example, hence divide by 100)
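The same computation as a short sketch, with a hypothetical person of 70 kg and 175 cm:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100  # height is recorded in cm, so convert to metres
    return weight_kg / height_m ** 2


print(round(bmi(70, 175), 1))
```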
Descriptive statistics for categorical variables
• Not very useful to calculate the mean for e.g. educational level
• Would like to find the percentages within each category in the study
• Analyze->Descriptive Statistics ->Frequencies
• Move the variable to Variable(s)
Frequency table
Educational Level (years)
              Frequency   Percent   Valid Percent   Cumulative Percent
Valid   8         53        11.2        11.2              11.2
        12       190        40.1        40.1              51.3
        14         6         1.3         1.3              52.5
        15       116        24.5        24.5              77.0
        16        59        12.4        12.4              89.5
        17        11         2.3         2.3              91.8
        18         9         1.9         1.9              93.7
        19        27         5.7         5.7              99.4
        20         2         0.4         0.4              99.8
        21         1         0.2         0.2             100.0
        Total    474       100.0       100.0
The last column shows the cumulative distribution; it always ends at 100%
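The percent and cumulative percent columns follow directly from the counts. A minimal sketch using the frequencies from the table:

```python
# Number of persons per educational level (years), from the frequency table.
counts = {8: 53, 12: 190, 14: 6, 15: 116, 16: 59, 17: 11, 18: 9, 19: 27, 20: 2, 21: 1}

total = sum(counts.values())

rows = []
running = 0
for years in sorted(counts):
    running += counts[years]                 # cumulative count so far
    percent = 100 * counts[years] / total
    cumulative = 100 * running / total
    rows.append((years, counts[years], round(percent, 1), round(cumulative, 1)))

for years, freq, percent, cumulative in rows:
    print(f"{years:>5} {freq:>5} {percent:>6.1f} {cumulative:>6.1f}")
print(f"Total {total:>5}")
```

The cumulative percent is computed from the running count rather than by adding rounded percentages, which avoids accumulating rounding errors.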
Simple bar plot
Relationships between categorical variables
• Choose Analyze->Descriptive Statistics ->Crosstabs
• Move one variable to Rows, and another to Columns
• Click Cells, and check relevant percentages (Rows, Columns or Total)
Crosstable: Relationship between race and smoking
race * smoking status Crosstabulation
                                    smoking status
                               Non-smoking   Smoking    Total
race    white   Count               44          52        96
                % within race      45.8%       54.2%    100.0%
        black   Count               16          10        26
                % within race      61.5%       38.5%    100.0%
        other   Count               55          12        67
                % within race      82.1%       17.9%    100.0%
Total           Count              115          74       189
                % within race      60.8%       39.2%    100.0%
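The row percentages ("% within race") are each count divided by its row total. A minimal sketch using the counts from the table:

```python
# Counts from the crosstable: race by smoking status.
counts = {
    "white": {"Non-smoking": 44, "Smoking": 52},
    "black": {"Non-smoking": 16, "Smoking": 10},
    "other": {"Non-smoking": 55, "Smoking": 12},
}

# Row percentages: each count as a percentage of its row total.
row_percent = {}
for race, row in counts.items():
    row_total = sum(row.values())
    row_percent[race] = {
        status: round(100 * count / row_total, 1) for status, count in row.items()
    }

# Column totals and grand total for the Total row.
column_totals = {
    status: sum(row[status] for row in counts.values())
    for status in ("Non-smoking", "Smoking")
}
grand_total = sum(column_totals.values())

for race, percents in row_percent.items():
    print(race, percents)
print("Total", column_totals, grand_total)
```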
Bar plot: Relationship between race and smoking
Line plot for ordinal categorical variables (time-series plot)
Conclusion
• There are many different ways to present results
• You will (hopefully) learn to understand which option is most relevant for each problem during this course