Data analysis: 1. Describing data Ana Jerončić, PhD Department for Research in Biomedicine and Health
Dec 26, 2015
Contents of the 2nd week
1. Describing data - Central tendency and variability
2. Estimation - Accuracy, precision, standard error, confidence intervals
3. Hypothesis testing - Test statistics, P-value, choice of a statistical test
4. Interpretation of data - Causality and association, odds ratio, risk, correlation, linear regression
5. Sources of error - Type 1 and type 2 errors, power, bias, confounding
Learning outcome
Critical appraisal of scientific papers
NOT! Implementation of data analysis
Why?
To identify the best available treatment
To prevent “medical zombies”
To perform your own research
Data analysis: Describing data
1. How the data should be organized prior to data analysis
2. Data types
3. Graphical & tabular techniques for description, summary statistics
   Qualitative Data
   Quantitative Data
Height measurements among 1st year medical students
157 204 184 186 197 155 169
150 193 205 150 161 169 147
167 159 187 173 146 179 201
159 147 144 204 184 192 165
146 169 198 164 182 165 173
147 166 167 180 169 174 201
146 151 203 171 186 179 152
189 204 189 200 202 147 181
145 161 173 155 203 190 164
141 163 179 195 155 197 151
197 141 146 202 149 197 203
172 143 151 200 197 192 192
160 173 187 172 177 179 188
What is the unit of measurement?
How many observations per subject?
Variables and their values, Units of observation
Rows are OBSERVATIONS (one per entity); columns are VARIABLES.
Entity      Height (cm)   Weight (kg)   Age (years)   Sex (category)
Person 1    176           70            33            Male
Person 2    171           60            38            Female
Person 3    182           75            62            Male
...         ...           ...           ...           ...
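A minimal sketch of this layout in Python, assuming each unit of observation is stored as one record and each variable as one field. The field names and the `column` helper are illustrative assumptions, not part of the slides; the three records are the persons shown above.

```python
# One record per unit of observation (row); one key per variable (column).
# Field names are illustrative assumptions.
observations = [
    {"entity": "Person 1", "height_cm": 176, "weight_kg": 70, "age_years": 33, "sex": "Male"},
    {"entity": "Person 2", "height_cm": 171, "weight_kg": 60, "age_years": 38, "sex": "Female"},
    {"entity": "Person 3", "height_cm": 182, "weight_kg": 75, "age_years": 62, "sex": "Male"},
]


def column(records, variable):
    """Return all values of one variable, in row order."""
    return [record[variable] for record in records]


heights = column(observations, "height_cm")
variables = list(observations[0].keys())

print(len(observations), "observations,", len(variables), "variables")
print("height_cm:", heights)
```

Keeping one row per subject and one column per variable means that every table or chart later in the analysis can be built from a single column, or from a pair of columns.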
Measurement/Observation
Variable               Features of variables                     Example                Descriptive statistics   Informativeness level
Categorical, nominal   Unordered (unarranged) categories         Gender, urbanization   Number, proportion       Low
Ordinal                Ordered (arranged) categories             Grades, scales         Median                   Medium
Numerical              Arranged categories with equal intervals  Height, weight         Mean or median           High
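The table pairs each data type with the descriptive statistics it permits. A minimal sketch of that rule, assuming a hypothetical `describe` helper and made-up sample values (the grades are invented for illustration; the heights and sexes are the three persons from the earlier table):

```python
from collections import Counter
from statistics import mean, median


def describe(values, data_type):
    """Return only the summaries that the data type permits."""
    if data_type == "nominal":
        total = len(values)
        # Number and proportion of each category.
        return {cat: (n, n / total) for cat, n in Counter(values).items()}
    if data_type == "ordinal":
        return {"median": median(values)}
    if data_type == "numerical":
        return {"mean": mean(values), "median": median(values)}
    raise ValueError(f"unknown data type: {data_type}")


sex = ["Male", "Female", "Male"]   # nominal
grades = [2, 3, 3, 4, 5]           # ordinal (hypothetical grades)
heights_cm = [176, 171, 182]       # numerical

print(describe(sex, "nominal"))
print(describe(grades, "ordinal"))
print(describe(heights_cm, "numerical"))
```

A mean of the grades could be computed mechanically, but it would treat the steps between grades as equal, which ordinal data do not guarantee.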
Types of data (variables)
Categorical (qualitative): nominal, ordinal
Numerical (quantitative)
Types of data
Type of data?
Height, Grades, Age in years, Weight, Insulin concentration, Blood glucose
Type of data?
How many cigarettes do you smoke a day?
1-5 6-10 11-15 16-20 21 and more
Type of data?
Have you ever had a heart attack? Yes No
Do you suffer from hypertension? Yes No ?
Type of data?
Gender: Male Female
Type of data?
Marital status: married divorced widowed single lives alone ?
Type of data?
Education: elementary school high school two-year college four-year college ?
Type of data?
Likert scale
Claim: Violence among the youth is becoming an increasing problem in Croatia.
1 I agree completely   2 I agree   3 Undecided   4 I disagree   5 I strongly disagree
Type of data?
Visual analogue scale
E.g. the pain level that the examinee experiences
One end of the scale: I don’t feel pain
Other end of the scale: I feel intolerable pain
Variables – Transformation of variables
Numerical: distance between values is meaningful
Ordinal: attributes can be ordered
Nominal: attributes are only named; weakest
Watch out for… literature inconsistency in data type classification
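A variable can be recoded from a more informative type to a less informative one, but not back, because the exact values are lost. A minimal sketch, assuming the cigarette classes from the questionnaire example earlier in the deck; the function name and the sample counts are hypothetical.

```python
def cigarettes_to_class(per_day):
    """Recode a daily count (numerical) into ordered classes (ordinal)."""
    if per_day < 1:
        raise ValueError("the questionnaire classes start at 1 cigarette a day")
    if per_day <= 5:
        return "1-5"
    if per_day <= 10:
        return "6-10"
    if per_day <= 15:
        return "11-15"
    if per_day <= 20:
        return "16-20"
    return "21 and more"


daily_counts = [3, 12, 20, 40]  # hypothetical numerical answers
classes = [cigarettes_to_class(n) for n in daily_counts]
print(classes)
```

After the recoding, 3 and 5 cigarettes a day are indistinguishable, which is why collecting the numerical value and classifying it later keeps more options open.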
Person No.   Height [cm]
Person 1     148
Person 2     142
Person 3     154
Person 4     153
Person 5     160
Person 6     177
Person 7     204
Person 8     192
Person 9     191
Person 10    203
Person 11    197
Person 12    202
Person 13    177
Observe and Describe
Organized data are the input for graphical & tabular data representations
QUALITATIVE DATA
Tabular Techniques for Qualitative Variable(s) – YPEL5 example – Contingency Table
In one study, researchers investigated the genotype of the YPEL5 gene in a population sample from Split. They obtained the following results for 10 examinees:
Individual   YPEL5 genotype
1            AA
2            BB
3            BB
4            BB
5            AB
6            AB
7            BB
8            AA
9            AB
10           BB
Table: Frequency distribution of YPEL5 genotypes
Genotype   Frequency   Relative frequency (proportion)   Relative frequency [%] (percentage)
AA         2           0.20                              20%
AB         3           0.30                              30%
BB         5           0.50                              50%
Total      10          1.00                              100%
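The frequency table can be reproduced by counting the ten genotypes listed for the examinees. A minimal sketch using Python's standard `collections.Counter`; the function name and output layout are illustrative assumptions.

```python
from collections import Counter

# YPEL5 genotypes of individuals 1 to 10, as listed in the example.
genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]


def frequency_distribution(values):
    """Frequency, proportion and percentage for each category."""
    total = len(values)
    return {
        category: {
            "frequency": n,
            "proportion": n / total,
            "percent": 100 * n / total,
        }
        for category, n in sorted(Counter(values).items())
    }


table = frequency_distribution(genotypes)
for genotype, row in table.items():
    print(f"{genotype:<6}{row['frequency']:>3}{row['proportion']:>7.2f}{row['percent']:>6.0f}%")
print(f"{'Total':<6}{len(genotypes):>3}")
```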
Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Bar Chart
Bar charts are often used to display frequencies…
[Figure: bar chart of YPEL5 genotypes. x-axis: YPEL5 genotype (the categories’ names: BB, AA, AB); y-axis: Frequency, 0 to 5 (counts or percentages).]
Is there an association between the medicine taken and the length of cold?
[Table: cross table with two rows of percentages, (84%) (16%) (100%) and (19%) (81%) (100%).]
Graphical & Tabular Techniques
The only allowable calculation is to count the frequency of each category.
We can summarize the data in a table that presents the categories and their counts, called a frequency distribution.
A relative frequency distribution lists the categories and the proportion with which each occurs.
Graphical Techniques for Qualitative Variable(s) – Bar Chart – Pareto chart
Nominal data have no order. However, it is sometimes useful to arrange the outcomes from the most frequently occurring to the least frequently occurring. We call this bar chart representation a “Pareto chart”.
[Figure: Pareto chart. x-axis: categories’ names; y-axis: counts.]
A chart with relative frequencies is more informative.
[Figure: Pareto chart. x-axis: categories’ names; y-axis: percentages.]
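The ordering behind a Pareto chart is simply a sort by frequency, most frequent first. A minimal sketch, assuming the ten YPEL5 genotypes from the earlier example as input; the function name is hypothetical.

```python
from collections import Counter

genotypes = ["AA", "BB", "BB", "BB", "AB", "AB", "BB", "AA", "AB", "BB"]


def pareto_order(values):
    """Categories from most to least frequent, with count and percentage."""
    total = len(values)
    # Counter.most_common() sorts by count, highest first.
    return [(cat, n, 100 * n / total) for cat, n in Counter(values).most_common()]


for category, count, percent in pareto_order(genotypes):
    print(f"{category}  {'#' * count:<5}  {count} ({percent:.0f}%)")
```

Printing the count next to the percentage keeps both versions of the chart, counts and relative frequencies, in one view.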
Graphical Techniques for Qualitative Variable(s) – YPEL5 example – Pie Chart
Pie charts show relative frequencies…
[Figure: pie chart of YPEL5 genotypes: BB 50%, AA 20%, AB 30%.]
Watch out for… Authors can use percentages to hide the true size of the data. Saying that 50% of a sample has a certain condition when there are only four people in the sample clearly does not provide the same level of information as 50% of a sample of 400 people.
So, percentages should be used as additional help for the reader rather than as a replacement for the actual data.
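One way to follow this advice is to always report the count and the sample size together with the percentage. A minimal sketch; the function name and the output format are assumptions, and the two samples are the ones from the warning above.

```python
def count_with_percent(cases, sample_size):
    """Format a result as 'cases/sample size (percent)'."""
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    percent = 100 * cases / sample_size
    return f"{cases}/{sample_size} ({percent:.0f}%)"


small_sample = count_with_percent(2, 4)
large_sample = count_with_percent(200, 400)
print(small_sample)
print(large_sample)
```

Both lines show 50%, but only the full form reveals that one of them rests on four people.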
A chart that changed medicine
QUANTITATIVE DATA
Graphical Technique for Quantitative Data
Height measurements among 1st year medical students
Individual   Height (cm)
1            186
2            144
3            175
4            199
5            149
6            157
7            150
8            176
9            179
10           165
11           151
12           164
13           167
14           175
15           191
16           163
17           187
18           176
19           184
20           191
21           172
22           151
23           179
Frequency distribution for quantitative data: Building a Histogram
Category limits [cm]   Freq.   Relative Freq.   Percent Relative Freq.
>140; <=150            3       0.13             13%
>150; <=160            3       0.13             13%
>160; <=170            4       0.17             17%
>170; <=180            7       0.30             30%
>180; <=190            3       0.13             13%
>190; <=200            3       0.13             13%
Total                  23      1.00             100%
[Figure: histogram “Frequency distribution of height”. x-axis: Height [cm], classes centred at 145, 155, 165, 175, 185 and 195; y-axis: Percent Relative Frequency, 0% to 35%.]
Graphical Techniques for Quantitative Data
There are several graphical methods that are used when the data are quantitative (numeric).
The most important of these graphical methods is the histogram.
The histogram is not only a powerful graphical technique used to summarize interval data, but it is also used to help explain probabilities.
Histogram simulations http://www.shodor.org/interactivate/activities/Histogram/
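Building a histogram comes down to choosing class limits and counting how many values fall into each class. A minimal sketch, assuming the 23 height measurements listed above and 10 cm classes that are open on the left and closed on the right (>140; <=150, and so on); the function name and the text rendering are illustrative assumptions.

```python
import math

# Heights (cm) of individuals 1 to 23, as listed in the example.
heights_cm = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
              167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def histogram_counts(values, lower, width, n_classes):
    """Count values in classes (lower + k*width, lower + (k+1)*width]."""
    counts = [0] * n_classes
    for value in values:
        k = math.ceil((value - lower) / width) - 1
        if not 0 <= k < n_classes:
            raise ValueError(f"{value} lies outside the classes")
        counts[k] += 1
    return counts


counts = histogram_counts(heights_cm, lower=140, width=10, n_classes=6)
total = len(heights_cm)
for k, n in enumerate(counts):
    low = 140 + 10 * k
    print(f">{low}; <={low + 10}  {n:>2}  {n / total:.2f}  {'#' * n}")
```

Changing `width` or `lower` changes the shape of the histogram, so the class limits should always be reported with the chart.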
Summary: Graphs/Tables for Describing Data
Qualitative: Frequency Distribution (tabular summary of data), Bar Chart, Pie Chart
Quantitative: Frequency Distribution (tabular summary of data), Histogram, Line Chart (Time-Series Plot), Stem and Leaf Display
Relationship between two variables
To compare two variables we use:
Scatter plot/diagram (quantitative)
Cross table (qualitative)
Scatter plot – for two quantitative variables
Scatter plot, showing the strong association between enzyme activity at pH 5.5 and the 5α-reductase 2-specific mRNA expression, as expressed on the basis of β-actin (n = 30; rs = 0.81; 95% confidence interval, 0.64–0.91; P < 0.0001).
Patterns of Scatter Diagrams…
Linearity and direction are the two concepts we are interested in:
Positive Linear Relationship
Negative Linear Relationship
Weak or Non-Linear Relationship
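The three patterns can also be told apart numerically. A minimal sketch, assuming Pearson's correlation coefficient as the measure of linear direction (the figure cited earlier reports a Spearman coefficient, rs, so this is a swapped-in technique) and three small made-up data sets.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, size gives linear strength."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, one set per pattern.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # positive linear relationship
falling = [10, 8, 6, 4, 2]   # negative linear relationship
curved = [4, 1, 0, 1, 4]     # non-linear: y = (x - 3) ** 2

for label, y in [("rising", rising), ("falling", falling), ("curved", curved)]:
    print(f"{label:<8} r = {pearson_r(x, y):+.2f}")
```

The curved set shows why the scatter plot should be inspected first: a coefficient near zero does not mean that the two variables are unrelated, only that the relationship is not linear.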
Scatter plot
Analysis of expression level from microarray data
Squamous cell carcinoma tumor and perilesional tissue display distinctly different scatter plots from normal tissue. Expression levels for gene subset 1 in patient 1.
Cross Table - for two qualitative variables
Used to compare two qualitative variables
If first variable has r categories, second variable c categories, then we have an r×c cross table.
Association of two qualitative variables
YPEL5 genotype   Disease X: YES   Disease X: NO   TOTAL
AA               2                0               2
AB               1                3               4
BB               0                4               4
TOTAL            3                7               10
[Figure: clustered bar chart. x-axis: YPEL5 genotype (AA, AB, BB); y-axis: Frequency, 0 to 4; series: Disease, Healthy.]
Based on the data presented, do you think that YPEL5 could be associated with disease X?
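A cross table is built by counting each combination of the two variables. A minimal sketch, assuming one (genotype, disease) pair per examinee; the individual pairs are not given in the slides, so they are rebuilt here from the cell counts of the table, and the function name is hypothetical.

```python
from collections import Counter

# One (YPEL5 genotype, disease X) pair per examinee. Rebuilt from the cell
# counts of the cross table, so the order of the pairs is arbitrary.
pairs = ([("AA", "YES")] * 2 + [("AB", "YES")] * 1 +
         [("AB", "NO")] * 3 + [("BB", "NO")] * 4)


def cross_table(observations, row_levels, col_levels):
    """Return the r x c table of counts with its row and column totals."""
    counts = Counter(observations)
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return table, row_totals, col_totals


genotype_levels = ["AA", "AB", "BB"]
disease_levels = ["YES", "NO"]
table, row_totals, col_totals = cross_table(pairs, genotype_levels, disease_levels)

for genotype, row, total in zip(genotype_levels, table, row_totals):
    share = 100 * row[0] / total
    print(f"{genotype}: YES={row[0]} NO={row[1]} total={total} ({share:.0f}% with disease X)")
print("TOTAL:", col_totals, sum(col_totals))
```

The row percentages make the genotypes comparable even though the rows have different totals; with only 10 examinees they describe this sample and are not evidence of an association on their own.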
The projected clinical cost for breast cancer detection program in 2011-12, broken down by service category.
The results of measuring the height among med. students
[Table: Individual (1 to 23) and Height (cm); the same 23 measurements as listed earlier under “Graphical Technique for Quantitative Data”.]
Histogram – common mistake!
[Figure: bar chart with one bar per subject. x-axis: subjects (1 to 23); y-axis: Height [cm], 0 to 250.]
Same data – with reshuffled subjects
[Figure: the same bar chart with the subjects in a different order. x-axis: subjects (1 to 23); y-axis: Height [cm], 0 to 250.]
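The reshuffling shows why a bar per subject is not a histogram: its shape depends on the arbitrary order of the subjects, while class counts do not. A minimal sketch, assuming the same 23 heights and 10 cm right-closed classes; the function name and the random seed are arbitrary choices.

```python
import random
from collections import Counter

heights_cm = [186, 144, 175, 199, 149, 157, 150, 176, 179, 165, 151, 164,
              167, 175, 191, 163, 187, 176, 184, 191, 172, 151, 179]


def class_counts(values, width=10):
    """Count values per right-closed class, keyed by the class's upper limit."""
    # -(-v // width) is integer ceiling division, so 150 -> 150 and 151 -> 160.
    return Counter(-(-v // width) * width for v in values)


shuffled = heights_cm[:]
random.Random(0).shuffle(shuffled)  # fixed seed, chosen arbitrarily

print("bars, original order:", heights_cm[:5], "...")
print("bars, reshuffled:    ", shuffled[:5], "...")
print("class counts unchanged:", class_counts(heights_cm) == class_counts(shuffled))
```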