1 Introduction to Statistics Colm O’Dushlaine Neuropsychiatric Genetics, TCD codushlaine@gmail.com
Page 1: 1 Introduction to Statistics Colm O’Dushlaine Neuropsychiatric Genetics, TCD codushlaine@gmail.com.


Introduction to Statistics

Colm O’Dushlaine

Neuropsychiatric Genetics, TCD

codushlaine@gmail.com


Overview

Descriptive Statistics & Graphical Presentation of Data

Statistical Inference
– Hypothesis Tests & Confidence Intervals
– T-tests (Paired/Two-sample)
– Regression (SLR & Multiple Regression)
– ANOVA/ANCOVA

Intended as an overview. Slides will be provided after the lectures.

What’s in the lectures?...


Lecture 1: Descriptive Statistics and Graphical Presentation of Data

1. Terminology

2. Frequency Distributions/Histograms

3. Measures of data location

4. Measures of data spread

5. Box-plots

6. Scatter-plots

7. Clustering (Multivariate Data)


Lecture 2: Statistical Inference

1. Distributions & Densities

2. Normal Distribution

3. Sampling Distribution & Central Limit Theorem

4. Hypothesis Tests

5. P-values

6. Confidence Intervals

7. Two-Sample Inferences

8. Paired Data


Lecture 3: Sample Inferences

1. Two-Sample Inferences
– Paired t-test
– Two-sample t-test

2. Inferences for more than two samples
– One-way ANOVA
– Two-way ANOVA
– Interactions in Two-way ANOVA

3. DataDesk demo


Lecture 4

1. Regression

2. Correlation

3. Multiple Regression

4. ANCOVA

5. Normality Checks

6. Non-parametrics

7. Sample Size Calculations

8. Useful tools and websites


Explanations of outputs

Videos with commentary

Help with deciding what test to use with what data

FIRST, A REALLY USEFUL SITE


1. Terminology

Populations & Samples

Population: the complete set of individuals, objects or scores of interest.
– Often too large to sample in its entirety
– It may be real or hypothetical (e.g. the results from an experiment repeated ad infinitum)

Sample: a subset of the population.
– A sample may be classified as random (each member has an equal chance of being selected from the population) or convenience (what’s available).
– Random selection attempts to ensure the sample is representative of the population.


Variables

Variables are the quantities measured in a sample. They may be classified as:

Quantitative, i.e. numerical
– Continuous (e.g. pH of a sample, patient cholesterol levels)
– Discrete (e.g. number of bacteria colonies in a culture)

Categorical
– Nominal (e.g. gender, blood group)
– Ordinal (ranked, e.g. mild, moderate or severe illness). Often ordinal variables are re-coded to be quantitative.


Variables

Variables can be further classified as:

Dependent/Response: the variable of primary interest (e.g. blood pressure in an antihypertensive drug trial). Not controlled by the experimenter.

Independent/Predictor
– called a Factor when controlled by the experimenter. It is often nominal (e.g. treatment)
– called a Covariate when not controlled.

If the value of a variable cannot be predicted in advance then the variable is referred to as a random variable.


Parameters & Statistics

Parameters: Quantities that describe a population characteristic. They are usually unknown and we wish to make statistical inferences about parameters. Different to perimeters.

Descriptive Statistics: Quantities and techniques used to describe a sample characteristic or illustrate the sample data e.g. mean, standard deviation, box-plot


2. Frequency Distributions

An (Empirical) Frequency Distribution or Histogram for a continuous variable presents the counts of observations grouped within pre-specified classes or groups

A Relative Frequency Distribution presents the corresponding proportions of observations within the classes

A Barchart presents the frequencies for a categorical variable


Example – Serum CK

Blood samples were taken from 36 male volunteers as part of a study to determine the natural variation in CK concentration.

The serum CK concentrations, measured in U/l, are as follows:


Serum CK Data for 36 male volunteers

121 82 100 151 68 58

95 145 64 201 101 163

84 57 139 60 78 94

119 104 110 113 118 203

62 83 67 93 92 110

25 123 70 48 95 42


Relative Frequency Table

Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39            1           0.028                0.028
40-59            4           0.111                0.139
60-79            7           0.194                0.333
80-99            8           0.222                0.556
100-119          8           0.222                0.778
120-139          3           0.083                0.861
140-159          2           0.056                0.917
160-179          1           0.028                0.944
180-199          0           0.000                0.944
200-219          2           0.056                1.000
Total            36          1.000
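As an illustrative sketch (Python is an assumption here, not a tool used in the slides), the table can be rebuilt from the 36 raw CK values. The function name `frequency_table` and the constants are hypothetical helpers. Cumulative proportions are computed from cumulative counts, so they may differ in the third decimal from values obtained by adding up already-rounded proportions.

```python
from collections import Counter

# Serum CK concentrations (U/l) for the 36 volunteers, as listed on the slide.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

FIRST_BIN = 20   # lower edge of the first class (20-39)
BIN_WIDTH = 20   # class width used in the table


def frequency_table(values, first_bin=FIRST_BIN, width=BIN_WIDTH):
    """Return rows of (class label, count, relative freq, cumulative relative freq)."""
    # Map each value to the index of the class it falls in, then count per class.
    counts = Counter((v - first_bin) // width for v in values)
    n = len(values)
    rows, cumulative = [], 0
    for k in range(max(counts) + 1):
        low = first_bin + k * width
        count = counts.get(k, 0)          # empty classes (e.g. 180-199) get 0
        cumulative += count
        rows.append((f"{low}-{low + width - 1}", count, count / n, cumulative / n))
    return rows


table = frequency_table(ck)
for label, count, rel, cum in table:
    print(f"{label:>8} {count:3d} {rel:6.3f} {cum:6.3f}")
```

Dividing the cumulative count by n at each step avoids carrying rounding error forward from one row to the next.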


Frequency Distribution

[Figure: histogram of CK concentration (U/l); x-axis 20 to 220 in steps of 20, y-axis Frequency (ticks 2 to 8), shown with the summary output below]

Quantiles
100.0%  maximum   203.00
99.5%             203.00
97.5%             203.00
90.0%             154.60
75.0%   quartile  118.75
50.0%   median     94.50
25.0%   quartile   67.25
10.0%              54.30
2.5%               25.00
0.5%               25.00
0.0%    minimum    25.00

Moments
Mean             98.277778
Std Dev          40.380767
Std Err Mean     6.7301278
upper 95% Mean   111.94066
lower 95% Mean   84.614892
N                36


Relative Frequency Distribution

[Figure: relative frequency histogram of CK concentration (U/l); x-axis 20 to 220 in steps of 20, y-axis Relative Frequency (ticks 0.05 to 0.20); annotations mark the Mode, the Left tail and the Right tail (skewed)]

Shaded area is percentage of males with CK values between 60 and 100 U/l, i.e. 42%.


3. Measures of Central Tendency (Location)

Measures of location indicate where on the number line the data are to be found. Common measures of location are:

(i) the Arithmetic Mean,

(ii) the Median, and

(iii) the Mode


The Mean

Let $x_1, x_2, x_3, \ldots, x_n$ be the realised values of a random variable $X$, from a sample of size $n$. The sample arithmetic mean is defined as:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$


Example

Example 2: The systolic blood pressures (mm Hg) of seven middle-aged men were as follows:

151, 124, 132, 170, 146, 124 and 113.

The mean is

$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14$$


The Median and Mode

If the sample data are arranged in increasing order, the median is

(i) the middle value if n is an odd number, or

(ii) midway between the two middle values if n is an even number

The mode is the most commonly occurring value.


Example 1 – n is odd

The reordered systolic blood pressure data seen earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered data, i.e. 132.

Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.


Example 2 – n is even

Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows:

366, 327, 274, 292, 274 and 230.

Rearrange the data in numerical order as follows:

230, 274, 274, 292, 327 and 366.

The Median is half way between the middle two readings, i.e. (274 + 292)/2 = 283.

Two men have the same cholesterol level, so the Mode is 274.
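The two worked examples (odd and even n) can be checked with a short sketch. It uses Python's standard `statistics` module purely for illustration; the slides themselves do not prescribe any software, and the variable names are hypothetical.

```python
import statistics

# Systolic blood pressure (mm Hg) of the seven men: n is odd.
bp = [151, 124, 132, 170, 146, 124, 113]
# Cholesterol (mg/dL) of the six men: n is even.
chol = [366, 327, 274, 292, 274, 230]

bp_mean = statistics.mean(bp)        # sum of the values divided by n
bp_median = statistics.median(bp)    # middle value of the sorted data
bp_mode = statistics.mode(bp)        # most commonly occurring value

chol_median = statistics.median(chol)  # midway between the two middle values
chol_mode = statistics.mode(chol)

print(round(bp_mean, 2), bp_median, bp_mode)
print(chol_median, chol_mode)
```

Note that `statistics.median` sorts the data itself, so the values can be passed in their original order.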


Mean versus Median

Large sample values tend to inflate the mean. This will happen if the histogram of the data is right-skewed.

The median is not influenced by large sample values and is a better measure of centrality if the distribution is skewed.

Note: if the distribution is symmetric (and unimodal) then mean = median = mode. A mean noticeably larger than the median suggests a right-skewed distribution.

For example, in the CK measurement study the sample mean is 98.28 and the median is 94.5; the mean exceeds the median because it is inflated by the two large data values, 201 and 203.
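To see how much the two large values matter, the sketch below recomputes the mean and median of the CK data with and without 201 and 203. Dropping the two values is done only to illustrate sensitivity; it is not a recommendation to discard data, and Python is an assumption of this example.

```python
import statistics

# Serum CK concentrations (U/l) from the slide.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

mean_all = statistics.mean(ck)
median_all = statistics.median(ck)

# Same statistics with the two largest values (201 and 203) left out.
without_top_two = sorted(ck)[:-2]
mean_trimmed = statistics.mean(without_top_two)
median_trimmed = statistics.median(without_top_two)

print(f"all 36 values : mean={mean_all:.2f}  median={median_all}")
print(f"without top 2 : mean={mean_trimmed:.2f}  median={median_trimmed}")
```

The mean drops by about 6 U/l while the median moves by only 1 U/l, which is the robustness property described above.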


4. Measures of Dispersion

Measures of dispersion characterise how spread out the distribution is, i.e., how variable the data are.

Commonly used measures of dispersion include:

1. Range

2. Variance & Standard deviation

3. Coefficient of Variation (or relative standard deviation)

4. Inter-quartile range


Range

The sample Range is the difference between the largest and smallest observations in the sample
– easy to calculate; blood pressure example: min = 113 and max = 170, so the range = 57 mmHg
– useful for “best” or “worst” case scenarios
– sensitive to extreme values


Sample Variance

The sample variance, $s^2$, is essentially the average of the squared deviations from the sample mean, using the divisor $n-1$ rather than $n$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$


Standard Deviation

The sample standard deviation, $s$, is the square-root of the variance:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$

$s$ has the advantage of being in the same units as the original variable $x$


Example

Data          Deviation     Deviation²
151           13.86         192.02
124           -13.14        172.73
132           -5.14         26.45
170           32.86         1079.59
146           8.86          78.45
124           -13.14        172.73
113           -24.14        582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86

$\bar{x} = 137.14$


Example (contd.)

Therefore,

$$\sum_{i=1}^{7}\left(x_i - \bar{x}\right)^2 = 2304.86$$

$$s = \sqrt{\frac{2304.86}{7-1}} = 19.6$$
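The hand calculation can be retraced step by step in a few lines. This is a sketch in Python (an assumption of this example, not part of the lectures); the final line cross-checks the result against `statistics.stdev`, which also uses the n − 1 divisor.

```python
import math
import statistics

# Systolic blood pressure data (mm Hg) from the worked example.
bp = [151, 124, 132, 170, 146, 124, 113]

n = len(bp)
mean_bp = sum(bp) / n

# Deviations from the mean and their squares, as in the table.
deviations = [x - mean_bp for x in bp]
sum_sq = sum(d ** 2 for d in deviations)

variance = sum_sq / (n - 1)   # sample variance, divisor n - 1
sd = math.sqrt(variance)      # sample standard deviation

print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {variance:.2f}, s = {sd:.1f}")

# Cross-check against the standard library implementation.
assert math.isclose(sd, statistics.stdev(bp))
```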


Coefficient of Variation

The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.

$$CV = \frac{s}{\bar{x}} \times 100\%$$

The CV is not affected by multiplicative changes in scale

Consequently, it is a useful way of comparing the dispersion of variables measured on different scales


Example

The CV of the blood pressure data is:

$$CV = 100\% \times \frac{19.6}{137.1} = 14.3\%$$

i.e., the standard deviation is 14.3% as large as the mean.
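A small sketch of the CV calculation, which also illustrates that a multiplicative change of scale leaves the CV unchanged. The helper name `coefficient_of_variation` is hypothetical, and the mm Hg to kPa factor (about 0.133322) is used only as an example rescaling.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


# Systolic blood pressure data in mm Hg.
bp_mmhg = [151, 124, 132, 170, 146, 124, 113]
# The same data on a different scale (approximate conversion to kPa).
bp_kpa = [x * 0.133322 for x in bp_mmhg]

cv_mmhg = coefficient_of_variation(bp_mmhg)
cv_kpa = coefficient_of_variation(bp_kpa)

print(f"CV (mm Hg) = {cv_mmhg:.1f}%")
print(f"CV (kPa)   = {cv_kpa:.1f}%")
```

Both the standard deviation and the mean are multiplied by the same factor, so the factor cancels in the ratio.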


Inter-quartile range

The Median divides a distribution into two halves.

The first and third quartiles (denoted Q1 and Q3) are defined as follows: 25% of the data lie below Q1 (and 75% is above Q1),

25% of the data lie above Q3 (and 75% is below Q3)

The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e. IQR = Q3- Q1


Example

The ordered blood pressure data are:

113 124 124 132 146 151 170

Q1 = 124 and Q3 = 151

Inter Quartile Range (IQR) is 151 − 124 = 27


60% of slides complete!


5. Box-plots

A box-plot is a visual description of the distribution based on:
– Minimum
– Q1
– Median
– Q3
– Maximum

Useful for comparing large sets of data


Example 1

The pulse rates of 12 individuals arranged in increasing order are:

62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Q1 = (68+70)/2 = 69, Q3 = (76+78)/2 = 77

IQR = (77 – 69) = 8
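The quartiles in this example (and in the earlier blood pressure example) are consistent with taking the medians of the lower and upper halves of the sorted data. The sketch below assumes that convention; the function name `quartiles_by_halves` is hypothetical. Statistical packages use several different quartile definitions, so software output can differ slightly from these hand values.

```python
import statistics


def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2:]
    return statistics.median(lower), statistics.median(upper)


pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]
bp = [113, 124, 124, 132, 146, 151, 170]

q1_pulse, q3_pulse = quartiles_by_halves(pulse)
q1_bp, q3_bp = quartiles_by_halves(bp)

print("pulse:", q1_pulse, q3_pulse, "IQR =", q3_pulse - q1_pulse)
print("blood pressure:", q1_bp, q3_bp, "IQR =", q3_bp - q1_bp)
```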


Example 1: Box-plot


Example 2: Box-plots of intensities from 11 gene expression arrays

[Figure: box-plots of array intensities; visible array labels: AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel, KB_8840_AS.cel; y-axis ticks 8, 10, 12, 14]


Outliers

An outlier is an observation which does not appear to belong with the other data

Outliers can arise because of a measurement or recording error or because of equipment failure during an experiment, etc.

An outlier might be indicative of a sub-population, e.g. an abnormally low or high value in a medical test could indicate presence of an illness in the patient.


Outlier Boxplot

Re-define the upper and lower limits of the boxplots (the whisker lines) as:

Lower limit = Q1 − 1.5 × IQR, and

Upper limit = Q3 + 1.5 × IQR

Note that the lines may not go as far as these limits

If a data point is < lower limit or > upper limit, the data point is considered to be an outlier.
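Applied to the CK data, the rule can be sketched as follows. The example assumes Python 3.8 or later and uses `statistics.quantiles` with its default "exclusive" method, which for these data gives the same quartiles as the summary output shown earlier (67.25 and 118.75); other quartile conventions would move the limits slightly.

```python
import statistics

# Serum CK concentrations (U/l) from the slide.
ck = [121, 82, 100, 151, 68, 58,
      95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94,
      119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110,
      25, 123, 70, 48, 95, 42]

# Three cut points dividing the data into quarters: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(ck, n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any point beyond the limits is flagged as a potential outlier.
outliers = sorted(x for x in ck if x < lower_limit or x > upper_limit)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: [{lower_limit}, {upper_limit}]")
print("outliers:", outliers)
```

A flagged point is only a candidate for closer inspection; whether it is an error or a genuine extreme value has to be decided from context.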


Example – CK data

outliers


6. Scatter-plot

Displays the relationship between two continuous variables

Useful in the early stage of analysis when exploring data and determining if a linear regression analysis is appropriate

May show outliers in your data


Example 1: Age versus Systolic Blood Pressure in a Clinical Trial


Example 2: Up-regulation/Down-regulation of gene expression across an array (Control Cy5 versus Disease Cy3)


Example of a Scatter-plot matrix (multiple pair-wise plots)


Other graphical representations

Dot-Plots, Stem-and-leaf plots
– Not visually appealing

Pie-chart
– Visually appealing, but hard to compare two datasets. Best for 3 to 7 categories. A total must be specified.

Violin-plots
– = boxplot + smooth density
– Nice visual of data shape


Multivariate Data

Clustering is useful for visualising multivariate data and uncovering patterns, often reducing its complexity

Clustering is especially useful for high-dimensional data (p>>n): hundreds or perhaps thousands of variables

Obvious areas of application are gel electrophoresis and microarray experiments, where the variables are protein abundances or gene expression ratios


7. Clustering

Aim: Find groups of samples or variables sharing similarity

Clustering requires a definition of distance between objects, quantifying a notion of (dis)similarity
– Points are grouped on the basis of minimum distance apart (distance measures)

Once a pair is grouped, the two points are combined into a single point (using a linkage method), e.g. by taking their average. The process is then repeated.


Clustering

Clustering can be applied to rows or columns of a data set (matrix) i.e. to the samples or variables

A tree can be constructed with branch length proportional to distances between linked clusters, called a Dendrogram

Clustering is an example of unsupervised learning: No use is made of sample annotations i.e. treatment groups, diagnosis groups


UPGMA

Unweighted Pair-Group Method with Arithmetic Mean
– Most commonly used clustering method

Procedure:

1. Each observation forms its own cluster

2. The two with minimum distance are grouped into a single cluster representing a new observation: take their average

3. Repeat step 2 until all data points form a single cluster


Contrived Example

Gene        Array1  Array2  Array3
p53         9       3       7
mdm2        10      2       9
bcl2        1       9       4
cyclinE     6       5       5
caspase 8   1       10      3

5 genes of interest on 3 replicate arrays/gels

Calculate distance between each pair of genes

e.g.

$$d(\text{p53}, \text{mdm2}) = \sqrt{(9-10)^2 + (3-2)^2 + (7-9)^2} = \sqrt{6} \approx 2.45$$


Example

Construct a distance matrix of all pair-wise distances

            p53    mdm2   bcl2    cyclinE  caspase 8
p53         0      2.45   10.44   4.12     11.36
mdm2        -      0      12.45   6.40     13.45
bcl2        -      -      0       6.48     1.41
cyclinE     -      -      -       0        7.35
caspase 8   -      -      -       -        0

Cluster the 2 genes with the smallest distance. Take their average & re-calculate distances to other genes.
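The pairwise Euclidean distances can be recomputed directly from the three array values of each gene. This is a sketch assuming Python 3.8 or later (for `math.dist`); the dictionary name `profiles` and the key spellings are hypothetical. Values recomputed this way may differ slightly from hand-rounded figures.

```python
import math
from itertools import combinations

# Expression values on Array1, Array2, Array3 for each gene (from the table).
profiles = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase8": (1, 10, 3),
}

# Euclidean distance for every unordered pair of genes.
distances = {
    (a, b): math.dist(profiles[a], profiles[b])
    for a, b in combinations(profiles, 2)
}

for (a, b), d in distances.items():
    print(f"{a:>8} - {b:<8} {d:6.2f}")

# The pair with the smallest distance is the first to be clustered.
closest = min(distances, key=distances.get)
print("closest pair:", closest)
```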

After merging caspase-8 & bcl-2 (smallest distance, 1.41) and replacing them by their average:

                      p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                   0      2.45   4.12       10.89
mdm2                         0      6.40       12.94
cyclin E                            0          6.89
{caspase-8 & bcl-2}                            0

After merging p53 & mdm2 (next smallest distance, 2.45):

                      {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}          0              5.24       11.90
cyclin E                             0          6.89
{caspase-8 & bcl-2}                             0
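The merging steps can be sketched as a small loop: find the closest pair, replace it by the average of the two profiles, and repeat. This follows the simplified "take their average" procedure described on the slides, not a library implementation of UPGMA, and the function name `agglomerate` is hypothetical. It assumes Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations

profiles = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase8": (1, 10, 3),
}


def agglomerate(points):
    """Repeatedly merge the two closest clusters, replacing them by their average.

    Returns the list of merges as (name_a, name_b, distance rounded to 2 d.p.).
    """
    clusters = dict(points)
    merges = []
    while len(clusters) > 1:
        # Closest pair among the current clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: math.dist(clusters[pair[0]], clusters[pair[1]]),
        )
        d = math.dist(clusters[a], clusters[b])
        # The new cluster is represented by the average of the two merged points.
        merged = tuple((u + v) / 2 for u, v in zip(clusters[a], clusters[b]))
        del clusters[a], clusters[b]
        clusters[f"{{{a} & {b}}}"] = merged
        merges.append((a, b, round(d, 2)))
    return merges


merges = agglomerate(profiles)
for a, b, d in merges:
    print(f"merge {a} with {b} at distance {d}")
```

The recorded merge distances are what a dendrogram would use as branch heights.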


Example (contd)

..and the final cluster:


Example of a gene expression dendrogram


Variety of approaches to clustering

• Clustering techniques
– agglomerative: start with every element in its own cluster, and iteratively join clusters together
– divisive: start with one cluster and iteratively divide it into smaller clusters

• Distance Metrics
– Euclidean (as-the-crow-flies)
– Manhattan
– Minkowski (a whole class of metrics)
– Correlation (similarity in profiles: called similarity metrics)

• Linkage Rules
– average: Use the mean distance between cluster members
– single: Use the minimum distance (gives loose clusters)
– complete: Use the maximum distance (gives tight clusters)
– median: Use the median distance
– centroid: Use the distance between the “average” member of each cluster
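To make the linkage rules concrete, the sketch below measures the distance between two clusters from the contrived example, {p53, mdm2} and {bcl2, caspase 8}, under four of the rules. Euclidean distance is assumed throughout, the helper names are hypothetical, and Python 3.8 or later is needed for `math.dist`.

```python
import math
from statistics import mean

# Two clusters from the contrived example (Array1, Array2, Array3 values).
cluster_a = [(9, 3, 7), (10, 2, 9)]    # p53, mdm2
cluster_b = [(1, 9, 4), (1, 10, 3)]    # bcl2, caspase 8


def pairwise_distances(a, b):
    """Euclidean distances between every member of a and every member of b."""
    return [math.dist(p, q) for p in a for q in b]


def centroid(points):
    """Coordinate-wise average of a cluster's members."""
    return tuple(mean(coord) for coord in zip(*points))


between = pairwise_distances(cluster_a, cluster_b)

linkage = {
    "single": min(between),      # closest pair of members
    "complete": max(between),    # farthest pair of members
    "average": mean(between),    # mean over all member pairs
    "centroid": math.dist(centroid(cluster_a), centroid(cluster_b)),
}

for rule, d in linkage.items():
    print(f"{rule:>8}: {d:.2f}")
```

Single linkage always gives the smallest of the member-based distances and complete linkage the largest, which is why the choice of rule can change the resulting tree.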


Clustering Summary

The clusters & tree topology often depend highly on the distance measure and linkage method used

Recommended to use two distance metrics, such as Euclidean and a correlation metric

A clustering algorithm will always yield clusters, whether the data are organised in clusters or not!