Multivariate Analysis (ST3007)
Old Faithful Geyser
• The waiting time between eruptions and the duration of the eruption for
the Old Faithful geyser in Yellowstone National Park, Wyoming, USA, were
recorded for 272 consecutive eruptions.
      eruptions  waiting
  1       3.600       79
  2       1.800       54
  3       3.333       74
  4       2.283       62
  5       4.533       85
  6       2.883       55
  7       4.700       88
  8       3.600       85
....................
271       1.817       46
272       4.467       74
Resting Pulse
• A researcher is interested in understanding the effect of smoking and weight
upon resting pulse rate (Low/High).
     PULSE  SMOKES  WEIGHT
 1   Low    No         140
 2   Low    No         145
 3   Low    Yes        160
 4   Low    Yes        190
 5   Low    No         155
 6   Low    No         165
 7   High   No         150
 8   Low    No         190
....................
 9   High   No         150
10   Low    No         108
• In this case, some of the variables are categorical.
G-Force Blackout
• Military pilots sometimes black out when their brains are deprived of
oxygen due to G-forces during violent maneuvers. Glaister and Miller
(1990) produced similar symptoms by exposing volunteers’ lower bodies to
negative air pressure, likewise decreasing oxygen to the brain. The data list
the subjects’ ages and whether they showed syncopal blackout-related signs
during an 18-minute period.
Subject  Age  BlackOut
JW        39         0
JM        42         1
DT        20         0
LK        37         1
JK        20         1
MK        21         0
FP        41         1
DG        52         1
Bank Loans
• A bank in South Carolina looked at a data set consisting of 750 applications
for loans. For each application, the following variables were recorded.
Variable  Description
SUCCESS   Was the loan given or not?
AI        Applicant’s (and co-applicant’s) income
XMD       Applicant’s debt minus mortgage payment
DF        1=Female, 0=Male
DR        1=Non-white, 0=White
DS        1=Single, 0=Not single
DA        Age of house
NNWP      Percentage non-white in neighbourhood
NMFI      Mean family income in neighbourhood
NA        Mean age of houses in neighbourhood
• What determines the granting/not granting of a loan?
Microarray data (tumour types)
• Each microarray slide (n = 83) corresponds to an individual suffering from
one of four tumour types (EWS, BLC, NB and RMS).
• Each slide reports ‘expression levels’ (i.e. a number generally between -6
and +6) for the 2308 genes.
• The aim is to examine which patients are most similar to each other in
terms of their expression profiles across genes.
• We have a large number of features (J = 2308) and a small number of
observations (n = 83).
Graphical Summaries
• It is worth plotting your data to look for interesting structure.
• For two continuous variables, a scatter plot is a good choice.
[Figure: scatter plot of waiting (vertical axis) against eruptions (horizontal axis) for the Old Faithful data.]
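As a rough illustration of the scatter plot idea, the sketch below draws a crude character-grid plot of waiting against eruptions using only the eight Old Faithful rows listed earlier, not the full 272. The function name `text_scatter` and the grid size are assumptions made for this example; in practice a plotting library would normally be used.

```python
# First eight rows of the Old Faithful table shown earlier.
eruptions = [3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600]
waiting = [79, 54, 74, 62, 85, 55, 88, 85]


def text_scatter(xs, ys, width=40, height=12):
    """Return a crude character-grid scatter plot of ys against xs."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each point to a column and row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)


plot = text_scatter(eruptions, waiting)
print(plot)
```

Points drift from the bottom left to the top right of the grid, which suggests that longer eruptions tend to go with longer waiting times in these eight rows.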
Heptathlon Data: Scatter plot matrix
• A scatter plot matrix of the heptathlon data shows some relationships
between the seven events.
[Figure: scatter plot matrix of the heptathlon data, with panels X100m, HighJump, ShotPutt, X200m, LongJump, Javelin and X800m.]
Multivariate analysis
Our objective is to:
• Analyze the internal structure of the data in order to understand and/or
explain the data in a reduced number of dimensions, e.g., Principal
Components, or Factor Analysis.
• Assign group membership:
1. supervised methods — assume a given structure within the data, e.g.,
classification, or discriminant analysis.
2. unsupervised methods — discover structure from the data alone, e.g.,
cluster analysis.
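The difference between the two kinds of method can be sketched on the G-force ages listed earlier. This is only an illustration: a nearest-mean rule stands in for a classification method, and a basic two-group k-means iteration stands in for cluster analysis. Both are simple substitutes chosen for brevity, and the new subject aged 45 is hypothetical.

```python
# Ages and blackout indicators from the G-force table shown earlier.
ages = [39, 42, 20, 37, 20, 21, 41, 52]
blackout = [0, 1, 0, 1, 1, 0, 1, 1]


def mean(values):
    return sum(values) / len(values)


def nearest(value, centres):
    """Index of the centre closest to value."""
    return min(range(len(centres)), key=lambda k: abs(value - centres[k]))


# Supervised: group labels are known, so each group's mean age is computed
# from its labelled members and a new case is assigned to the nearest mean.
class_means = [
    mean([a for a, b in zip(ages, blackout) if b == g]) for g in (0, 1)
]
predicted = nearest(45, class_means)  # hypothetical new subject aged 45

# Unsupervised: labels are ignored; two groups are found by a basic
# 2-means iteration started from the smallest and largest ages.
centres = [min(ages), max(ages)]
for _ in range(20):
    groups = [nearest(a, centres) for a in ages]
    centres = [mean([a for a, g in zip(ages, groups) if g == k]) for k in (0, 1)]

print(class_means, predicted)
print(groups, centres)
```

The supervised step uses the BlackOut column to define the groups, while the unsupervised step splits the subjects by age alone, so the two groupings need not agree.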
Data Reduction
• Multivariate data can have many variables recorded.
• Sometimes we can find a way of producing a reduced number of variables
that contain most of the information of the original data.
• This is the aim of data reduction.
• Example: In the decathlon (heptathlon) data, the athletes are awarded
points. The points are supposed to measure the overall ability of the
athlete. Good athletes have high points and poor athletes have low points.
The points score is a single number that is used to describe the athlete’s
performance in ten (seven) events.
• We will look at data-driven methods of reducing the dimensionality of the
data that retain much of the information contained in the original data.
• The method of data reduction used depends on what information we wish
to retain.
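One data-driven reduction, in the spirit of the principal components idea mentioned earlier, is to replace two variables by a single score along the direction of greatest variance. The sketch below does this for the eight Old Faithful rows listed earlier, and it assumes that "information" means variance and that both variables are standardised first. For two standardised variables with positive correlation r, that direction is (1, 1)/√2 and the score has variance 1 + r. The variable names are chosen for this example only.

```python
import math

# First eight rows of the Old Faithful table shown earlier.
eruptions = [3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600]
waiting = [79, 54, 74, 62, 85, 55, 88, 85]


def standardise(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    return [(v - m) / sd for v in values]


z1 = standardise(eruptions)
z2 = standardise(waiting)
n = len(z1)

# Sample correlation between the two variables.
r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)

# One score per eruption: projection onto the direction (1, 1)/sqrt(2).
scores = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]

# The scores have mean 0, so their sample variance is sum of squares / (n - 1).
score_var = sum(s ** 2 for s in scores) / (n - 1)

# The two standardised variables have total variance 2.
retained = score_var / 2

print(f"correlation: {r:.3f}")
print(f"share of total variance kept by the single score: {retained:.3f}")
```

The single score keeps a share (1 + r)/2 of the total variance, so the stronger the correlation between the two variables, the less is lost by reducing them to one number.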
Principal Components Analysis
• Principal components analysis (PCA) finds linear combinations of the
variables in the data which capture most of the variation in the data.
• The output from a PCA of the heptathlon data is as follows: