Dr. Gari D. Clifford, University Lecturer & Associate Director, Centre for Doctoral Training in Healthcare Innovation, Institute of Biomedical Engineering, University of Oxford
Information Driven Healthcare: Data Visualization & Classification
Lecture 1: Introduction & preprocessing
Centre for Doctoral Training in Healthcare Innovation
The course
A practical overview of (a subset of) classifiers and visualization tools
Data preparation, PCA, K-means clustering, KNN
Statistics, regression, LDA, logistic regression
Neural networks
Gaussian mixture models, EM
Support vector machines
Labs – try to 1) Classify flowers (classic dataset), … then 2) Predict mortality in the ICU! (... & publish if you do well!)
Workload
Two lectures each morning
Five 4-hour labs (each afternoon)
Read one article each evening (optional)
Assessment / assignments
Class interaction
Lab diary – write up notes as you perform investigations – submit your lab code (m-file) and a Word/OpenOffice document answering the questions at 5pm each day …
No paper please!
Absolutely no homework!
... but you can write a paper afterwards if your results are good!
Course texts
Ian Nabney, Netlab: Algorithms for Pattern Recognition, Advances in Pattern Recognition series, Springer (2001), ISBN 1-85233-440-1. http://www.ncrg.aston.ac.uk/netlab/book.php
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer (2006), ISBN 0-387-31073-8. http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm
Press, Teukolsky, Vetterling & Flannery, Numerical Recipes in C: The Art of Scientific Computing, 2nd Edition, Cambridge University Press, 1992. [Ch. 2.6, 10.5 (p414-417), 11.0 (p465-460), 15.4 (p671-688), 15.5 (p681-688), 15.6 & 15.7 (p689-700)]. Online at http://www.nrbook.com/a/bookcpdf.php
L. Tarassenko, A Guide to Neural Computation, John Wiley & Sons (February 1998) Ch. 7 (p77-101)
Wednesday – Optimization and Neural Networks [GDC]
Lecture 5 (9.30-10.30am): ANNs – RBFs and MLPs – choosing an architecture, balancing the data.
Lecture 6 (11am-12pm): Training & optimization, N-fold validation.
Lab 3 (1-5pm): Training an MLP to classify flower types and then mortality – partitioning and balancing data.
Reading for tomorrow: Netlab: Ch3.1-3.4 p79-100
Thursday – Probabilistic Methods [DAC]
Lecture 7 (9.30-10.30am): GMM, MCMC, density estimation
Lecture 8 (11am-12pm): EM, variational Bayes, missing data
Lab 4 (1-5pm): GMM and EM
Reading for tomorrow: Bishop: Ch7 p325-345 (SVM)
Friday – Support Vector Machines [CO/GDC]
Lecture 9 (9.30-10.30am): SVMs and constrained optimization
Lecture 10 (11am-12pm): Wrap-up
Lab 5 (1-5pm): Use the SVM toolbox and vary 2 parameters for regression & classification (1-class: death and then alive), then 2-class.
Overview of data for lab
You will be given two datasets:
1. A simple dataset for learning – Fisher’s Iris dataset
2. A complex ICU database (if this works – publish!!!)
In each lab you will use dataset 1 to understand the problem, then dataset 2 to see how you can apply this to more challenging data
So let’s start … what are we doing?
Trying to learn classes from data so that when we see new data, we can make a good guess concerning its class membership (e.g. is this patient part of the set of people likely to die, and if so, can we change his/her treatment?)
How do we do this?
Supervised – use labelled data to train an algorithm.
Unsupervised – use heuristics or metrics to look for clusters in data (K-means clustering, KNN, SOMs, GMM, …)
Data preprocessing/manipulation
Filter data to remove outliers (reject obvious large/small values)
Make the data zero-mean, unit-variance if parameters are not in the same units! (see the sketch after this list)
Compress data into lower dimensions to reduce workload or to visualize data relationships
Rotate data, or expand into higher dimensions to improve the separation between classes.
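A minimal MATLAB sketch of the first two preprocessing steps above (the N-by-d layout of X, the ±3 standard-deviation outlier threshold, and the variable names are illustrative assumptions, not part of the course code):

% X: N-by-d matrix, one row per example, one column per feature (assumed layout)
% 1) Reject obvious outliers: drop rows with any value more than 3 std devs from its column mean
mu    = mean(X);
sigma = std(X);
keep  = all(abs(X - repmat(mu, size(X,1), 1)) <= 3 * repmat(sigma, size(X,1), 1), 2);
X     = X(keep, :);
% 2) Zero-mean, unit-variance (z-score) each column so features in different units are comparable
Xn    = (X - repmat(mean(X), size(X,1), 1)) ./ repmat(std(X), size(X,1), 1);

(The Statistics Toolbox function zscore does the same normalization in one call.)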
The curse of dimensionality
Richard Bellman (1953) coined the term The Curse of Dimensionality (or Hughes effect)
It’s the problem caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space.
Bellman gives the following example:
100 evenly spaced sample points suffice to sample a unit interval with no more than 0.01 distance between points;
An equivalent sampling of a 10-dimensional unit hypercube, with a lattice spacing of 0.01 between adjacent points, would require 100^10 = 10^20 sample points;
Therefore, at this spatial sampling resolution, the 10-dimensional hypercube is a factor of 10^18 ‘larger’ than the unit interval.
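A one-line MATLAB check of this arithmetic (values taken from the example above):

fprintf('%g %g\n', (1/0.01) .^ [1 10])   % prints 100 and 1e+20: points needed in 1-D vs 10-D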
So what does that mean for us?
We need to think about how much data we have and how many parameters we use.
“Rule of thumb”: need to have at least 10 training samples of each class per input feature dimension (although this depends on separability of data and can be up to 30 for complex problems and as low as 2-5 for simple problems [*])
So for the Iris dataset – we have 4 measured features on 50 examples of each of the three classes … so we have enough!
For ICU data we have 1400 patients, 970 survived and 430 died … so taking the minimum of these we could use up to 43 of the 112 features
Generally though you need more data …
Or you compress the data into a smaller number of dimensions
[*] Thomas G. Van Niel, Tim R. McVicarb and Bisun Datt, On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification, Remote Sensing of Environment, Volume 98, Issue 4, 30 October 2005, Pages 468-480 doi:10.1016/j.rse.2005.08.011
Principal Component Analysis (PCA)
Standard signal/noise separation method
Compress data into lower dimensions to reduce workload or to visualize data relationships
Rotate data to improve the separation between classes
Also known as the Karhunen-Loève (KL) transform or the Hotelling transform or Singular Value Decomposition (SVD) – although SVD is actually a mathematical method for computing PCA
Principal Component Analysis (PCA)
A form of Blind Source Separation – an observation, X, can be broken down into a mixing matrix, A, and a set of basis functions, Z:
X = AZ
Second-order decorrelation = independence
Find a set of orthogonal axes in the data (independence metric = variance)
Project data onto these axes to decorrelate
Independence is forced onto the data through the orthogonality of the axes
Two dimensional example
Where are the principal components?
Hint: axes of maximum variation, and orthogonal
Two dimensional example
The first principal component gives the best axis to project onto (minimum RMS error)
Data becomes ‘sphered’ or whitened / decorrelated
Singular Value Decomposition (SVD)
Decompose the observation X = AZ into…
X = USVᵀ
S is a diagonal matrix of singular values with elements arranged in descending order of magnitude (the singular spectrum)
The columns of V are the eigenvectors of C = XᵀX (the orthogonal subspace … dot(vi, vj) = 0 for i ≠ j) … they ‘demix’ or rotate the data
U is the matrix of projections of X onto the eigenvectors of C … the ‘source’ estimates
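These relationships can be checked numerically in MATLAB (an illustrative sketch, not course code; eig need not return eigenvalues in descending order, so they are sorted for comparison):

[U,S,V] = svd(X, 'econ');                         % X = U*S*V'
[Vc,D]  = eig(X' * X);                            % eigenvectors/eigenvalues of C = X'*X
disp(sort(diag(D), 'descend')' - (diag(S).^2)');  % ~0: squared singular values = eigenvalues of C
disp(norm(X * V - U * S));                        % ~0: U*S holds the projections of X onto V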
SVD – matrix algebra
Decompose the observation X = AZ into…
X = USVᵀ
Eigenspectrum of decomposition
S = diagonal matrix of singular values … zeros except on the leading diagonal
The diagonal elements Sij (i = j) are the singular values – the square roots of the eigenvalues (eigenvalues^½)
Placed in order of descending magnitude
Correspond to the magnitude of projected data along each eigenvector
Eigenvectors are the axes of maximal variation in the data
Variance = power
(analogous to Fourier components in power spectra)
Eigenspectrum = plot of the eigenvalues [stem(diag(S).^2)]
SVD: Method for PCA
See BSS notes and example at end of presentation
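In outline, a minimal MATLAB sketch of PCA via SVD might look like this (the N-by-d data layout and variable names are assumptions; the BSS notes give the full treatment):

% X: N-by-d data matrix, one observation per row (assumed layout)
Xc      = X - repmat(mean(X), size(X,1), 1);   % remove the column means first
[U,S,V] = svd(Xc, 'econ');                     % economy-size SVD: Xc = U*S*V'
scores  = Xc * V;                              % projections onto the principal axes (= U*S)
stem(diag(S).^2);                              % the eigenspectrum, as in the earlier slide
disp(cov(scores));                             % (near-)diagonal: the projections are decorrelated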
SVD for noise/signal separation
To perform SVD filtering of a signal, use a truncated SVD decomposition (using the first p eigenvectors)
Y = U Sp Vᵀ
[Reduce the dimensionality of the data by discarding the noise projections (setting Snoise = 0), then reconstruct the data with just the signal subspace]
Most of the signal is contained in the first few principal components.
Discarding the remaining (noise) components and projecting back into the original observation space effects a noise-filtering or a noise/signal separation
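A minimal MATLAB sketch of this truncation (the number of retained components p is assumed to have been chosen by eye from the eigenspectrum; names are illustrative):

[U,S,V] = svd(X, 'econ');         % decompose the observations
Sp = S;
Sp(p+1:end, p+1:end) = 0;         % zero the trailing singular values (the noise subspace)
Y  = U * Sp * V';                 % Y = U*Sp*V' : rank-p, noise-reduced reconstruction of X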
e.g.
Imagine a ‘spectral decomposition’ of the matrix:

X =
[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]

= [ u1 u2 ] x [ σ1 0 ; 0 σ2 ] x [ v1ᵀ ; v2ᵀ ]

i.e. a sum of rank-one terms, σ1·u1·v1ᵀ + σ2·u2·v2ᵀ, one per singular value
SVD – Dimensionality reduction
How exactly is dimension reduction performed?
A: Set the smallest singular values to zero:
X =
[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]

U =
[ 0.18 0
  0.36 0
  0.18 0
  0.90 0
  0    0.53
  0    0.80
  0    0.27 ]

S =
[ 9.64 0
  0    5.29 ]

Vᵀ =
[ 0.58 0.58 0.58 0    0
  0    0    0    0.71 0.71 ]

so that X = U x S x Vᵀ (here the three smallest singular values are already zero, so only two components remain)
SVD – Dimensionality reduction
… note approximation sign
[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]

~

[ 0.18
  0.36
  0.18
  0.90
  0
  0
  0 ]  x  9.64  x  [ 0.58 0.58 0.58 0 0 ]
SVD - Dimensionality reduction
… and the resultant matrix is an approximation using only the first eigenvector (the one with the largest singular value)

[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 2 2
  0 0 0 3 3
  0 0 0 1 1 ]

~

[ 1 1 1 0 0
  2 2 2 0 0
  1 1 1 0 0
  5 5 5 0 0
  0 0 0 0 0
  0 0 0 0 0
  0 0 0 0 0 ]
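The worked example above can be checked directly in MATLAB; this short sketch rebuilds the 7-by-5 matrix and keeps only the largest singular value (signs of U and V may differ, since the SVD is only defined up to sign):

X = [1 1 1 0 0; 2 2 2 0 0; 1 1 1 0 0; 5 5 5 0 0; 0 0 0 2 2; 0 0 0 3 3; 0 0 0 1 1];
[U,S,V] = svd(X, 'econ');
disp(diag(S)');                   % singular values: approx 9.64, 5.29, 0, 0, 0
X1 = S(1,1) * U(:,1) * V(:,1)';   % rank-1 reconstruction from the largest singular value
disp(round(X1));                  % matches the approximation above: the last three rows become zero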
Real ECG data example
[Figure: the original ECG observations X; the eigenspectrum S²; and truncated reconstructions Xp = U Sp Vᵀ for p = 2 and p = 4]
Recap - PCA
Second-order decorrelation = independence
Find a set of orthogonal axes in the data (independence metric = variance)
Project data onto these axes to decorrelate
Independence is forced onto the data through the orthogonality of the axes
Conventional noise / signal separation technique
Often used as a method of initializing weights for neural networks and other learning algorithms (see Wednesday's lectures).