Kansas State University
Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition
Wednesday, 27 February 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.kddresearch.org
http://www.cis.ksu.edu/~bhsu
Readings:
Section 6.11, Han & Kamber 2e
Chapter 1, Sections 6.1-6.5, Goldberg
Sections 9.1-9.4, Mitchell
Regression and Prediction
Lecture 16 of 42
Hypothesis testing with Correlations
• Two possibilities
– H0: ρ = 0 (no actual correlation; the null hypothesis)
– Ha: ρ ≠ 0 (there is some correlation; the alternative hypothesis)
• Case #1 (see correlation worksheet)
– Correlation between distance and points: r = -.904
– Sample is small (n = 6), but r is very large
– We guess ρ < 0 (we guess there is some correlation in the population)
• Case #2
– Correlation between aiming and points: r = .628
– Sample is small (n = 6), and r is only moderate in size
– We guess ρ = 0 (we guess there is NO correlation in the population)
• Bottom line
– We can only guess about ρ
– We can be wrong in two ways
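A minimal sketch of this test in Python (assuming SciPy is available); the six (distance, points) pairs are hypothetical stand-ins for the worksheet data, not the actual values:

```python
# Test H0: rho = 0 against Ha: rho != 0 for a small sample.
from scipy import stats

distance = [5, 10, 15, 20, 25, 30]   # hypothetical predictor values
points   = [28, 25, 20, 15, 11, 6]   # hypothetical response values

r, p_value = stats.pearsonr(distance, points)
print(f"r = {r:.3f}, p = {p_value:.4f}")

# Decision rule: reject H0 when p < alpha; otherwise we cannot rule
# out rho = 0 -- with n = 6, even a large r can leave us guessing.
alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```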
Predictive Potential
• Coefficient of Determination
– r²
– Amount of variance in y accounted for by x
– Percentage increase in accuracy you gain by using the regression line to make predictions
– Without correlation, you can only guess the mean of y
– [Used with regression]
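A short sketch, with made-up data, of what r² measures: the proportional reduction in squared error from using the regression line instead of always guessing the mean of y:

```python
# r^2 = 1 - SS_res / SS_tot: the proportional reduction in squared
# error from using the least-squares line instead of the mean of y.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)        # error using the line
ss_tot = np.sum((y - y.mean()) ** 2)     # error guessing the mean
print("r^2 =", 1.0 - ss_res / ss_tot)    # equals corr(x, y) squared
```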
Time Series Prediction:
Forecasting the Future and Understanding the Past
Santa Fe Institute Proceedings on the Studies in the Sciences of Complexity
Edited by Andreas Weigend and Neil Gershenfeld

NIST Complex Systems Program
Perspectives on Standard Benchmark Data in Quantifying Complex Systems
Vincent Stanford
Complex Systems Test Bed project
August 31, 2007
Chaos in Nature, Theory, and Technology
[Figures: Rings of Saturn; Lorenz attractor; aircraft dynamics at high angles of attack]
Time Series Prediction
A Santa Fe Institute competition using standard data sets
• Santa Fe Institute (SFI) founded in 1984 to “… focus the tools of traditional scientific disciplines and emerging computer resources on … the multidisciplinary study of complex systems…”
• “This book is the result of an unsuccessful joke. … Out of frustration with the fragmented and anecdotal literature, we made what we thought was a humorous suggestion: run a competition. …no one laughed.”
• Time series from physics, biology, economics, …, raise the same questions:
– What happens next?
– What kind of system produced this time series?
– How much can we learn about the producing system?
• Quantitative answers can permit direct comparisons
• Make some standard data sets in consultation with subject matter experts in a variety of areas
• Very NISTY; but we are in a much better position to do this in the age of Google and the Internet
Selecting benchmark data sets
For inclusion in the book
Time-honored linear models
• Auto-Regressive Moving Average (ARMA)
• Many linear estimation techniques based on Least Squares or Least Mean Squares
• Power spectra and autocorrelation characterize such linear systems
• Randomness comes only from the forcing function x(t)
$$y[t+1] = \sum_{i=0}^{N_{AR}} a_i \, y[t-i] + \sum_{j=0}^{N_{MA}} b_j \, x[t-j]$$
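A minimal simulation sketch of this recursion, with illustrative (not estimated) coefficients; note that all randomness enters through the forcing sequence x:

```python
# Simulate y[t+1] = sum_i a_i y[t-i] + sum_j b_j x[t-j].
import numpy as np

rng = np.random.default_rng(0)
a = [0.6, -0.2]                      # AR coefficients a_i (hypothetical)
b = [1.0, 0.5]                       # MA coefficients b_j (hypothetical)

T = 200
x = rng.standard_normal(T)           # forcing function: the only randomness
y = np.zeros(T)
for t in range(max(len(a), len(b)) - 1, T - 1):
    ar = sum(a[i] * y[t - i] for i in range(len(a)))
    ma = sum(b[j] * x[t - j] for j in range(len(b)))
    y[t + 1] = ar + ma
```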
• The spectrum and autocorrelation characterize linear systems, but not chaotic ones
• Deterministic chaos looks random to linear analysis methods
• Logistic map is an early example (Elam 1957).
$$x[t+1] = r\,x[t]\,(1 - x[t])$$
[Figure: logistic map bifurcation diagram, 2.9 < r < 3.99]
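A minimal sketch of iterating the map; the orbit is fully deterministic, yet for r near 4 it looks like noise to linear diagnostics:

```python
# Iterate x[t+1] = r * x[t] * (1 - x[t]) from a fixed starting point.
def logistic_orbit(r, x0=0.3, n=50):
    xs = [x0]
    for _ in range(n - 1):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

print(logistic_orbit(2.9)[-5:])    # settles toward a fixed point
print(logistic_orbit(3.99)[-5:])   # chaotic: no visible pattern
```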
Understanding and learning
comments from SFI
• Weak to strong models - many parameters to few
• Data poor to data rich
• Theory poor to theory rich
• Weak models progress to strong, e.g. planetary motion:
– Tycho Brahe: observes and records raw data
– Kepler: equal areas swept in equal time
– Newton: universal gravitation, mechanics, and calculus
– Poincaré: fails to solve the three-body problem
– Sussman and Wisdom: chaos ensues with a computational solution!
• Is that a simplification?
Discovering properties of data
and inferring (complex) models
• Can’t decompose an output into the product of input and transfer function, Y(z) = H(z)X(z), by doing a Z, Laplace, or Fourier transform
• Linear perceptrons were shown to have severe limitations by Minsky and Papert
• Perceptrons with non-linear threshold logic can solve XOR and many classifications not available with the linear version (see the sketch below)
• But according to SFI: “Learning XOR is as interesting as memorizing the phone book. More interesting - and more realistic - are real-world problems, such as prediction of financial data.”
• Many approaches are investigated
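A minimal sketch of the XOR point: two threshold units feeding a third compute XOR, which no single linear unit can; the weights are a standard hand-picked solution, not learned:

```python
# XOR as (x1 OR x2) AND NOT (x1 AND x2), built from step units.
def step(z):
    return float(z > 0)

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)        # fires unless both inputs are 0
    h_and = step(x1 + x2 - 1.5)        # fires only when both inputs are 1
    return step(h_or - h_and - 0.5)    # OR minus AND gives XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", int(xor_net(x1, x2)))
```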
Time delay embedding
Differs from traditional experimental measurements
– Provides detailed information about degrees of freedom beyond the scalar measured
– Rests on probabilistic assumptions - though not guaranteed to be valid for any particular system
– Reconstructed dynamics are seen through an unknown “smooth transformation”
– Therefore allows precise questions only about invariants under “smooth transformations”
– It can still be used for forecasting a time series and “characterizing essential features of the dynamics that produced it”
Time delay embedding theorems
“The most important Phase Space Reconstruction technique is the method of delays”
– Assume the dynamics f(X) on a V-dimensional manifold has a strange attractor A with box-counting dimension dA
– s(X) is a twice-differentiable scalar measurement giving {sn} = {s(Xn)}
– M is called the embedding dimension; the time spacing τ is generally referred to as the delay, or lag
– Embedding theorems: if {sn} consists of scalar measurements of the state of a dynamical system then, under suitable hypotheses, the time delay embedding {Sn} is a one-to-one transformed image of the {Xn}, provided M > 2dA (e.g. Takens 1981, Lecture Notes in Mathematics, Springer-Verlag; or Sauer and Yorke, J. of Statistical Physics, 1991)
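A minimal sketch of the method of delays; the embedding dimension M and lag τ are supplied by the analyst here (the theorems only guarantee that a sufficiently large M works):

```python
# Build delay vectors S_n = (s_n, s_{n+tau}, ..., s_{n+(M-1)tau})
# from a scalar series; each row is one reconstructed state.
import numpy as np

def delay_embed(s, M, tau):
    s = np.asarray(s)
    n = len(s) - (M - 1) * tau     # number of complete delay vectors
    return np.column_stack([s[i * tau : i * tau + n] for i in range(M)])

s = np.sin(np.linspace(0.0, 20.0, 200))   # stand-in scalar measurement
S = delay_embed(s, M=3, tau=5)
print(S.shape)                             # (190, 3)
```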
Time series prediction
Many different techniques thrown at the data to “see if anything sticks”
Examples:
– Delay coordinate embedding - short-term prediction by filtered delay coordinates and reconstruction with local linear models of the attractor (T. Sauer)
– Neural networks with internal delay lines - performed well on data set A (E. Wan), (M. Mozer)
– Simple architectures for fast machines - “Know the data and your modeling technique” (X. Zhang and J. Hutchinson)
– Forecasting pdfs using HMMs with mixed states - capturing “Embedology” (A. Fraser and A. Dimitriadis)
– More…
Time series characterization
Many different techniques thrown at the data to “see if anything sticks”
Examples:
– Stochastic and deterministic modeling - local linear approximation to attractors (M. Casdagli and A. Weigend)
– Estimating dimension and choosing time delays - box counting (F. Pineda and J. Sommerer)
– Quantifying chaos using information-theoretic functionals - mutual information and nonlinearity testing (M. Palus)
– Statistics for detecting deterministic dynamics - coarse-grained flow averages (D. Kaplan)
– More…
What to make of this?
Handbook for the corpus-driven study of nonlinear dynamics
Very NISTY:
– Convene a panel of leading researchers
– Identify areas of interest where improved characterization and predictive measurements can be of assistance to the community
– Identify standard reference data sets:
  • Development corpora
  • Test sets
– Develop metrics for prediction and characterization
– Evaluate participants
– Is there a sponsor?
– Are there areas of special importance to communities we know? For example: predicting catastrophic failures of machines from sensors.
Ideas?
Terminology
• Evolutionary Computation (EC): Models Based on Natural Selection
• Genetic Algorithm (GA) Concepts
– Individual: single entity of model (corresponds to hypothesis)
– Population: collection of entities in competition for survival
– Generation: single application of selection and crossover operations
– Schema aka building block: descriptor of GA population (e.g., 10**0*)
– Schema theorem: a schema’s expected representation in the next generation grows in proportion to its relative fitness (a minimal GA illustrating these terms appears below)
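A minimal GA sketch tying these terms together: bit-string individuals, fitness-proportional selection, and single-point crossover applied once per generation; the count-the-ones fitness function is purely illustrative:

```python
# One generation = fitness-proportional selection plus crossover.
import random

def fitness(ind):
    return sum(ind)                       # "ones" problem (illustrative)

def crossover(p1, p2):
    cut = random.randrange(1, len(p1))    # single-point crossover
    return p1[:cut] + p2[cut:]

def next_generation(pop):
    weights = [fitness(ind) for ind in pop]
    return [crossover(*random.choices(pop, weights=weights, k=2))
            for _ in range(len(pop))]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
for _ in range(10):
    pop = next_generation(pop)
print(max(fitness(ind) for ind in pop))   # best fitness after 10 generations
```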