Statistics Overview
Biologists say, “If you need to use statistics, you don’t have enough data.”
Engineers say, “If you know enough about statistics, you don’t need much data.”
• Probability distribution
• Problems with statisticians’ notation
• Hypothesis testing
• Regression analysis
• Model fitting
• Outlier rejection
• Data presentation
• Experimental design
Sir Ronald Aylmer Fisher (1890-1962)
Probability Distribution Functions
If I make a measurement of a variable, how do I know how that sample relates to the mean?
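[Figure: measured values y(x) plotted against x, with the probability distribution p[y(x)] of y(x) centered on the mean µ]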
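Probability density p(x) that a value selected at random from a Gaussian distribution with mean µ and variance σ² will have value x
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
µ is the mean of the distribution, given by
$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$
σ is the standard deviation of the distribution, given by
$$\sigma = \sqrt{\int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx}$$
Probability P that random variable X will fall between a and b
$$P(a \le X \le b) = \int_a^b p(x)\,dx$$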
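A minimal Python sketch of the Gaussian density and of the probability that X falls between a and b. The function names are illustrative and not from the slides; the interval probability is evaluated with the error function from the standard library rather than by numerical integration.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density p(x) of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_between(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b): the integral of p(x) from a to b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)

# Probability mass within one and two standard deviations of the mean.
within_one_sigma = prob_between(-1.0, 1.0)
within_two_sigma = prob_between(2.0, 6.0, mu=4.0, sigma=1.0)
```

For any Gaussian, about 68% of values fall within one σ of µ and about 95% within two σ, which is what the two calls above evaluate.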
Normal (Gaussian) Probability
[Figure: Gaussian PDFs for (µ = 0, σ = 1), (µ = 0, σ = 2), (µ = 0, σ = 3), and (µ = 4, σ = 1)]
Central Limit Theorem
σ² is called the variance
µ and σ² are the first two moments of the PDF (the mean and the second central moment)
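A minimal simulation sketch, assuming the usual statement of the central limit theorem: the mean of n independent draws from a distribution with mean µ and finite variance σ² is approximately Gaussian with mean µ and variance σ²/n when n is large. The uniform distribution on [0, 1), the sample size, and the trial count are illustrative choices, not values from the slides.

```python
import math
import random
import statistics

def sample_means(n, trials, seed=0):
    """Return `trials` means, each taken over n uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]

n = 30
means = sample_means(n=n, trials=5000)

# Uniform(0, 1) has mean 1/2 and variance 1/12.
expected_mean = 0.5
expected_var = (1.0 / 12.0) / n

observed_mean = statistics.mean(means)
observed_var = statistics.variance(means)

# For a Gaussian, about 68% of values lie within one standard deviation.
sd = math.sqrt(expected_var)
frac_within = sum(abs(m - expected_mean) <= sd for m in means) / len(means)
```

Although each individual draw is uniform, the sample means cluster around 1/2 with variance close to 1/(12n), and roughly 68% of them fall within one standard deviation, as a Gaussian would predict.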
Fit a line to data containing noise using the least squares method
• Minimize the sum of squared residuals
• Model with one independent variable
• Model with p-1 independent variables
• Goodness of fit
– Fraction of variance in data which is explained by model
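The steps above can be sketched in Python for the one-variable case. This assumes the model y = β0 + β1·x + ε and takes goodness of fit to be R² = 1 − SS_res/SS_tot; the symbols, function names, and data values are illustrative assumptions, since the slide equations are not available.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Fraction of the variance in y that the fitted line explains."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up noisy data scattered about a straight line.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

b0, b1 = fit_line(x, y)
r2 = r_squared(x, y, b0, b1)
```

The closed-form slope and intercept are the values that minimize the sum of squared residuals. With p−1 independent variables the same minimization leads to a linear system in p unknown coefficients, which is normally solved with a linear algebra library rather than by hand.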
Nonlinear Regression
Fit an arbitrary function to data containing noise, again using least squares method
R² isn’t necessarily a good measure of goodness of fit
• L2 norm (Euclidean distance)
• Relative error in L2 norm
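A minimal sketch of the two error measures, assuming the relative error is defined as ‖y − f‖₂ / ‖y‖₂. The exponential model, the candidate parameter values, and the data are illustrative and not from the slides.

```python
import math

def model(x, a, b):
    """Hypothetical nonlinear model f(x) = a * exp(b * x)."""
    return a * math.exp(b * x)

def l2_norm(values):
    """Euclidean length of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values))

def fit_errors(x, y, a, b):
    """Return (L2 norm of the residuals, the same norm relative to ||y||)."""
    residuals = [yi - model(xi, a, b) for xi, yi in zip(x, y)]
    absolute = l2_norm(residuals)
    return absolute, absolute / l2_norm(y)

# Made-up data lying close to 2 * exp(0.5 * x).
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 3.3, 5.4, 9.0]

good_abs, good_rel = fit_errors(x, y, a=2.0, b=0.5)
poor_abs, poor_rel = fit_errors(x, y, a=2.0, b=0.4)
```

A fitting routine would search over a and b for the smallest residual norm. The relative form is dimensionless, so it can be compared across data sets with different units or magnitudes.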
Regression Analysis
Qualitative Verification
All of these methods assume that the error ε is normally distributed
• Check by looking at plot of residuals
• Residuals should be randomly distributed around axis r = 0
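A small sketch of what the residual check can reveal, assuming a straight line is fitted to data that actually follow a curve. The data and helper names are illustrative. Plotting is left out; the sign pattern of the residuals stands in for the plot.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """Residuals r_i = y_i - (b0 + b1 * x_i) of the fitted line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Quadratic data deliberately fitted with the wrong (linear) model.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [xi ** 2 for xi in x]

r = residuals(x, y)
pattern = "".join("+" if ri > 0 else "-" for ri in r)
```

Here the residuals are positive at both ends and negative in the middle. A systematic shape like this, rather than random scatter about r = 0, suggests that the model form is wrong, not that the data are noisy.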
Model Fitting
In Excel
Add trendline – Excel does everything for you
• Only works if you want to use an available function
Goal seek
• Only works for unconstrained, one parameter models
Solver
• Can use for constrained, multiple parameter models
• Uses Quasi-Newton or conjugate gradient method
In MATLAB
Built-in functions
• Root finding by bisection, secant, and inverse quadratic interpolation (fzero)
• Nelder-Mead simplex (fminsearch)
Optimization toolboxes
• Levenberg-Marquardt (lsqnonlin or lsqcurvefit); Quasi-Newton (fminunc); constrained minimization (fmincon)
• Simulated annealing
• Genetic algorithm (GA)
Curve fitting toolbox
Custom algorithm
All methods work by minimizing some error
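To illustrate the idea of minimizing an error, here is a minimal Python sketch that fits a one-parameter model y = k·x by minimizing the sum of squared errors. It uses golden-section search, which is a simple stand-in and not one of the Excel or MATLAB methods listed above. The model, data, and search bracket are illustrative assumptions.

```python
import math

def sum_squared_error(k, x, y):
    """Error to minimize for the one-parameter model y = k * x."""
    return sum((yi - k * xi) ** 2 for xi, yi in zip(x, y))

def golden_section_minimize(f, lo, hi, tol=1e-7):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while abs(hi - lo) > tol:
        if f(c) < f(d):
            hi = d   # the minimum lies in [lo, d]
        else:
            lo = c   # the minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return 0.5 * (lo + hi)

# Made-up data roughly following y = 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

k_fit = golden_section_minimize(lambda k: sum_squared_error(k, x, y), 0.0, 10.0)

# This model has a closed-form least-squares answer, used here as a check.
k_exact = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
```

For this simple model the search is unnecessary, since the closed form exists. The point is that the same pattern (define an error, hand it to a minimizer) carries over to models with no closed-form solution.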
Model Fitting in Excel
Model Fitting in MATLAB
Outlier Rejection
What is an outlier?
An outlier is a data point which disagrees with the other data and cannot be reproduced
Caused by measurement error, incorrect value of independent variable (i.e. user error), noise, chance, or lack of control or understanding of the process
Example:
y(x) = [1.2, 1.3, 5.0, 1.1, 1.2]^T
μ = 1.96; σ = 1.70
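The quoted values can be reproduced with Python's standard library. The sketch below suggests that σ here is the sample standard deviation (divisor n − 1), since the population form (divisor n) gives a different number.

```python
import statistics

y = [1.2, 1.3, 5.0, 1.1, 1.2]

mu = statistics.mean(y)                 # arithmetic mean
sigma_sample = statistics.stdev(y)      # divides by n - 1
sigma_population = statistics.pstdev(y) # divides by n

summary = (round(mu, 2), round(sigma_sample, 2), round(sigma_population, 2))
```

The sample form rounds to 1.70, matching the slide; the population form rounds to 1.52.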
When is a point an outlier?
Dixon’s Q Test
• Very simple – just look up a value in a table to see if it’s an outlier
Chauvenet’s Criterion
• Simple, less rigorous
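• If p(xi) < 1/(2n), throw it out, where p(xi) is the probability of a deviation from the mean at least as large as that of xi and n is the number of points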
Grubbs’ Test; Peirce’s Criterion
• Both utilize more rigorous methods
• See paper
Without outlier: μ = 1.20; σ = 0.08
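A minimal sketch of Chauvenet's criterion applied to the example data. It assumes that p(xi) is the two-tailed Gaussian probability of a deviation at least as large as that of xi, computed from the sample mean and the sample standard deviation of the full data set. The function name is illustrative.

```python
import math
import statistics

def chauvenet_outliers(values):
    """Return the values whose two-tailed tail probability is below 1/(2n)."""
    n = len(values)
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    flagged = []
    for v in values:
        z = abs(v - mu) / sigma
        # Probability of a deviation at least this large, both tails.
        tail_prob = math.erfc(z / math.sqrt(2.0))
        if tail_prob < 1.0 / (2.0 * n):
            flagged.append(v)
    return flagged

y = [1.2, 1.3, 5.0, 1.1, 1.2]
outliers = chauvenet_outliers(y)
kept = [v for v in y if v not in outliers]

mu_kept = statistics.mean(kept)
sigma_kept = statistics.stdev(kept)
```

Only the value 5.0 is flagged, and the remaining four points give the mean and standard deviation quoted for the data without the outlier. The criterion is conventionally applied once to a data set, not repeated on the reduced data.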
What makes a good figure?
Clearly relates independent and dependent variables using axes and trend lines
• Units!!
• Proper scaling
– Use log scales if variable(s) vary over orders of magnitude
Symbols and text are large and different
Resolution is sufficiently high
Error bars (if applicable)
Efficient use of space
Utilizes significant figures appropriately
Compares data with applicable model predictions
Contains enough information to get the point(s) across, but not so much that the message is lost or confused
Captioned such that it is understood without reading the text
Reilly et al., Experimental Eye Research, 2008.
Presentation of Data
Which of these figures is better?
From a journal article which was rejected. Problems flagged on the figure: ambiguous, wasted space, significant figures, fuzzy text, error bars, goodness-of-fit, legend, units.
Presentation of Data
Reilly et al., Biomacromolecules, 2008.
Tiffany and Koretz, International Journal of Biological Molecules, 2002.
Statistical Experimental Design
Design an experiment using statistical methods to minimize the number of data points required to get the desired information.
Analyze an experiment using statistical methods to maximize the information yield from any set of experiments.