http://publicationslist.org/junio
Data Analysis
Two variables: establishing relationships
Prof. Dr. Jose Fernando Rodrigues Junior
ICMC-USP
What is it about?
When dealing with two variables, the main interest is to know whether and how they are interrelated.
To this end, plotting one variable against the other is the straightforward course of action: scatter plots.
Scatter Plots (xy plot) - example
Scatter Plots (xy plot)
Example: typical data, for instance the prevalence of skin cancer as a function of the mean income for a group of individuals, or the unemployment rate as a function of the frequency of high-school dropouts.
In this example, which is not rare, the plot is not conclusive about the presence of a relationship
Scatter Plots (xy plot)
Typical plots (four panels): no relationship; strong, simple relationship; strong, not-simple relationship; multivariate relationship.
Linear regression
Given a controlled input variable x and a corresponding output response y, we are looking for a linear function f(x) = a + bx = y that reproduces the response with the least amount of error; a linear regression is a function that minimizes the error in the responses for a given set of inputs.
The technique must not be misunderstood as a summarization technique; it is a prediction technique.
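Linear regression
The math behind linear regression is surprisingly simple, which is what makes it so popular (and misused as well); its principle is to minimize (over a and b) the sum of squared differences between the actual data and f(x) = a + bx.
With a little algebra, the preferred values for a and b, for n data points (x_i, y_i), are given by:
$$b = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}, \qquad a = \frac{1}{n} \left( \sum_i y_i - b \sum_i x_i \right)$$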
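As a minimal sketch, the standard closed-form least-squares solution for a and b can be computed directly from sums over the data. The function name `fit_line` and the sample values are assumptions made only for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line f(x) = a + b*x."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = (n * sxy - sx * sy) / denom   # slope
    a = (sy - b * sx) / n             # intercept
    return a, b


# Hypothetical data lying close to y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = fit_line(xs, ys)
print(f"f(x) = {a:.3f} + {b:.3f} x")
```

The guard on `denom` reflects the requirement that the controlled variable must take at least two distinct values.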
However, linear regression can be misleading
Linear regression
Consider these four data sets, known as Anscombe's quartet:
Linear regression
All four data sets of Anscombe's quartet have the same linear regression; however, they are essentially different:
• The first data set is represented correctly
• The second is not linear
• The third has a pronounced outlier, which the regression does not accommodate
• The fourth does not have enough distinct values of the independent variable x to support a linear regression (only two values: 8.0 and 19.0)
• The problem is even worse: the confidence intervals of the data sets are all the same as well, so the problem is noticed only when the data is plotted
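The claim that all four data sets share one regression line can be checked numerically. The sketch below uses the quartet values as published by Anscombe (typed in by hand here, so worth double-checking against a trusted copy); `fit_line` is a helper name chosen for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for f(x) = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


# Data sets I-III share the same x values; data set IV has only two distinct x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b) in fits.items():
    print(f"data set {name}: f(x) = {a:.2f} + {b:.2f} x")
```

Every data set yields approximately f(x) = 3.00 + 0.50 x, so the fitted numbers alone cannot tell the four cases apart; only the plots can.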
To verify a linear regression, a useful exercise is to check where the next response falls in the plot: the regression is adequate only if that response falls on, or close to, the line defined by the points already known.
Linear regression
Use linear regression only if:
• the data can be described by a straight line
• the data is well-behaved, that is, there are no pronounced outliers
• there are enough values for the controlled variable
In any case, linear regression must be accompanied by a scatter plot so that visual verification is possible.
Dealing with noisy data
When the data is noisy, it is often helpful to find a smooth curve that represents it, so that trends and structure can be more easily noticed.
Two methods are frequently used: weighted splines (Splines) and locally weighted regression (LOESS or LOWESS).
Both work by approximating the data in a small neighborhood (locally) by a polynomial of low order (at most cubic), with an adjustable parameter that controls the stiffness of the curve.
The stiffer the curve, the smoother it appears, but the less accurately it can follow the individual data points; balancing smoothness and accuracy is the challenge here.
Splines
Splines are constructed from piecewise polynomial functions (typically cubic) that are joined together in a smooth fashion.
A cubic interpolation polynomial is required for each consecutive pair of points, such that these individual polynomials have the same values, as well as the same first and second derivatives, at the points where they meet; these smoothness conditions lead to a set of linear equations for the coefficients of the polynomials, which can be solved so that the spline curve can be evaluated at any desired location.
Splines
In addition to the local smoothness requirements at each joint, splines must also satisfy a global smoothness condition by minimizing the functional:
$$J[s] = \alpha \int \left( \frac{d^2 s}{dt^2} \right)^2 dt + (1 - \alpha) \sum_i w_i \left( y_i - s(x_i) \right)^2$$
where s(t) is the spline curve, (x_i, y_i) are the coordinates of the two-variable data points, w_i are weight factors (one for each point), and α is a mixing factor.
The 1st term controls how wiggly the spline is: many wiggles lead to large second derivatives. The 2nd term captures how accurately the spline represents the data points by measuring the squared deviation of the spline from each data point.
The w_i values can be given by w_i = 1/d_i^2, where d_i measures how close the spline should pass by (x_i, y_i); that is, greater weights go to points that the spline should stay close to (previously chosen pivots, for example).
The α value mixes the importance of the 1st (α) and the 2nd (1 - α) terms, balancing smoothness and accuracy; high values avoid wiggly curves, and low values lead to more precise, though less smooth, curves. This is the main parameter in off-the-shelf plotting software.
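To make the trade-off between the two terms concrete, the sketch below evaluates a discrete approximation of the functional for two candidate curves. It is not a spline fitter: the finite-difference approximation of the second derivative, the evenly spaced x values, the function name `spline_functional`, and the sample data are all assumptions made for illustration.

```python
def spline_functional(xs, ys, curve, alpha, weights=None):
    """Discrete approximation of the smoothing functional.

    `curve` holds the candidate curve sampled at the (evenly spaced) xs.
    Returns (J, roughness, misfit).
    """
    n = len(xs)
    if weights is None:
        weights = [1.0] * n
    h = xs[1] - xs[0]  # assumes evenly spaced x values
    # 1st term: integral of the squared second derivative (central differences)
    roughness = 0.0
    for i in range(1, n - 1):
        second = (curve[i - 1] - 2.0 * curve[i] + curve[i + 1]) / (h * h)
        roughness += second * second * h
    # 2nd term: weighted squared deviation of the curve from the data
    misfit = sum(w * (y - s) ** 2 for w, y, s in zip(weights, ys, curve))
    return alpha * roughness + (1.0 - alpha) * misfit, roughness, misfit


# Hypothetical noisy data scattered around y = x
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.2, 0.9, 2.3, 2.8, 4.1, 5.2, 5.8]
wiggly = list(ys)        # passes through every data point
straight = list(xs)      # the stiff curve s(x) = x

for alpha in (0.1, 0.9):
    j_wiggly = spline_functional(xs, ys, wiggly, alpha)[0]
    j_straight = spline_functional(xs, ys, straight, alpha)[0]
    better = "wiggly" if j_wiggly < j_straight else "straight"
    print(f"alpha={alpha}: J(wiggly)={j_wiggly:.3f}, J(straight)={j_straight:.3f} -> {better}")
```

With a low α the curve through every point scores better; with a high α the straight line does, which is the balance between accuracy and smoothness described above.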
Figure panels: wiggly (more precision, less smoothness); non-wiggly (less precision, more smoothness).
LOESS (locally weighted regression)
LOESS consists of approximating the data locally through a low-order (typically linear) polynomial regression, while weighting all the data points in such a way that points close to the location of interest contribute more strongly than data points farther away (local weighting).
Its linear case finds, for each location x, the parameters a and b that minimize the weighted least-squares expression:
$$\chi^2 = \sum_i w(x - x_i; h) \, (a + b x_i - y_i)^2$$
where a + b x_i - y_i is the deviation of the local line from the data point (x_i, y_i), h is the bandwidth, and w is the weight function, usually a smooth and peaked kernel such as
$$K(x) = \left( 1 - |x|^3 \right)^3 \text{ for } |x| < 1; \quad 0 \text{ otherwise}$$
Notice how the weight function is sensitive to the distance between the point x and all the other points x_i.
LOESS is computationally intensive, as the entire calculation must be performed for every point at which we want to obtain a smoothed value.
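A minimal sketch of the linear case, assuming the tricube kernel applied to (x - x_i)/h as the weight function. The names `tricube` and `loess_point` and the sample data are illustrative assumptions, not a reference implementation.

```python
def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 for |u| < 1, 0 otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, h):
    """Smoothed value at x0 from a locally weighted linear fit with bandwidth h."""
    ws = [tricube((x0 - x) / h) for x in xs]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    denom = sw * swxx - swx * swx
    if abs(denom) < 1e-12:
        raise ValueError("bandwidth too small: fewer than two distinct points carry weight")
    b = (sw * swxy - swx * swy) / denom   # local slope
    a = (swy - b * swx) / sw              # local intercept
    return a + b * x0


# Hypothetical noisy data; the whole weighted fit is repeated for every output point
xs = [float(i) for i in range(11)]
ys = [2.1, 4.8, 8.3, 10.9, 14.2, 16.8, 20.1, 23.2, 25.7, 29.1, 31.9]
smoothed = [loess_point(x, xs, ys, h=3.0) for x in xs]
print([round(s, 2) for s in smoothed])
```

The list comprehension at the end makes the computational cost visible: each smoothed value requires its own pass over all the data points.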
LOESS (locally weighted regression)
For example, in 1970, men in the USA were drafted based on their date of birth, following a sequence ranging from 1 to 366 determined by a lottery process.
Soon, complaints were raised that the lottery was biased: men born later in the year had a greater chance of receiving a low draft number and thus of being drafted early.
As can be seen, the plot of the points shows no evidence of bias or of any kind of pattern.
However, if LOESS is used to represent the data as a smooth curve, it becomes evident that the data is biased.
In the plot, the solid line corresponds to h=5, while the dashed line corresponds to h=100; this large value makes LOESS behave like a simple linear regression.
This example demonstrates that a flexible smooth curve can reveal more detail than a stiff curve such as a straight line, which provides only a global view with less detail.
LOESS (locally weighted regression)
Another example: consider the finishing times of the winners of a marathon, separated into men and women, with data from 1900 to 1990 and prediction points extending beyond 2000.
In this example, the stiff curves wrongly suggest that women's times will keep improving at a dramatic pace and eventually beat the men's.
The smooth curves show that women's times tend to stabilize near the year 2000.
Residuals
Residuals are what remains when you subtract the smooth curve from the actual data.
They should be balanced, that is, symmetrically distributed around zero, preferably following a Gaussian distribution with mean zero.
This figure shows the residuals for the marathon data (women only), for LOESS and for linear regression.
LOESS shows smaller residuals, while the line shows larger residuals and an increasing trend in the error.
• It is important to analyze the residuals in order to verify the adequacy of the smooth curve
• Good residuals should straddle zero across all the data points and should not present trends, for instance increasing or decreasing values
• Trends may reveal that the smooth curve is not adequate, or that it is adequate only for part of the data domain
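The residual checks listed above can be sketched numerically. Here a straight line is fitted to hypothetical curved data (y = x^2, chosen only for illustration); counting sign changes is a simple ad hoc indicator assumed for this sketch, not a formal statistical test.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for f(x) = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


def count_sign_changes(values):
    """Number of times consecutive nonzero values switch sign."""
    signs = [v > 0 for v in values if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


# Hypothetical curved data, deliberately modeled with a straight line
xs = list(range(11))
ys = [x * x for x in xs]
a, b = fit_line(xs, ys)

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)
sign_changes = count_sign_changes(residuals)

print("residuals:", residuals)
print("mean residual:", mean_residual)
print("sign changes:", sign_changes)
```

The residuals average to zero, yet they are positive at both ends and negative in the middle, with only two sign changes across eleven points. A systematic pattern like this signals that the straight line is not an adequate model, even though the mean alone looks fine.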
Logarithmic plots
Logarithmic plots are based on the fundamental properties of logarithms, which turn products into sums and powers into products:
$$\log(xy) = \log x + \log y$$
$$\log x^k = k \log x$$
There are single, or semi-logarithmic, plots and double, or log-log, plots, depending on whether only one or both axes have been scaled logarithmically.
For example, consider the function y = C exp(αx), where C and α are constants; its single-log plot is given by log y = log C + αx, which is a line with slope α.
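A short sketch of the same idea in code: taking the logarithm of y turns the exponential into a line whose slope is α and whose intercept is log C. The constants, sample points, and helper name `fit_line` are assumptions chosen for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for f(x) = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


# Hypothetical constants for y = C * exp(alpha * x)
C_true, alpha_true = 2.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [C_true * math.exp(alpha_true * x) for x in xs]

# Semi-logarithmic view: log y = log C + alpha * x
log_ys = [math.log(y) for y in ys]
log_C, alpha = fit_line(xs, log_ys)
C = math.exp(log_C)
print(f"alpha = {alpha:.3f}, C = {C:.3f}")
```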
Logarithmic plots
Example
In the example, 3 functions: f(x) = 10^x, f(x) = x, and f(x) = log(x).
Observe how the axes are scaled and how the curves turn into straight lines.
Logarithmic plots
Example: here the use of a logarithmic scale makes it possible to compare values that span a large range.
Logarithmic plots
Double logarithmic plots have the ability to reveal power-law relationships as straight lines.
Example: consider the heartbeat rate of mammals whose weight ranges from a few kilograms to 120 tons (the whale).
Figure panels: simple plot (left); log-log plot (right).
• In this example, the log-log plot reveals a line with slope -1/4, the signature of the underlying power-law relationship
• It means that heart_rate ∼ mass^(-1/4) (left picture), whose logarithmic plot is given by log(heart_rate) = -1/4 log(mass) + constant (right picture)
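The slope of a power law can be read off numerically by fitting a line to the logarithms of both variables. The sketch below uses synthetic values generated from an assumed prefactor of 240; they are not the mammal measurements shown in the figure.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for f(x) = a + b*x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b


# Synthetic power law: rate = prefactor * mass^(-1/4); the prefactor is hypothetical
prefactor = 240.0
masses = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 100000.0]
rates = [prefactor * m ** -0.25 for m in masses]

# Log-log view: log(rate) = log(prefactor) + exponent * log(mass)
log_m = [math.log10(m) for m in masses]
log_r = [math.log10(r) for r in rates]
intercept, exponent = fit_line(log_m, log_r)
print(f"exponent = {exponent:.3f}, prefactor = {10 ** intercept:.1f}")
```

The fitted slope recovers the exponent and the intercept recovers the logarithm of the prefactor, which is the constant term in the log-log relation.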
Scaling for better visualization
Another technique to improve the power of a plot is to scale one, or both, of its axes.
For example, consider a data set of the annual sunspot count from the year 1700 to the year 2000.
Although one can see a cyclic behavior, some important details are not evident.
Scaling for better visualization
The same data set can be better visualized if either the horizontal axis or the vertical axis is scaled.
Figure panels: vertical-axis scale; horizontal-axis scale (sliced to fit).
Some authors call this technique “banking” (?!)
Example, modeling two-variable data
Mass as a function of height
Consider a dataset with two attributes, the height and the mass of individuals.
Mass as a function of height
What about a linear model to represent such data?
The model represents the data reasonably well, but let's take a closer look.
Mass as a function of height
What about a logarithmic plot?
• Surprisingly, the cubic function represents the data a lot better
• Actually, this is no surprise: the mass of a body is proportional to its volume, that is, to height times width times depth, or h · w · d
• Body proportions are pretty much the same for all humans: a person who is twice as tall as another will have shoulders that are twice as wide, too
• It follows that the volume of a person's body (and hence its mass) scales as the third power of the height: mass ∼ height^3
Mass as a function of height
Now back to the non-logarithmic plot and the cubic model, with final parameters obtained by trial and error.
• The model seems a lot better now, but it has some limitations for very small and very large heights
• Despite that, it can reasonably be used for prediction and for understanding the data
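The slides obtain the final parameters by trial and error. As an alternative sketch (a different technique, named plainly), the single coefficient k of the pure cubic model mass = k · height^3 can be estimated by least squares, which gives k = Σ m_i h_i^3 / Σ h_i^6. The heights, the value of k, and the function name `fit_cubic_coefficient` are hypothetical, and the model deliberately omits any offset term.

```python
def fit_cubic_coefficient(heights, masses):
    """Least-squares k for the one-parameter model mass = k * height**3."""
    numerator = sum(m * h ** 3 for h, m in zip(heights, masses))
    denominator = sum(h ** 6 for h in heights)
    return numerator / denominator


# Synthetic data generated from a hypothetical k (heights in meters, masses in kg)
k_true = 13.0
heights = [1.5, 1.6, 1.7, 1.8, 1.9]
masses = [k_true * h ** 3 for h in heights]

k = fit_cubic_coefficient(heights, masses)
print(f"k = {k:.3f}")
print(f"predicted mass at 1.75 m: {k * 1.75 ** 3:.1f} kg")
```

With real, noisy data the estimate would not reproduce any "true" value exactly, and the limitations at very small and very large heights noted above would still apply.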
Example, optimizing two-variable data
Consider a group of people scheduled to perform some task.
The amount of work that this group can perform in a fixed amount of time (its “throughput”) is proportional to the number n of people on the team: ∼ n
However, the members will have to coordinate with each other. Let's assume that each member of the team needs to talk to every other member at least once a day. This leads to a communication overhead ∼ -n^2 (the minus sign indicates a loss in throughput).
There is an optimal number of people for which the realized productivity is highest. What is this number?
Consider that the problem can be modeled as:
$$P(n) = cn - dn^2$$
where n is the number of people, c is the number of minutes of work each person can produce per day, and d is the number of minutes taken by each communication event.
Graphically, we can analyze the problem with three curves:
• raw throughput: cn
• communication overhead: dn^2
• P(n) = cn - dn^2
• But what is the best number?
• From the plot we see that P(n) has a local maximum
• How can this maximum be determined?
At a local maximum the derivative is 0, so to find the maximum we take the derivative of P(n), set it equal to 0, and solve for n.
The result is n_optimal = c/(2d)
P'(n) = c - 2dn
c - 2dn = 0
n = c/(2d)
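The analytic result can be cross-checked by evaluating P(n) over a range of team sizes. The values of c and d below are hypothetical, chosen only so that the numbers come out round.

```python
def productivity(n, c, d):
    """Realized productivity P(n) = c*n - d*n**2."""
    return c * n - d * n ** 2


# Hypothetical values: 480 productive minutes per person per day,
# 10 minutes per communication event
c, d = 480.0, 10.0

n_optimal = c / (2 * d)                  # from P'(n) = c - 2*d*n = 0
best_n = max(range(1, 61), key=lambda n: productivity(n, c, d))

print("analytic optimum:", n_optimal)
print("best integer team size:", best_n)
print("P at the optimum:", productivity(best_n, c, d))
```

Because P(n) is a downward-opening parabola, the brute-force search over integers lands on the same value as the derivative-based formula whenever c/(2d) is a whole number; otherwise it picks the nearest integer with the higher productivity.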