CSC323 – Week 3
Jan 01, 2016
Outline
Quiz
Associations between two variables
• Scatter plots
• Correlation coefficient
Linear regression analysis
Association between two variables
Example: University fees for the Big Ten Universities
Data were collected to study the association between the percentage of students from out of state and the tuition paid by nonresident students (in thousands of dollars).
Does tuition increase with the percentage of nonresident students?
University        Tuition ($1,000) (Y)   Nonresidents (%) (X)
Northwestern      16.4                   72
Illinois           7.6                    8
Minnesota          8.7                   23
Ohio State         9.3                    9
Penn State        10.7                   18
Purdue             9.6                   27
Indiana           10.2                   29
Iowa               8.6                   31
Wisconsin          9.1                   35
Michigan          15.9                   30
Michigan State    10.5                    9
Example: Size of diamond and price of ring
The source of the data is a full-page advertisement placed in the Straits Times newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry.
The variables are the size of the diamond in carats (1 carat = 0.2 gram) and the price of ladies’ rings (single diamond stone) in Singapore dollars.
Carats   Singapore dollars
.17      355
.16      328
.17      350
.18      325
.25      642
…        …
How would you describe the association between the two variables?
Association between variables
Data are collected for the two variables on each individual/unit.
Two variables are associated if changes in one variable correspond to changes in the second variable.
If there is a strong association, knowing one variable helps predict the other.
Example: number of programs running and CPU usage
If the association is weak, information about one variable is not very useful in studying the other.
Example: number of users and CPU usage
Useful terminology
The following terms are often used:
Response variable (dependent variable): measures the outcome of the study
Explanatory variable (independent variable): explains or causes changes in the response variable
Can you identify this distinction in the examples shown earlier?
1) Tuition = Response variable; Nonresidents = Explanatory variable
2) Carat = Explanatory variable; Price = Response variable
Scatter plots: displaying data about two variables
Scatter plots show the relationship between two quantitative variables.
One variable (the independent variable) appears on the x-axis (horizontal axis) and the dependent variable appears on the y-axis (vertical axis). Each observation is represented by a point in the plot.
[Figure: scatter plot of the Big Ten data; axes: Tuition, Nonresident students; labeled points: NWU, UMich]
Interpreting scatter plots
1. Look for the overall pattern and for striking deviations
2. Define form, direction and strength of the relationship:
a. Form: roughly linear if the points follow a straight line, or nonlinear…
b. Direction: positive or negative?
c. Strength: how closely the points follow a clear form
3. Check for the presence of outliers, individual values that fall outside the overall pattern
4. Two variables are positively (negatively) associated if an increase in one variable corresponds to an increase (decrease) in the other variable.
2000 Presidential Elections
Did the butterfly ballots confuse voters? Did voters for Al Gore instead cast their votes for other candidates?
Bush spokesman Ari Fleischer stated on Nov. 9 that "Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there."
What is the level of support that Pat Buchanan enjoys in Palm Beach County?
The published election results show the association between the vote totals for Pat Buchanan and the total population for Florida counties.
Is the association positive or negative? Is the form of the relationship almost linear?
Example: House data in Albuquerque (NM) in 1993
[Figure: scatter plot; axes: Selling price ($100), Annual taxes ($)]
Interpret the graph: form, direction & strength of the relationship
Another example: The statistics of poverty and inequality
Data from the U.N.E.S.C.O. 1990 Demographic Year Book.
For 97 countries in the world, data are given for birth rates and for an index of the Gross National Product.
The previous plot shows a nonlinear association! Sometimes we can make it linear by applying a transformation to the variables. Possible transformations are, for example, “ln”, “exp”, “sqrt”. Here we consider ln(GNP), the natural log of GNP.
[Figure: scatter plot; axes: Birth rate (1,000 pop), Log G.N.P.]
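The effect of a log transformation can be sketched in Python. The numbers below are made up for illustration and are not the U.N.E.S.C.O. data: the hypothetical birth rates are constructed to be exactly linear in ln(GNP), so the association looks curved against GNP itself but perfectly linear against ln(GNP). The helper `pearson_r` is an assumed name for a plain implementation of the correlation coefficient.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical GNP index values (illustrative only, not real data).
gnp = [100, 200, 500, 1000, 2000, 5000, 10000, 20000]
# Hypothetical birth rates, built to be exactly linear in ln(GNP).
birth_rate = [60 - 5 * math.log(g) for g in gnp]

log_gnp = [math.log(g) for g in gnp]

r_raw = pearson_r(gnp, birth_rate)      # curved relationship
r_log = pearson_r(log_gnp, birth_rate)  # linear after the transformation

print(f"r using GNP:     {r_raw:.3f}")
print(f"r using ln(GNP): {r_log:.3f}")
```

With real data the transformed points would not fall exactly on a line, but the same comparison shows whether the transformation makes the cloud of points more linear.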
Measure of Linear Association
If there is a strong linear association between the variables, then the cloud of points on the scatter plot will be close to a line.
[Figure: scatter plot; axes: Birth rate (1,000 pop), Log G.N.P.]
The Correlation Coefficient r
The correlation coefficient r measures the direction and the strength of the linear relationship between two variables.
• It is a value between –1 and 1.
• The closer r is to 1 or –1, the stronger the linear association is.
• Positive values of r imply a positive association, negative values imply a negative association.
• Values of r close to 0 imply weak linear association.
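It is defined as
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
where X has average $\bar{x}$ and standard deviation $s_x$, and Y has average $\bar{y}$ and standard deviation $s_y$.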
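As a minimal sketch, the definition of r can be computed directly in Python for the Big Ten data shown earlier. The function name `correlation` is chosen here for illustration, and the code assumes that the standard deviations are sample standard deviations (divisor n – 1), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Assumption: sample standard deviations (divide by n - 1).
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Big Ten data from the table above.
nonresidents = [72, 8, 23, 9, 18, 27, 29, 31, 35, 30, 9]                   # X, in %
tuition = [16.4, 7.6, 8.7, 9.3, 10.7, 9.6, 10.2, 8.6, 9.1, 15.9, 10.5]  # Y, in $1,000

r = correlation(nonresidents, tuition)
print(f"r = {r:.2f}")
```

Swapping the two arguments gives the same value, since the formula treats X and Y symmetrically.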
Examples of correlation
[Figure: Birth rate (1,000 pop) vs Log G.N.P.; r = -0.74, negative association]
[Figure: Annual taxes ($) vs Selling price ($100); r = 0.65, positive association]
Diamond rings data
[Figure: scatter plot "Diamond carats vs Price in US$"; axes: Carat, Price in US dollars; strong positive association, r = 0.989]
N=48              Average    s.d.     Min    Max
X  Carat          0.20       0.056    0.12   0.35
Y  Price in US $  865.144    213.64   385    1879
Positive Correlation
In each plot there are 100 points. The correlation coefficient measures the amount of clustering around a line.
If r is close to 1, then points lie close to a straight line!!
Negative Correlation
Negative correlation: as x increases, y tends to decrease.
If r is close to –1, then points lie close to a straight line!!
Guess the correlation
Match the diagrams with the following correlations: –0.93, –0.75, –0.20, 0.27, 0.63, 1.0
Change of scale
These are the low and high temperatures in Boulder (CO) for the month of April 1996. The first scatter plot uses degrees Fahrenheit and the second plot uses degrees centigrade. Notice that °C = 5/9 × (°F – 32).
Are the correlations between low and high temperatures in the two graphs different?
r = 0.74 r = ?
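The question can be checked numerically with a small Python sketch. The temperatures below are hypothetical values, not the Boulder data, and `pearson_r` and `to_centigrade` are names chosen here for illustration. Because the conversion is a linear change of units with a positive slope, the two correlations should agree up to rounding error.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def to_centigrade(f):
    """Convert degrees Fahrenheit to degrees centigrade."""
    return 5 / 9 * (f - 32)

# Hypothetical daily low and high temperatures in Fahrenheit (illustrative only).
lows_f = [28, 31, 35, 30, 40, 45, 38, 33]
highs_f = [50, 58, 61, 49, 70, 72, 66, 55]

lows_c = [to_centigrade(t) for t in lows_f]
highs_c = [to_centigrade(t) for t in highs_f]

r_fahrenheit = pearson_r(lows_f, highs_f)
r_centigrade = pearson_r(lows_c, highs_c)

print(f"r in Fahrenheit: {r_fahrenheit:.4f}")
print(f"r in centigrade: {r_centigrade:.4f}")
```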
Different correlations?
In which diagram below is the correlation coefficient the largest? The smallest?
Outliers and nonlinear association
How are the data sets different?
Plot the data: the nature of the association between x and y is very different. The correlation coefficient can be misleading in the presence of outliers or nonlinear association. Check the scatter plot of the data.
Perfect association! Why is r not equal to 1?
Outliers change the value of r. What would the value of r be without the outliers?
r = 0.82
Which of the following diagrams should be summarized by r?
(1) (2) (3)
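Both pitfalls can be illustrated with a short Python sketch on made-up data (these are not the data sets plotted on the slides). In the first case y is completely determined by x through a curve, yet r is 0. In the second, points lying exactly on a line give r = 1 until a single outlier is added. The helper name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Nonlinear association: y = x^2 on x values symmetric around 0.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Points exactly on the line y = 2x + 1.
x_line = [1, 2, 3, 4, 5, 6, 7, 8]
y_line = [2 * x + 1 for x in x_line]
r_line = pearson_r(x_line, y_line)

# Same points plus one hypothetical outlier at (9, 0).
r_with_outlier = pearson_r(x_line + [9], y_line + [0])

print(f"curved data:        r = {r_curve:.3f}")
print(f"line, no outlier:   r = {r_line:.3f}")
print(f"line, with outlier: r = {r_with_outlier:.3f}")
```

In both cases the scatter plot reveals what the single number r hides.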
Ecological Correlations
Ecological correlations are based on rates or averages. They can be misleading as they tend to overstate the strength of the association. The following example deals with the relationship between income and education level for individuals in 3 states (A, B, C).
This shows the averages. The correlation is almost 1!!
This shows individual data. The correlation is now moderate. Variability within each state!!!
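The gap between the two correlations can be reproduced with a small Python sketch. All numbers are hypothetical: three states with made-up average education (in years) and income (in $1,000), and five individuals per state who vary around their state's averages. The names `pearson_r`, `state_centers`, `edu_dev` and `inc_dev` are assumptions of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical state averages: (education in years, income in $1,000).
state_centers = {"A": (10, 20), "B": (12, 24), "C": (14, 28)}

# Hypothetical individual deviations from the state averages.
# Each list sums to zero, so the state averages stay at the centers above.
edu_dev = [-4, -2, 0, 2, 4]
inc_dev = [-8, 6, 0, -6, 8]

education, income = [], []
for edu_mean, inc_mean in state_centers.values():
    for de, di in zip(edu_dev, inc_dev):
        education.append(edu_mean + de)
        income.append(inc_mean + di)

avg_edu = [center[0] for center in state_centers.values()]
avg_inc = [center[1] for center in state_centers.values()]

r_averages = pearson_r(avg_edu, avg_inc)         # ecological correlation
r_individuals = pearson_r(education, income)     # individual-level correlation

print(f"correlation of state averages: {r_averages:.3f}")
print(f"correlation of individuals:    {r_individuals:.3f}")
```

Averaging removes the variability within each state, which is why the correlation based on averages overstates the strength of the individual-level association.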
Summary
The correlation coefficient r varies between –1 and 1. r = 0 means there is no linear association between X and Y. If r = 1 or –1, then the points in a scatter plot lie on a straight line.
Positive r indicates positive association between X and Y. Negative r indicates negative association between X and Y.
Both variables X and Y must be quantitative.
The correlation coefficient between X and Y is the same as the correlation between Y and X.
r does not change if we change the units of measurement for X and Y
The correlation measures only the linear relationship between two variables
r can be strongly affected by the presence of outliers.
Correlation does not mean Causation!!
The correlation between teachers’ salaries and the consumption of alcohol over a period of years turned out to be almost 0.90. Do the teachers drink?
Both variables moved together, because both are influenced by a third variable (confounding variable) which is the long run growth in national income and population.
A "bad example" published in "Science Times", the weekly science supplement of The New York Times, on August 22, 1989. It stated, "The experts have also developed startling evidence of the cat's renowned ability to survive, this time in the particular setting of New York City, where cats are prone at this time of year to fall from open windows in tall buildings. Researchers call the phenomenon feline high-rise syndrome." "Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories."
The following graph displays the number of radios in the U.K. from 1924 to 1937 and the number of mental defectives per 10,000 people for the same years.
A social scientist states: “as more people gave up intellectual pursuits like reading for listening to the radio, general atrophy of the brain set in and led to increased mental disability” ?!?!?!
Data mining
Search for patterns and associations that are hidden in very large databases.
For instance:
• Market basket data: purchases recorded by the cash scanners of a national retail chain
• Web logs data: logs of the visits to a certain website
Exploratory data analysis techniques are used to discover information from huge datasets!
Because of the very large dimension of the datasets, efficient algorithms are necessary to “mine” the data.
Data mining is cross-disciplinary: statistical methods made efficient by computer scientists!
Correlation is often used in data mining to construct the “association rules”, i.e. to learn about the associations among variables.
Association is often confused with causation in data mining!
A supermarket manager observes that there is a strong positive correlation between the sales of hamburgers and hotdogs, and between the sales of hotdogs and barbecue sauce.
He decides to sell hotdogs at a large discount, hoping to increase profit by simultaneously raising the price of the barbecue sauce.
What is the causal model (cause & effect) that is assumed by the manager?
Will the manager make money on this sale?