Overview of Data Informatics for Big Data
Summer 2017
HOMEWORK 2 Due Date: July 12, 2017. 1:00PM
1. A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = Advertising dollars and S = Sales in thousands $.
1) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.
2) Plot the scatter diagram and the regression line.
3) Explain how to interpret the slope of the line in this problem.
4) Find r^2 and interpret it in the words of the problem.
5) Use the line to predict the Sales if Advertising dollars = $50 K.
ANSWER: Let Sales be represented by y_i and Advertising dollars by x_i. The quantities needed for the linear regression and the coefficient of determination are:

i        AD (x)  S (y)  x - avg(x)  y - avg(y)  (x - avg(x))^2  (x - avg(x))*(y - avg(y))
1        22      64     -22.2       -56.2       492.84          1247.64
2        25      74     -19.2       -46.2       368.64          887.04
3        29      82     -15.2       -38.2       231.04          580.64
4        35      90     -9.2        -30.2       84.64           277.84
5        38      100    -6.2        -20.2       38.44           125.24
6        42      120    -2.2        -0.2        4.84            0.44
7        46      120    1.8         -0.2        3.24            -0.36
8        52      142    7.8         21.8        60.84           170.04
9        65      180    20.8        59.8        432.64          1243.84
10       88      230    43.8        109.8       1918.44         4809.24
Sum      442     1202                           3635.6          9341.6
Average  44.2    120.2

Part (1): See the chart below for the answer to part 1.

AD   S
22   64
25   74
29   82
35   90
38   100
42   120
46   120
52   142
65   180
88   230
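The sums in the table are enough to recover the regression line without the chart: the slope is the sum of cross-products divided by the sum of squared x-deviations, and the intercept follows from the two averages. The sketch below recomputes them from the raw data. It assumes both AD and S are in thousands of dollars, and the names `ad`, `sales`, and `predict` are illustrative rather than part of the assignment.

```python
# Data from the problem statement (assumed to be in thousands of dollars).
ad = [22, 25, 29, 35, 38, 42, 46, 52, 65, 88]
sales = [64, 74, 82, 90, 100, 120, 120, 142, 180, 230]

n = len(ad)
x_bar = sum(ad) / n      # average advertising spend
y_bar = sum(sales) / n   # average sales

# Sum of squared x-deviations and sum of cross-products,
# matching the "Sum" row of the table.
s_xx = sum((x - x_bar) ** 2 for x in ad)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad, sales))

# Least-squares estimates for the line S = intercept + slope * AD.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def predict(ad_thousands):
    """Predicted sales (thousands) for a given advertising spend (thousands)."""
    return intercept + slope * ad_thousands


print(f"S = {intercept:.3f} + {slope:.4f} * AD")
print(f"Predicted sales at AD = 50: {predict(50):.3f}")
```

With these data the slope comes out near 2.57 and the intercept near 6.63, which is consistent with the values used in parts (3) and (5).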
Part (3): The slope of the line is positive (about 2.57), which indicates that sales increase as advertising expenditure increases. That is, each additional $1,000 spent on advertising is associated with an increase of about $2,570 in sales.
Part (4): R^2 = 0.9927. About 99.27% of the variation in sales is explained by the linear relationship with advertising expenditure, so the linear regression equation fits the data very well. If we use the equation to predict sales from the amount spent on advertising, the prediction will be close to the actual sales amount.
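The R^2 value can be checked in two equivalent ways for simple linear regression: as the squared correlation between x and y, or as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal sketch of both, using the same data and the same assumed units as the problem; the variable names are illustrative.

```python
ad = [22, 25, 29, 35, 38, 42, 46, 52, 65, 88]
sales = [64, 74, 82, 90, 100, 120, 120, 142, 180, 230]

n = len(ad)
x_bar = sum(ad) / n
y_bar = sum(sales) / n

s_xx = sum((x - x_bar) ** 2 for x in ad)
s_yy = sum((y - y_bar) ** 2 for y in sales)   # total sum of squares
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad, sales))

# Squared correlation coefficient.
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Equivalent check: 1 - (residual sum of squares) / (total sum of squares).
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ad, sales))
r_squared_check = 1 - residual_ss / s_yy

print(f"R^2 = {r_squared:.4f}")
```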
Part (5): Using the equation from part (1), if the amount spent on advertising is $50K, then predicted sales are $135.104K.
2. Hierarchical Clustering:
Assume we are trying to cluster the following points using hierarchical clustering. Using Euclidean distance, draw a sketch of the hierarchical clustering tree (dendrogram) we would obtain for each of the linkage methods (single and complete, respectively).
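The points for this problem are not reproduced in the transcript, so the sketch below runs agglomerative clustering on four hypothetical points chosen only to show how single and complete linkage can produce different trees. The function name `agglomerate` and the point coordinates are assumptions made for illustration, not part of the assignment.

```python
from itertools import combinations
from math import dist

# Hypothetical points (NOT the points from the assignment figure).
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (4.5, 0.0)]


def agglomerate(points, linkage):
    """Merge clusters bottom-up and return the merge history.

    `linkage` is `min` for single linkage (closest pair of points) or
    `max` for complete linkage (farthest pair of points).
    Each history entry is (cluster_a, cluster_b, merge_distance).
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def gap(pair):
        a, b = pair
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


single = agglomerate(points, min)
complete = agglomerate(points, max)

for name, history in (("single", single), ("complete", complete)):
    print(name)
    for a, b, d in history:
        print(f"  merge {a} + {b} at distance {d:.2f}")
```

The merge distances are the heights at which branches join in the dendrogram. On these hypothetical points, single linkage chains the points together one at a time, while complete linkage first forms two pairs and then joins them.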
3. (25 pts) Consider a two-dimensional database D with the records: R1(2, 2), R2(2, 4), R3(4, 2), R4(4, 4), R5(3, 6), R6(7, 6), R7(9, 6), R8(5, 10), R9(8, 10), R10(10, 10). The distance function is the L1 distance (Manhattan distance). Show the results of the k-means algorithm at each step, assuming that you start with two clusters (k = 2) with centers C1 = (6, 6) and C2 = (9, 7).
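The iteration asked for here can be traced with a short script. This is a minimal sketch, assuming that points are assigned by L1 distance and that each center is updated to the coordinate-wise mean of its members (a k-medians variant would use medians instead). The helper names `manhattan`, `assign`, and `update` are illustrative.

```python
records = {
    "R1": (2, 2), "R2": (2, 4), "R3": (4, 2), "R4": (4, 4), "R5": (3, 6),
    "R6": (7, 6), "R7": (9, 6), "R8": (5, 10), "R9": (8, 10), "R10": (10, 10),
}


def manhattan(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def assign(centers):
    """Map each record to the index of its nearest center (ties go to the lower index)."""
    return {
        name: min(range(len(centers)), key=lambda c: manhattan(p, centers[c]))
        for name, p in records.items()
    }


def update(labels, k):
    """Recompute each center as the mean of its members (assumes no cluster is empty)."""
    centers = []
    for c in range(k):
        members = [records[name] for name, label in labels.items() if label == c]
        centers.append((
            sum(x for x, _ in members) / len(members),
            sum(y for _, y in members) / len(members),
        ))
    return centers


centers = [(6, 6), (9, 7)]
labels = None
history = []  # one (labels, centers) entry per step that changed the assignment

while True:
    new_labels = assign(centers)
    if new_labels == labels:  # no point moved, so the algorithm stops
        break
    labels = new_labels
    centers = update(labels, k=2)
    history.append((dict(labels), list(centers)))

for step, (step_labels, step_centers) in enumerate(history, start=1):
    c1 = [name for name, label in step_labels.items() if label == 0]
    c2 = [name for name, label in step_labels.items() if label == 1]
    rounded = [(round(x, 2), round(y, 2)) for x, y in step_centers]
    print(f"step {step}: C1={c1} C2={c2} centers={rounded}")
```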
Answer: The first step assigns points 1, 2, 3, 4, 5, 6, and 8 to C1 and the other points to C2. The new centers are (3.86, 4.86) and (9, 8.67). (15 pts)
In the next step, point 8 moves from C1 to C2. The new centers are (3.67, 4) and (8, 9). (5 pts)
In the next step, point 6 moves from C1 to C2. After that move the algorithm stops. The final clusters are points (1, 2, 3, 4, 5) and (6, 7, 8, 9, 10). (5 pts)
4. (25 pts) k-Means Clustering: For the following six points,
      X     Y
A1    1.00  2.00
A2    1.00  4.00
A3    3.00  1.00
A4    3.00  5.00
A5    5.00  2.00
A6    5.00  4.00
1) (10 pts) Use the k-means algorithm to show the final clustering result assuming initially we assign A1, A6 as the center of each cluster, respectively.
2) (10 pts) Use the k-means algorithm to show the final clustering result assuming initially we assign A3, A4 as the center of each cluster, respectively.
3) (5 pts) Compute the quality of the k-means clustering using the Sum of Squared Error (SSE), a cohesion measure that shows how near the data points in a cluster are to the cluster centroid. Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares.
$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where µi is the mean of points in Si.
Based on SSE of 1) and 2), which clustering would be better?
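Parts 1) to 3) can be checked with a small script that runs k-means from each pair of initial centers and then evaluates the SSE formula above. This is a sketch under the assumption that Euclidean distance is used (the problem does not name a distance, but the SSE is defined with squared Euclidean norms); the function names `kmeans` and `sse` are illustrative.

```python
from math import dist

points = {
    "A1": (1.0, 2.0), "A2": (1.0, 4.0), "A3": (3.0, 1.0),
    "A4": (3.0, 5.0), "A5": (5.0, 2.0), "A6": (5.0, 4.0),
}


def kmeans(initial_centers):
    """Run k-means until assignments stop changing; return (labels, centers)."""
    centers = list(initial_centers)
    labels = None
    while True:
        # Assignment step: nearest center by Euclidean distance.
        new_labels = {
            name: min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for name, p in points.items()
        }
        if new_labels == labels:
            return labels, centers
        labels = new_labels
        # Update step: each center becomes the mean of its members
        # (assumes no cluster becomes empty, which holds for these inputs).
        centers = []
        for c in range(len(initial_centers)):
            members = [points[n] for n, label in labels.items() if label == c]
            centers.append((
                sum(x for x, _ in members) / len(members),
                sum(y for _, y in members) / len(members),
            ))


def sse(labels, centers):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    return sum(dist(points[n], centers[c]) ** 2 for n, c in labels.items())


labels_1, centers_1 = kmeans([points["A1"], points["A6"]])
labels_2, centers_2 = kmeans([points["A3"], points["A4"]])
sse_1 = sse(labels_1, centers_1)
sse_2 = sse(labels_2, centers_2)

print("start A1, A6:", labels_1, f"SSE = {sse_1:.3f}")
print("start A3, A4:", labels_2, f"SSE = {sse_2:.3f}")
```

Under these assumptions the A1/A6 start groups {A1, A2, A3} and {A4, A5, A6}, while the A3/A4 start groups {A1, A3, A5} and {A2, A4, A6}. The clustering with the smaller SSE is the more cohesive of the two.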