Top Banner
Big Data Analytic on Clickstream Data Group Name: Big Party Group Member: Taineng Lu Zheming Yuan Shenglei Zhao Tianmiao Zhou Instructor: Prof. Dobin Yim
18

Clickstream Big Party PPT

Feb 10, 2017

Download

Documents

Shenglei Zhao
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Clickstream Big Party PPT

Big Data Analytic on Clickstream Data

Group Name: Big Party Group Member: Taineng Lu

Zheming Yuan Shenglei Zhao

Tianmiao ZhouInstructor: Prof. Dobin Yim

Page 2: Clickstream Big Party PPT

Introduction

• Objective:

Predict customer’s purchasing decision

Cluster customers to help marketing target

• Source data

• Clickstream data

• Product Purchase history

• Demographic data

Page 3: Clickstream Big Party PPT

Problem Statement

• Focuses on utilizing the Amazon Web Service to manipulate the dataset, and apply data mining on variables related to user’s browsing behavior as well as user’s demographic variables to explore insights from users pool

• Hypothesis: Customers’ purchase decision is determined by their browsing habit and demographic profile

Page 4: Clickstream Big Party PPT

Data Manipulation• Three data sets:

• User’s browsing session data

• User’s demographic data

• User’s purchasing data

• Source data is organized by making each row a summary of session

• Analysis based on

• Defining the combination of each unique user and date a “user-date”

• Grouping data by each “user-date” and analyzed average duration per session, average number of page viewed per session, duration per page, percentage of empty records in reference domain and percentage of search engine records in reference domain

Page 5: Clickstream Big Party PPT

Data Manipulation

• Data manipulation was done through Hive by creating tables to record each file and joining several tables to combine all variables together

Original Sample Data Sample Hive Scripts for analysis & data manipulation

Page 6: Clickstream Big Party PPT

Variables

Page 7: Clickstream Big Party PPT
Page 8: Clickstream Big Party PPT
Page 9: Clickstream Big Party PPT
Page 10: Clickstream Big Party PPT
Page 11: Clickstream Big Party PPT

Amazon Web Service Machine Learning

1.Latest cloud machine learning as service

2.Loading Problem

3.Automatic Algorithm

Page 12: Clickstream Big Party PPT

Amazon Web Service Machine Learning

Sample Prediction Model Result.

Page 13: Clickstream Big Party PPT

Machine Learning Results in SPSS Modeler

Page 14: Clickstream Big Party PPT

Machine Learning Results in SPSS Modeler

1.Accuracy for testing data is 67.12%, which is not at a satisfying level.

Model needs improvement on variables or algorithm choice.

2. Balanced data has 52.55% portion of records with target ‘0’. Model

still has increased the accuracy of prediction compared to a

predictor which predicts majority class always.

Page 15: Clickstream Big Party PPT

Machine Learning Results in SPSS Modeler

1.From the predictor importance graph, page viewed per session is the most important factor in prediction.

2.Browsing behavior pattern related data has more influence on prediction than demographic ones

Page 16: Clickstream Big Party PPT

Machine Learning Results in SPSS Modeler

1.Clustering model used K-means algorithm and finally got 3 groups with quality score of 0.6, at a good level.

2.Predictor Importance graph shows the most important factors differentiate customers are percentage of search engine and percentage of empty records in reference domain.

Page 17: Clickstream Big Party PPT

Cluster Information1. Cluster 1 has relatively high percentage of empty records and extremely low percentage of search engine. It indicates that they might be direct buyer or knowledge builder, who knows where to find target information or products.

2. Cluster 3 has relatively low percentage of empty records and high percentage of search engine. It indicates that they might be hedonic browsers or searchers who depend on search engine to find interesting information.

Page 18: Clickstream Big Party PPT

Implication

1.Selected variables can help us improve accuracy of prediction and gain some insights into behavior pattern of customers

2.More variables can be added into prediction and clustering to measure other aspects of customers. (e.g. variety of page/domain/reference domain categories, returning records in reference domain)

3.More detailed target class could be applied, instead of buyer and non-buyer. (e.g. Hedonic browser, searcher, direct buyers and knowledge builder)