DAT205: Advanced Data Mining Using SQL Server 2000
ZhaoHui Tang, Program Manager, SQL Server Analysis Services, Microsoft Corporation
Agenda
• Microsoft Data Mining Algorithms
• OLE DB for DM: data mining query
• Data Mining Case Study: Click Stream Analysis
– Customer Segmentation
– Site affiliation
– Target ads in banner
• Performance of Microsoft Data Mining Algorithms
• Q&A
Data Mining Algorithms in SQL Server 2000
Decision Tree
• Popular technique for
• A popular method for customer segmentation, mailing list, profiling…
• Algorithm process
– Assign a set of initial points
– Assign an initial cluster to each point
– Assign data points to each cluster with a probability
– Compute a new central point based on weighted computation
– Cycle until convergence
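The algorithm process above can be sketched in a few lines of Python. This is a minimal illustration only, not the Microsoft implementation: it assumes one-dimensional data, equally weighted clusters with unit variance, and hypothetical names (`em_1d`, `points`, `means`).

```python
import math

def em_1d(points, means, iterations=50, tol=1e-6):
    """Soft-assignment EM for 1-D data (assumes equal-weight, unit-variance clusters)."""
    means = list(means)
    resp = []
    for _ in range(iterations):
        # Assign each data point to each cluster with a probability.
        resp = []
        for x in points:
            logs = [-0.5 * (x - m) ** 2 for m in means]
            top = max(logs)  # subtract the maximum to avoid underflow
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp.append([w / total for w in weights])
        # Compute each new central point as a probability-weighted mean.
        new_means = []
        for k, old in enumerate(means):
            mass = sum(r[k] for r in resp)
            if mass == 0:
                new_means.append(old)  # keep the old centre if the cluster is empty
            else:
                new_means.append(sum(r[k] * x for r, x in zip(resp, points)) / mass)
        # Cycle until the centres stop moving.
        shift = max(abs(a - b) for a, b in zip(means, new_means))
        means = new_means
        if shift < tol:
            break
    return means, resp

# Hypothetical data: two loose groups around 1 and 5.
points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
centres, memberships = em_1d(points, means=[0.0, 6.0])
```

Each row of `memberships` sums to 1, which is the "with a probability" part of the process: a point is never assigned to a single cluster outright.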
EM Illustration
[Figure not reproduced]
Microsoft Clustering Algorithm (Scalable EM)
[Flowchart labels: Data; Fill Buffer; Build/Update Model; Compressed data, Sufficient stats; Identify Data to be Compressed; Stop?; Final Model]
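The "sufficient stats" box can be illustrated with a small sketch. The slide does not say which statistics are kept, so this assumes the usual choice for Gaussian clusters (count, sum, and sum of squares); the function names are hypothetical.

```python
def compress(points):
    """Summarise a block of 1-D points as (count, sum, sum of squares)."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge(a, b):
    """Combine two summaries; the raw points are no longer needed."""
    return tuple(x + y for x, y in zip(a, b))

def mean_and_variance(stats):
    """Recover the mean and population variance from a summary."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean * mean

# Two buffers of data are compressed separately, then merged.
first = compress([1.0, 2.0])
second = compress([3.0, 4.0])
mean, variance = mean_and_variance(merge(first, second))
```

Because summaries add, a buffer can be discarded once it is compressed, which is what lets the model be updated over data that does not fit in memory.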
OLE DB for Data Mining
OLE DB for DM
• Industry standard for data mining
• Based on existing technologies
– SQL
– OLE DB
• Define common concepts for DM
– Case, Nested Case
– Mining Model
– Model Creation
– Model Training
– Prediction
• Language based API
Customer Table
Customer ID  Profession  Income  Gender  Risk
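To make the case and nested case concepts concrete, here is a rough Python sketch. The column names follow the Customer Table header; the row values, the `purchases` table, and `build_cases` are hypothetical and only for illustration.

```python
from collections import defaultdict

# Hypothetical case table: one row per customer (one case per row).
customers = [
    {"customer_id": 1, "profession": "Engineer", "income": 85000, "gender": "F", "risk": "Low"},
    {"customer_id": 2, "profession": "Teacher", "income": 50000, "gender": "M", "risk": "Medium"},
]

# Hypothetical nested table: many rows per customer.
purchases = [
    {"customer_id": 1, "product": "TV"},
    {"customer_id": 1, "product": "VCR"},
    {"customer_id": 2, "product": "Radio"},
]

def build_cases(customers, purchases):
    """Attach each customer's purchase rows as a nested table on the case."""
    nested = defaultdict(list)
    for row in purchases:
        nested[row["customer_id"]].append({"product": row["product"]})
    return [dict(c, purchases=nested.get(c["customer_id"], [])) for c in customers]

cases = build_cases(customers, purchases)
```

Each element of `cases` is one case; its `purchases` list plays the role of the nested case.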
• Tabular data to provide metadata information
• List of Schema Rowsets in OLE DB for DM
– Mining_Services
– Mining_Service_Parameters
– Mining_Models
– Mining_Columns
– Mining_Model_Contents
– Model_Content_PMML
Mining Model Contents Schema Rowsets
Topcount((select URLCategory, $adjustedProbability as prob
From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From
WebLog PREDICTION JOIN (select (select 'Business' as URLCategory) union (select 'Telecom' as URLCategory) as WebClick) as input
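What the Topcount call does to the prediction rows can be sketched in Python: rank the rows by a column and keep the first n. The rows and probabilities below are invented for illustration, and `top_count` is a hypothetical helper, not part of OLE DB for DM.

```python
def top_count(rows, rank_column, n):
    """Return the n rows with the largest value in rank_column."""
    return sorted(rows, key=lambda row: row[rank_column], reverse=True)[:n]

# Hypothetical prediction output: one row per candidate URL category.
predictions = [
    {"URLCategory": "Sports", "prob": 0.12},
    {"URLCategory": "Finance", "prob": 0.31},
    {"URLCategory": "News", "prob": 0.27},
    {"URLCategory": "Travel", "prob": 0.05},
]

best = top_count(predictions, "prob", 2)
```

In the query above the same idea is applied with n = 5 to the categories predicted for a visitor who has already clicked on 'Business' and 'Telecom'.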
Performance of DM Algorithms
DM Performance Study
• Joint effort between Unisys & Microsoft
• Two parts of the white paper:
– First part: Use AS2k to build DM Models for a banking business scenario
– Second part: Performance results of DM algorithms study
• Some results in this session…
• Details in the paper and SQL Server magazine articles…
Data Source for DMMs
Training Performance Results…
Sample Business Question for Non Nested MDT
1. Identify those customers that are most likely to churn (leave) based on customer demographic information.
Non Nested: Training Times for varying Number of Input attributes
[Chart: Training Time (minutes) vs. Number of Attributes]
Assumptions:
• 1 million cases
• 25 states
• 1 predictable attribute
I/P Attributes  Training Time (minutes)
10  4.08
20  7.27
50  31.54
100  40.55
200  129.35
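One way to read the table is to divide each training time by the number of input attributes. The sketch below does only that arithmetic on the figures from the slide; the function name is hypothetical, and nothing here is the presenter's own analysis.

```python
# (input attributes, training time in minutes), copied from the table above.
results = [(10, 4.08), (20, 7.27), (50, 31.54), (100, 40.55), (200, 129.35)]

def minutes_per_attribute(rows):
    """Training time divided by the number of input attributes, per row."""
    return [(attributes, minutes / attributes) for attributes, minutes in rows]

for attributes, cost in minutes_per_attribute(results):
    print(f"{attributes:>4} attributes: {cost:.3f} minutes per attribute")
```

If training time were strictly proportional to the number of attributes, the per-attribute figure would be constant across rows, so the spread in the printed values shows how far the measurements depart from that.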
Observations:
Non Nested: Training Times for varying Number of Cases
Sample Business Question for Nested MDT
2. Find the list of other products that the customer may be interested in based on the products the customer has purchased.
Nested Cases: Training Times for varying Sample size of Case Table
[Chart: Training Time (minutes) vs. Number of Master Cases]
Assumptions:
• Avg. customer purchases = 25
• States in nested = 200
• Nested key predictable
Observations:
Master Cases  Training Time (minutes)
10,000  15.09
50,000  67.79
100,000  120.88
200,000  240.62
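To check how close these four measurements are to a straight line, an ordinary least-squares fit can be computed by hand. This is a sketch over the table's figures only; `fit_line` is a hypothetical helper, and four points are too few to treat the fit as more than a rough summary.

```python
# (master cases, training time in minutes), copied from the table above.
measurements = [(10_000, 15.09), (50_000, 67.79), (100_000, 120.88), (200_000, 240.62)]

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(measurements)
print(f"about {slope * 1000:.2f} minutes per 1,000 master cases")
```

Comparing `slope * x + intercept` with each measured time gives the residuals, which is a quick way to see whether training time grows roughly linearly with the number of master cases.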
Nested Cases: Training Times for varying Number of Products purchased per customer
Assumptions:
• 200,000 cases
• 1000 products in nested
Observations:
Nested Cases  Training Time
Don’t forget to complete the on-line Session Feedback form on the Attendee Web site