Transcript
Data Mining Using IBM Intelligent Miner
Presented by:
Qiyan (Jennifer) Huang
Outline
• Introduction
• Mining Process
• Main Functionalities of Intelligent Miner
• Other Data Mining Products
• Data Mining and Privacy
• Summary
• References
What is Data Mining
• Data mining: discovering interesting patterns from large amounts of data
– Also known as: knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.
Evolution of Database Technology
• 1960s:
– Data collection, database creation
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s ~ present:
– RDBMS, advanced data models
• 1990s ~ 2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
Data Mining vs. Database Query
• Database query
– Find all customers who have purchased milk.
– Identify customers who have purchased more than $10,000 in the last month.
• Data mining
– Find all items which are frequently purchased with milk (association rules).
– Identify customers with similar buying habits (clustering).
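The contrast between a query and a mining task can be sketched in a few lines of Python. This is only an illustration: the `baskets` data and both function names are hypothetical and do not come from Intelligent Miner. The query filters on a condition known in advance, while the mining-style function counts which items co-occur with milk without being told what to look for.

```python
from collections import Counter

# Hypothetical purchase records: (customer, items bought in one basket).
baskets = [
    ("ann", {"milk", "bread"}),
    ("bob", {"milk", "bread", "eggs"}),
    ("cat", {"beer", "chips"}),
    ("dan", {"milk", "eggs"}),
]

def customers_who_bought(item):
    """Database-style query: an exact filter on a known condition."""
    return sorted({customer for customer, items in baskets if item in items})

def items_bought_with(item):
    """Mining-style question: count everything that co-occurs with the item."""
    counts = Counter()
    for _, items in baskets:
        if item in items:
            counts.update(items - {item})
    return counts.most_common()

print(customers_who_bought("milk"))
print(items_bought_with("milk"))
```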
Data Mining Process (KDD)
Databases → Data Cleaning → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001
About DB2 Intelligent Miner
• DB2 Intelligent Miner for Data: “focused on the large-scale mining, such as large volumes of data, parallel data mining on Windows NT, Sun Solaris, and OS/390” – IBM
Main Functionalities
• Cluster analysis
– Group the data that share similar trends and patterns
• Classification
– Predict the outcome based on historical data
• Association analysis
– Finding frequent patterns.
Classification
age      income   student   credit rating   buys computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
This follows an example from Quinlan’s ID3.
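As a rough illustration of how an ID3-style learner would choose its first split on this table, the sketch below computes the information gain of each attribute. It assumes the 13 rows exactly as they appear in the slide, writes the middle age band in ASCII as "31..40", and uses hypothetical helper names (`entropy`, `information_gain`); it is not the algorithm Intelligent Miner itself runs.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# The 13 training rows from the slide; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    base = entropy([row[-1] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print(best, gains)
```

On these rows the attribute with the largest gain is age, which matches the root of the decision tree shown later in the deck.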
Association
– Association rule: identifies relationships
– Example: “30% of customers buy shirts across all transactions, and 60% of these customers will also buy a tie”
• Confidence factor is 60%
• Support: the share of all transactions in which shirt and tie are bought together; with the figures above, 30% × 60% = 18%
• Lift = confidence / support of the rule head (tie); if ties appear in 30% of all transactions, lift = 60% / 30% = 2
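The three measures can be checked by hand on a small data set. The sketch below is only an illustration: the ten transactions are hypothetical (they are not the slide's figures) and are sized so that the rule shirt => tie has confidence 60% and lift 2. The function names are likewise assumptions, not an Intelligent Miner API.

```python
# Hypothetical transactions, sized so the ratios are easy to verify by hand.
transactions = [
    {"shirt", "tie"}, {"shirt", "tie"}, {"shirt", "tie"},
    {"shirt"}, {"shirt"},
    {"socks"}, {"socks"}, {"belt"}, {"belt"}, {"hat"},
]

def support(itemset):
    """Share of all transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(body, head):
    """Share of transactions containing the body that also contain the head."""
    return support(body | head) / support(body)

def lift(body, head):
    """Confidence divided by the support of the head on its own."""
    return confidence(body, head) / support(head)

body, head = {"shirt"}, {"tie"}
print(support(body | head), confidence(body, head), lift(body, head))
```

A lift above 1 means the head is bought more often together with the body than it is bought overall.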
Association
Support (%)   Confidence (%)   Type   Lift     Rule Body         Rule Head
5.5286        34.0800          +      2.7300   [203] + [1207]    => [1716]
7.0388        34.1300          +      2.7400   [203] + [1719]    => [1716]
5.4662        34.1700          +      2.7400   [202] + [802]     => [1716]
5.8805        34.3400          +      2.7500   [203] + [802]     => [1716]
5.0163        34.4900          +      2.7600   [203] + [705]     => [1716]
7.1279        34.7400          +      2.7800   [202] + [1718]    => [1716]
5.8226        34.7600          +      3.3900   [711] + [203]     => [710]
5.0697        34.8300          +      2.7400   [202] + [1702]    => [1703]
5.2836        34.8300          +      2.7400   [202] + [1207]    => [1703]
5.4350        34.9400          +      3.4100   [201] + [711]     => [710]
5.3459        35.0200          +      2.7600   [201] + [1702]    => [1703]
Data Mining Products
• More than 50 commercial data mining tools
• Wide range of pricing
– SAS Institute’s Enterprise Miner ~ $80k
– SPSS Inc. Clementine ~ $75k
– IBM Intelligent Miner ~ $60k
– Desktop products start at a few hundred dollars
Data Mining Products
Algorithm   IBM   SAS   SPSS
Neural Network: √ √ √
Decision Tree: √ √ √
Clustering: √ √
Association: √ √
Nearest Neighbour: √
Kohonen Self-Organizing Map: √ √
Data Mining Product Comparison on Algorithm
Data Mining & Privacy
• Release a limited subset of data
– Hide attributes that are potentially related to personal information
• Release encrypted data
• Audit to detect misuse of data
• Set up a data mining controller
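The first measure, releasing only a limited subset of the data, could look roughly like the sketch below. The field names, the choice of which attributes count as personal, and the function `release_subset` are all assumptions made for illustration.

```python
# Hypothetical field names; which attributes count as personal is an assumption.
PERSONAL_ATTRIBUTES = {"name", "address", "phone"}

def release_subset(records, hidden=PERSONAL_ATTRIBUTES):
    """Return copies of the records with the hidden attributes removed."""
    return [{k: v for k, v in r.items() if k not in hidden} for r in records]

customers = [
    {"name": "Ann", "phone": "555-0100", "age_band": "<=30", "income": "high"},
    {"name": "Bob", "phone": "555-0101", "age_band": ">40", "income": "low"},
]

print(release_subset(customers))
```

Dropping columns does not by itself rule out re-identification from the remaining attributes, which is one reason the slide lists auditing and a mining controller as further measures.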
Summary
• Introduction to Data Mining
• A KDD Data Mining Process
• Functionalities of Intelligent Miner
• Commercial Data Mining Tools
• Data Mining & Privacy
References
Angoss Whitepaper: http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003
C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996
D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools
Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003
IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003
J. F. Elder & D. W. Abbott. A Comparison of Leading Data Mining Tools. August, 1998
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000
http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10th, 2003
Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003
SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003
Evolution of Database Technology
• 1960s:
– Data collection, database creation, and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models
• 1990s ~ 2000s:
– Data mining and data warehousing, multimedia databases, and Web databases
Data Mining: On What Kind of Data?
• Data sources
– Relational database
– Data warehouses
– Transactional databases
– WWW
• Data types
– Audio
– Image
– Text
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
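Read as code, the tree is a short chain of conditions. The sketch below writes it as a function and checks it against the 13 training rows of the classification table. The function name is hypothetical, and the middle age band is written "31..40" to match the table (the tree figure labels it 30..40).

```python
def buys_computer(age, student, credit_rating):
    """Follow the decision tree from the root (age) down to a yes/no leaf."""
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31..40":
        return "yes"
    # Remaining branch: age > 40, decided by credit rating.
    return "no" if credit_rating == "excellent" else "yes"

# (age, income, student, credit_rating, buys_computer) rows from the table.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
]

mismatches = [
    row for row in ROWS
    if buys_computer(row[0], row[2], row[3]) != row[4]
]
print(len(mismatches))
```

Income never appears in the tree, so the function does not take it as a parameter.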
Neural network
[Figure: a single neuron. The input vector x = (x0, x1, …, xn) and the weight vector w = (w0, w1, …, wn) are combined into a weighted sum, which passes through the activation function f to give the output y.]

$$\mathrm{input}_i = \sum_{j=1}^{n} w_{ij}\,\mathrm{output}_j$$

$$\mathrm{output}_i = \frac{1}{1 + e^{-\mathrm{gain}\cdot\mathrm{input}_i}}$$
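A single unit of such a network can be sketched directly from the two steps on the slide: a weighted sum of the incoming values followed by a sigmoid activation scaled by a gain. The function name and the sample weights below are assumptions chosen for illustration.

```python
from math import exp

def neuron_output(weights, inputs, gain=1.0):
    """Weighted sum of the inputs, squashed into (0, 1) by a sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + exp(-gain * weighted_sum))

# Hypothetical weights and inputs: the positive and negative terms cancel,
# so the weighted sum is 0 and the sigmoid sits at its midpoint.
print(neuron_output([0.5, -0.5], [1.0, 1.0]))
```

A larger gain makes the sigmoid steeper around zero, so the unit behaves more like a hard threshold.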
Applications of Clustering
• Pattern recognition
• Image processing
• Economic science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Data Mining & Privacy
[Figure: a Mining Controller placed between the Data Mining Tool and the Data warehouse]
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
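To make the idea of grouping similar records concrete, here is a minimal one-dimensional k-means sketch applied to hypothetical claim costs, in the spirit of the insurance example. K-means is used only because it is short to write down; the slides do not say which clustering algorithm Intelligent Miner applies, and the data and function name are assumptions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid and re-averaging."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical average claim costs per policy holder.
claims = [10, 12, 11, 95, 100, 105]
centroids, clusters = k_means_1d(claims, centroids=[10, 105])
print(centroids, clusters)
```

The starting centroids are passed in explicitly so the run is deterministic; practical implementations usually pick them at random and repeat the run several times.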
Association
Association and pattern analysis
– Applications:
• Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
– Examples (rule [support, confidence]):
• buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]
• major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]
Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
– Object-oriented and object-relational databases
– Text databases and multimedia databases
– Heterogeneous and legacy databases
– WWW
Steps of a KDD Process
• Learning the application domain:
– relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
• Data reduction and transformation:
– Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
– summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
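The middle steps of the process (cleaning, selection, mining, evaluation) can be pictured as a chain of small functions. The sketch below is a toy illustration under stated assumptions: the records, the field names and all four function names are hypothetical, and "mining" is reduced to counting frequent values.

```python
from collections import Counter

# Hypothetical raw records; one has a missing value.
raw = [
    {"customer": "ann", "item": "milk", "store": "A"},
    {"customer": "bob", "item": None, "store": "A"},
    {"customer": "cat", "item": "milk", "store": "B"},
    {"customer": "dan", "item": "beer", "store": "B"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def mine(records, field):
    """Data mining (toy version): count how often each value occurs."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {value: n for value, n in patterns.items() if n >= min_count}

result = evaluate(mine(select(clean(raw), ["item"]), "item"), min_count=2)
print(result)
```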
Strength and Weakness
Strength
– Algorithm breadth
– Graphical output
– Available for PC and mainframe environment
Weakness
– No automation
– Data has to reside in IBM’s database system