Page Number:
1
Issues in Data Mining Infrastructure
Data Mining in a Nutshell
Uncovering the hidden knowledge
Huge NP-complete search space
Multidimensional interface
NOTICE:
All trademarks and service marks mentioned in this document are marks of their respective owners. Furthermore, the CRISP-DM consortium (NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA), and OHRA Verzekeringen en Bank Groep B.V. (The Netherlands)) has permitted presentation of their process model.
A Problem …
You are a marketing manager for a cellular phone company
Problem: Churn is too high
Winning back a customer who has quit is both difficult and expensive
Giving a new telephone to everyone whose contract is expiring is expensive
You pay a sales commission of $250 per contract
Customers receive a free phone (cost: $125)
Turnover (after a contract expires) is 40%
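The economics behind this problem can be sketched numerically. Only the per-contract figures ($250 commission, $125 phone, 40% churn) come from the slide; the customer-base size and the share of customers a predictive model would flag are hypothetical assumptions for illustration.

```python
# Back-of-envelope churn economics.
# Figures from the slide: $250 commission, $125 phone, 40% turnover.
# The base size (10,000) and flagged share (50%) are hypothetical.
customers = 10_000       # hypothetical base whose contracts expire
phone_cost = 125         # free phone offered to retain a customer, $
commission = 250         # sales commission per renewed contract, $
churn_rate = 0.40        # 40% turnover after contract expiry

# Strategy A: give everyone a new phone on renewal.
cost_blanket = customers * (phone_cost + commission)

# Strategy B: only approach customers a model predicts will churn;
# assume the model flags half the base (hypothetical).
flagged = int(customers * 0.50)
cost_targeted = flagged * (phone_cost + commission)

print(cost_blanket, cost_targeted)
```

Under these assumptions, targeting halves the retention spend; the saving grows as the model flags fewer, better-chosen customers.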
… A Solution
Three months before a contract expires, predict which customers will leave
If you want to keep a customer that is predicted to churn, offer them a new phone
Customers who are not predicted to churn need no attention
If you don’t want to keep the customer, do nothing
How can you predict future behavior?
Tarot Cards?
Magic Ball?
Data Mining?
Still Skeptical?
The Definition
The automated extraction of predictive information from (large) databases
– Automated
– Extraction
– Predictive
– Databases
History of Data Mining
Repetition in Solar Activity
1613 – Galileo Galilei
1859 – Heinrich Schwabe
The Return of Halley's Comet
Observed returns: 239 BC, …, 1531, 1607, 1682, 1910, 1986
Predicted next return: 2061
Edmond Halley (1656–1742)
Data Mining is Not
Data warehousing
Ad-hoc query/reporting
Online Analytical Processing (OLAP)
Data visualization
Data Mining is
Automated extraction of predictive information from various data sources
A powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines
Data Mining can
Answer questions that were too time-consuming to resolve in the past
Predict future trends and behaviors, allowing us to make proactive, knowledge-driven decisions
Data Mining Models
If balance > 100,000 then confidence = HIGH & weight = 1.7
If balance > 25,000 and status = married then confidence = HIGH & weight = 2.3
If balance < 40,000 then confidence = LOW & weight = 1.9
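Rules of this form can be evaluated directly against customer records. The sketch below applies the three example rules; the scheme for combining them (summing weights per predicted confidence level and taking the maximum) is an illustrative assumption, not a method prescribed by the slides.

```python
# Evaluate the weighted rules from the slide against a record.
# The vote-combination scheme is an illustrative assumption.
rules = [
    (lambda r: r["balance"] > 100_000, "HIGH", 1.7),
    (lambda r: r["balance"] > 25_000 and r["status"] == "married", "HIGH", 2.3),
    (lambda r: r["balance"] < 40_000, "LOW", 1.9),
]

def score(record):
    """Sum the weights of all firing rules per confidence level,
    then return the level with the largest accumulated weight."""
    votes = {}
    for condition, confidence, weight in rules:
        if condition(record):
            votes[confidence] = votes.get(confidence, 0.0) + weight
    return max(votes, key=votes.get) if votes else "UNKNOWN"

# Both rule 2 (HIGH, 2.3) and rule 3 (LOW, 1.9) fire; HIGH wins.
print(score({"balance": 30_000, "status": "married"}))
```

Note that the rules can conflict (a married customer with a $30,000 balance matches both a HIGH and a LOW rule), which is exactly why each rule carries a weight.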
K-nearest Neighbor and Memory-Based Reasoning (MBR)
Uses knowledge of previously solved, similar problems to solve the new problem
Assigns a new case to the class to which most of its k "neighbors" belong
First step – finding a suitable distance measure between cases in the data
+ Easy handling of non-standard data types
- Huge models
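The two steps above (a distance measure, then a majority vote among the k nearest stored cases) can be sketched in a few lines. Euclidean distance and the toy training points are illustrative choices, not part of the original slides.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.
    `train` is a list of (feature_vector, label) pairs; Euclidean
    distance is one common choice of measure, used here to illustrate."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data: two clusters, classes "A" and "B".
train = [((1, 1), "A"), ((1, 2), "A"),
         ((6, 6), "B"), ((7, 7), "B"), ((6, 7), "B")]
print(knn_classify(train, (2, 2)))  # "A": 2 of the 3 nearest are class A
```

This also makes the listed drawback concrete: the "model" is the entire training set kept in memory, which is why MBR models are huge.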
Data Mining Algorithms
Logistic regression
Discriminant analysis
Generalized Additive Models (GAM)
Genetic algorithms
The Apriori algorithm
Etc…
Many other available models and algorithms
Many application specific variations of known models
Final implementation usually involves several techniques
The Apriori Algorithm
The task – mining association rules by finding large itemsets and translating them into the corresponding association rules;
A → B, or A1 ∧ A2 ∧ … ∧ Am → B1 ∧ B2 ∧ … ∧ Bn, where A ∩ B = ∅
The terminology:
– Confidence
– Support
– k-itemset – a set of k items;
– Large itemsets – the large itemset {A, B} corresponds to the following rules (implications): A → B and B → A;
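Support and confidence have direct computational definitions: support of an itemset is the fraction of transactions containing it, and confidence of A → B is support(A ∪ B) / support(A). A minimal sketch, using the four-transaction database from the example that follows:

```python
# Support and confidence over a transaction database.
# Transactions taken from the TID example later in the slides.
transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"B", "E"}))       # 0.75: {B,E} appears in 3 of 4 transactions
print(confidence({"B"}, {"E"}))  # 1.0: every transaction containing B has E
```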
The Apriori Algorithm
The ⋈ operator definition
– n = 1: S2 = S1 ⋈ S1 = {{A}, {B}, {C}} ⋈ {{A}, {B}, {C}} = {{AB}, {AC}, {BC}}
– n = k: Sk+1 = Sk ⋈ Sk = {X ∪ Y | X, Y ∈ Sk, |X ∩ Y| = k−1}
– X and Y must have the same number of elements, and must have exactly k−1 identical elements;
– Every k-element subset of any resulting set element (an element is actually a k+1-element set) has to belong to the original set of itemsets;
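The join-and-prune operator above can be written compactly: join itemsets sharing k−1 items, then discard any candidate that has a k-subset missing from the original set. A minimal sketch:

```python
from itertools import combinations

def apriori_gen(Lk):
    """The join-and-prune operator described above: join k-itemsets
    sharing k-1 items, then prune candidates with a k-subset not in Lk."""
    k = len(next(iter(Lk)))
    # Join step: X ∪ Y for all X, Y with |X ∩ Y| = k - 1.
    joined = {x | y for x in Lk for y in Lk if len(x & y) == k - 1}
    # Prune step: every k-element subset must itself be in Lk.
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

# Reproduces the n = 1 example: S1 ⋈ S1 for S1 = {{A}, {B}, {C}}.
L1 = {frozenset("A"), frozenset("B"), frozenset("C")}
print(sorted("".join(sorted(c)) for c in apriori_gen(L1)))  # ['AB', 'AC', 'BC']
```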
The Apriori Algorithm
Example:
TID | elements
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E
The Apriori Algorithm
Step 1 – generate a candidate set of 1-itemsets, C1
– Every possible 1-element set from the database is potentially a large itemset, because we don't know the number of its appearances in the database in advance (a priori);
– The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set;
– Now, we scan the entire database to count the number of appearances of each of these elements (i.e. one-element sets);
The Apriori Algorithm
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3
The Apriori Algorithm
Step 2 – generate a set of large 1-itemsets, L1
– Each element in C1 with support that exceeds some adopted minimum support (for example, 50%) becomes a member of L1; we can omit D in further steps (if D doesn't have enough support alone, there is no way it could satisfy the requested support in combination with some other element(s));
{A} 2
{B} 3
{C} 3
{D} 1 (dropped)
{E} 3
The Apriori Algorithm
Step 3 – generate a candidate set of large 2-itemsets, C2
– C2 = L1 ⋈ L1 = {{AB}, {AC}, {AE}, {BC}, {BE}, {CE}}
– Count the corresponding appearances;
Step 4 – generate a set of large 2-itemsets, L2
– Eliminate the candidates without minimum support;
Step 5 (C3)
– Why not {ABC} and {ACE}? Because their 2-element subsets {AB} and {AE} are not elements of the large 2-itemset set L2 (the calculation is made according to the ⋈ operator definition);
Step 6 (L3)
– L3 = {{BCE}}, since {BCE} satisfies the required support of 50% (two appearances); there can be no further steps in this particular case, because L3 ⋈ L3 = ∅;
Answer = L1 ∪ L2 ∪ L3;
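The whole walkthrough can be reproduced end to end on the four-transaction example, with minimum support 50% (at least 2 appearances), using the same join-and-prune operator. A self-contained sketch:

```python
from itertools import combinations

# Full Apriori run on the TID example, minimum support 50% (>= 2 of 4).
transactions = [{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

# L1: all single items with enough support (D is dropped, count 1).
items = sorted(set().union(*transactions))
L = [{frozenset([i]) for i in items if count(frozenset([i])) >= minsup}]

while L[-1]:
    k = len(next(iter(L[-1])))
    # Join: k-itemsets sharing k-1 items; prune: all k-subsets large.
    Ck = {x | y for x in L[-1] for y in L[-1] if len(x & y) == k - 1}
    Ck = {c for c in Ck
          if all(frozenset(s) in L[-1] for s in combinations(c, k))}
    L.append({c for c in Ck if count(c) >= minsup})
L.pop()  # drop the final empty level

for level in L:
    print(sorted("".join(sorted(s)) for s in level))
# ['A', 'B', 'C', 'E']
# ['AC', 'BC', 'BE', 'CE']
# ['BCE']
```

The output matches the slides: {ABC} and {ACE} are pruned before counting because {AB} and {AE} are not in L2, and the run stops after L3 = {{BCE}}.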
The Apriori Algorithm
L1 = {large 1-itemsets};
for (k = 2; Lk−1 ≠ ∅; k++) do begin
   Ck = apriori-gen(Lk−1);
   forall transactions t ∈ D do
      forall candidates c ∈ Ck contained in t do c.count++;
   Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;
Enhancements to the basic algorithm – scan reduction
– The most time-consuming operation in the Apriori algorithm is the database scan; it is originally performed after each candidate set generation, to determine the frequency of each candidate in the database;
– Scan number reduction – counting candidates of multiple sizes in one pass;
– Rather than counting only candidates of size k in the k-th pass, we can also calculate the candidates C′k+1, where C′k+1 is generated from Ck (instead of Lk), using the ⋈ operator;
The Apriori Algorithm
– Compare: C′k+1 = Ck ⋈ Ck and Ck+1 = Lk ⋈ Lk
– Note that C′k+1 ⊇ Ck+1
– This variation can pay off in later passes, when the cost of counting and keeping in memory the additional C′k+1 − Ck+1 candidates becomes less than the cost of scanning the database;
– There has to be enough space in main memory for both Ck and C′k+1;
– Following this idea, we can make a further scan reduction:
• C′k+1 is calculated from Ck for k > 1;
• There must be enough memory space for all Ck's (k > 1);
– Consequently, only two database scans need to be performed (the first to determine L1, and the second to determine all the other Lk's);
The Apriori Algorithm
Abstraction levels
– Higher-level associations are stronger (more powerful), but also less certain;
– A good practice would be adopting different thresholds for different abstraction levels (higher thresholds for higher levels of abstraction);
[Table: evolutionary steps of data mining, by business question]
Examples of DM projects to stimulate your imagination
Here are six examples of how data mining is helping corporations to operate more efficiently and profitably in today's business environment:
– Targeting a set of consumers who are most likely to respond to a direct mail campaign
– Predicting the probability of default for consumer loan applications
– Reducing fabrication flaws in VLSI chips
– Predicting audience share for television programs
– Predicting the probability that a cancer patient will respond to radiation therapy
– Predicting the probability that an offshore oil well is actually going to produce oil
Comparison of fourteen DM tools
Evaluated by four undergraduates inexperienced at data mining, a relatively experienced graduate student, and a professional data mining consultant
Run under MS Windows 95, MS Windows NT, or Macintosh System 7.5
Each uses one of four technologies: Decision Trees, Rule Induction, Neural Networks, or Polynomial Networks
Solve two binary classification problems, a multi-class classification problem, and a noiseless estimation problem
Prices range from $75 to $25,000
Comparison of fourteen DM tools
The Decision Tree products were: CART, Scenario, See5, S-Plus
The Rule Induction tools were: WizWhy, DataMind, DMSK
Neural Networks were built from three programs: NeuroShell2, PcOLPARS, PRW
The Polynomial Network tools were: ModelQuest Expert, Gnosis (a module of NeuroShell2), KnowledgeMiner
Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:
Capability measures what a desktop tool can do, and how well it does it
Rating scale: + excellent capability; good capability; − some capability; "blank" no capability
Criteria for evaluating DM tools
Learnability/Usability shows how easy a tool is to learn and use
– Tutorials
– Wizards
– Easy to learn
– User's manual
– Online help
– Interface
Criteria for evaluating DM tools
Interoperability shows a tool's ability to interface with other computer applications
– Importing data
– Exporting data
– Links to other applications
Flexibility
– Model adjustment flexibility
– Customizable work environment
– Ability to write or change code
Data Input & Output Model
Rating scale: + excellent capability; good capability; − some capability; "blank" no capability
A classification of data sets
Pima Indians Diabetes data set
– 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
– 8 attributes plus the binary class variable for diabetes per instance
Wisconsin Breast Cancer data set
– 699 instances of breast tumors, some of which are malignant, most of which are benign
– 10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
– 214 instances of glass collected during crime investigations
– 10 attributes plus the multi-class output variable per instance
Moon Cannon data set
– 300 solutions to the equation: x = 2v² sin(g) cos(g) / g
– the data were generated without adding noise
Evaluation of fourteen DM tools
Potentials of R&D in Cooperation with U. of Belgrade
Nebojsa Uskokovic and Fred Darnell
• isItWorking.com
Testing the Infrastructure for EBI
Phones, Faxes, Email, Web links, Servers, Routers, Software
• Statistics
• Correlation
• Innovation
CNUCE: Integration and Datamining on Ad-Hoc Networks and the Internet
Veljko Milutinović,
Luca Simoncini, and Enrico Gregory
*University of Pisa, Santanna, CNUCE
[Diagram: DM over GSM, ad-hoc networks, and the Internet]
Genetic Search with Spatial/Temporal Mutations
Jelena Mirković, Dragana Cvetković, and Veljko Milutinović
*Comshare
Drawbacks of INDEX-BASED: time to index + ranking
Advantages of LINKS-BASED: mission-critical applications + customer-tuned ranking
Provider
Well-organized markets: best-first search
If elements of disorder: G w DB mutations
Chaotic markets: G w S/T mutations
e-Banking on the Internet
Miloš Kovačević, Bratislav Milić, Veljko Milutinović, Marco Gori, and Roberto Giorgi
*University of Siena
Bottleneck #1: Searching for Clients and Investments
1472++
*University of Siena + Banco di Monte dei Paschi
WaterMarking for e-Banking on the Internet
Darko Jovic, Ivana Vujovic, Veljko Milutinovic
Fraunhofer IPSI, Darmstadt, Germany
Bottleneck #1: SpeedUp
SSGRR: Organizing Conferences via the Internet
Zoran Horvat, Nataša Kukulj, Vlada Stojanović,
Dušan Dingarac, Marjan Mihanović, Miodrag Stefanović,
Veljko Milutinović, and Frederic Patricelli
*SSGRR, L’Aquila
2000: Arno Penzias
2001: Bob Richardson
2002: Jerry Friedman
2003: Harry Kroto
http://www.ssgrr.it
Summary
Books with Nobel Laureates:
– Kenneth Wilson, Ohio (North-Holland)
– Leon Cooper, Brown (Prentice-Hall)
– Robert Richardson, Cornell (Kluwer Academic)
– Herb Simon (Kluwer Academic)
– Jerome Friedman, MIT (IOS Press)
– Harold Kroto (IOS Press)
– Arno Penzias (IOS Press)