Information Technology
Madrid JUG - Data Mining with Weka (Minería de Datos sobre Weka)
May 9, 2013
Jose María Gómez Hidalgo (@jmgomez)
Guillermo Santos García (@gsantosgo)
INDEX
1.1 Knowledge Based System vs. Machine Learning System
2. Data Mining Process
2.1.3 The Top Ten Algorithms in Data Mining
3.1 WEKA (Waikato Environment for Knowledge Analysis)
3.2 R (#RStats)
4.1 Predicting House Prices
4.2 Lending Club
4.3 Spam or Ham Email
6.1 Random Subsampling
A.1 What is a DATASET?
A.2 Types of variables
1.1 Knowledge Based System vs. Machine Learning System
Knowledge Based System (Expert System)
- Rules are coded manually to represent knowledge.
- Experts are required (an expert is a person with extensive knowledge of the domain).
- Cost.
Expert system example (credit expert system):
If (Annual Income > 3 * Annual Debt) Then CREDIT = YES
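
The credit rule above can be written as a small function, which shows what "coding rules manually" means in practice. This is only a sketch: the function name is hypothetical, and the "NO" outcome is an assumption, since the rule as stated only says when credit is granted.

```python
def credit_decision(annual_income, annual_debt):
    """Hand-coded expert rule: grant credit when income exceeds 3x debt.

    The rule in the text only defines the YES case; returning "NO"
    otherwise is an assumption made for this sketch.
    """
    if annual_income > 3 * annual_debt:
        return "YES"
    return "NO"


if __name__ == "__main__":
    # Hypothetical applicants, for illustration only.
    print(credit_decision(40000, 10000))
    print(credit_decision(30000, 10000))
```

Every new criterion would have to be added by hand in the same way, which is where the dependence on experts comes from.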
Machine Learning System
Aim: to build programs capable of generalizing behavior from weakly structured information.
2.1.3 The Top Ten Algorithms in Data Mining
IEEE International Conference on Data Mining (ICDM). http://www.cs.uvm.edu/~icdm/
The most influential algorithms used in the Data Mining Community.
1. C4.5 (Decision Tree).
2. K-Means.
3. Support Vector Machine (SVM). Noted for its strong generalization ability.
4. Apriori. Finds frequent itemsets in a transaction dataset and derives association rules.
5. EM (Expectation-Maximization). Used in pattern recognition.
6. PageRank. Link-based ranking algorithm, which also powers the Google search engine.
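
To make the Apriori entry concrete, here is a minimal sketch of its frequent-itemset step: candidates of size k are built from frequent itemsets of size k-1, pruned, and then counted. The function name, the transactions, and the support threshold are all hypothetical, and deriving association rules from the result is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = list(current)
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count how many transactions contain each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    # Hypothetical shopping baskets, for illustration only.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    for itemset, count in apriori(baskets, min_support=2).items():
        print(sorted(itemset), count)
```

The prune step is what keeps the search tractable: an itemset cannot be frequent if any of its subsets is infrequent, so such candidates are discarded before counting.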
3.6 Polls
3.6.1 What programming/statistics languages you used for analytics / data mining in the past 12
3.6.2 What Analytics, Data mining, Big Data software you used in the past 12 months for a real
4.5 Human Activity Recognition using Smartphones
We used data obtained from the accelerometer and gyroscope sensor signals of the smartphones:
- 3-axial linear acceleration
- 3-axial angular velocity
We can monitor acceleration, positions, rotation and angular motion.
A.1 What is a DATASET?
Example: Dataset email50
A row represents a case, a unit of observation, an observational unit, an instance. Rows are also called observations, examples or exemplars.
A column represents an attribute, a variable, a feature (a characteristic of the case).
A special column is the class, or class label (two-valued or multi-valued).
For example, email 4, which is not spam, contains 2454 characters and 61 line breaks, is written in text format (0 = text, 1 = HTML), and contains only small numbers.
Variable      Description
spam          Specifies whether the message was spam
num_char      The number of characters in the email
line_breaks   The number of line breaks in the email (not including text wrapping)
Format        Indicates if the email contained special formatting, such as bolding, tables or links, which would indicate the message is in HTML format
Number        Indicates whether the email contained no number, a small number (under 1 million) or a large number
A dataset is represented as a data matrix (data frame). Each row of a data matrix corresponds to a unique case (example), and each column corresponds to a variable.
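
As a rough illustration of the row/column layout, the sketch below stores the one case described in the text (email 4) as a row and looks up a cell by case and variable. The helper name and the value encodings are assumptions: 0 for "not spam", 0 for text format as stated above, and the label "small" for the Number variable. The other rows of email50 are not given in the text and are not reproduced.

```python
# Column (variable) names, taken from the description table.
columns = ["spam", "num_char", "line_breaks", "Format", "Number"]

# One row per case. Only email 4 is described in the text, so it is
# the only row here. Encodings are assumptions made for this sketch:
# spam 0 = not spam, Format 0 = text, Number "small" = small number.
rows = {
    4: [0, 2454, 61, 0, "small"],
}


def value(case_id, variable):
    """Return one cell of the data matrix: row = case, column = variable."""
    return rows[case_id][columns.index(variable)]


if __name__ == "__main__":
    print(value(4, "num_char"))
    print(value(4, "line_breaks"))
```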
A.2 Types of variables