Data Mining Input: Concepts, Instances, and Attributes
Source: cs.furman.edu/~ktreu/csc272/lectures/Chapter2.pdf

Transcript
9/1/2017
Data Mining Input: Concepts, Instances, and Attributes
Chapter 2 of Data Mining
Terminology
Components of the input:
Concepts: kinds of things that can be learned
Goal: intelligible and operational concept description
E.g.: “Under what conditions should we play?”
This concept is located somewhere in the input data
Instances: the individual, independent examples of a concept
Note: more complicated forms of input are possible
Attributes: measuring aspects of an instance
We will focus on nominal and numeric attributes
What is a concept?
Styles of learning:
Classification learning: understanding/predicting a discrete class
Association learning: detecting associations between features
Clustering: grouping similar instances into clusters
Numeric estimation: understanding/predicting a numeric quantity
Concept: the thing to be learned
Concept description: the output of the learning scheme
Classification learning
Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc.
Can you think of others?
Classification learning is supervised: the algorithm is provided with actual outcomes
Outcome is called the class attribute of the example
Measure success on fresh data for which class labels are known (test data, as opposed to training data)
In practice, success is often measured subjectively: how acceptable the learned description is to a human user
Association learning
Can be applied if no class is specified and any kind of structure is considered “interesting”
Difference from classification learning: Unsupervised
I.e., not told what to learn
Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
Hence: far more association rules than classification rules
Thus: constraints are necessary, e.g. minimum coverage and minimum accuracy
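To make these two constraints concrete, here is a minimal pure-Python sketch of computing a rule's coverage and accuracy (the toy dataset and the rule_stats helper are illustrative, not from the slides):

```python
# Toy weather-style dataset: each instance is a dict of attribute values.
data = [
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "overcast", "humidity": "high",   "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "play": "yes"},
    {"outlook": "sunny",    "humidity": "normal", "play": "yes"},
]

def rule_stats(instances, antecedent, consequent):
    """Coverage: number of instances matching the rule's antecedent.
    Accuracy: fraction of those that also match the consequent."""
    covered = [d for d in instances
               if all(d[a] == v for a, v in antecedent.items())]
    correct = [d for d in covered
               if all(d[a] == v for a, v in consequent.items())]
    return len(covered), len(correct) / len(covered)

# Rule: outlook = sunny and humidity = high => play = no
coverage, accuracy = rule_stats(data,
                                {"outlook": "sunny", "humidity": "high"},
                                {"play": "no"})
print(coverage, accuracy)  # 2 1.0
```

Requiring a minimum coverage and minimum accuracy filters the otherwise enormous space of candidate association rules.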
Clustering
Finding groups of items that are similar
Clustering is unsupervised
The class of an example is not known
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
Generating a flat file
Process of flattening is called “denormalization”
Several relations are joined together to make one
Possible with any finite set of finite relations
More on this in CSC-341
Problematic: relationships without pre-specified number of objects
“sister of” contains two objects
concept of nuclear-family may be unknown
combinatorial explosion in the flat file
Denormalization may produce spurious regularities that reflect structure of database
Example: “supplier” predicts “supplier address”
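The supplier example can be reproduced with the standard library's sqlite3 module (the table names and rows are invented for illustration): joining the two relations repeats the supplier's address on every matching row, so "supplier" trivially predicts "supplier address":

```python
import sqlite3

# Two normalized relations: orders reference suppliers by key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, supplier_id INTEGER, item TEXT);
    INSERT INTO supplier VALUES (1, 'Acme', '12 Elm St'), (2, 'Globex', '9 Oak Ave');
    INSERT INTO orders VALUES (10, 1, 'widget'), (11, 1, 'gear'), (12, 2, 'bolt');
""")

# Denormalize: join the relations into a single flat table of instances.
flat = con.execute("""
    SELECT orders.item, supplier.name, supplier.address
    FROM orders JOIN supplier ON orders.supplier_id = supplier.id
    ORDER BY orders.id
""").fetchall()
print(flat)
# Every 'Acme' row carries '12 Elm St': a regularity that merely
# reflects the database structure, not anything worth mining.
```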
Multi‐instance Concepts
Each individual example comprises a set of instances
multiple instances may relate to the same example
individual instances are not independent
All instances are described by the same attributes
One or more instances within an example may be responsible for its classification
Goal of learning is still to produce a concept description
Examples:
multi-day game activity (the weather data)
classification of computer users as experts or novices
response of users to multiple credit card promotions
performance of a student over multiple classes
What’s in an attribute?
Each instance is described by a fixed predefined set of features, its “attributes”
But: the number of attributes may vary in practice
Example: table of transportation vehicles
Possible solution: “irrelevant value” flag
Related problem: existence of an attribute may depend on value of another one
Example: “spouse name” depends on “married?”
Possible solution: methods of data reduction
Possible attribute types (“levels of measurement”): Nominal, ordinal, interval and ratio
Simplifies to nominal and numeric
Types of attributes
• Nominal attributes have values that are "names" of categories.
– there is a small set of possible values

attribute     possible values
Fever {Yes, No}
Diagnosis {Allergy, Cold, Strep Throat}
Outlook {sunny, overcast, raining}
• In classification learning, the output attribute is always nominal.
• “Nominal” comes from the Latin word for name.
• No relation is implied among nominal values
• No ordering or distance measure
• Can only test for equality
• Numeric attributes have values that come from a range of numbers.

attribute     possible values
Body Temp any value in 96.0‐106.0
Salary any value in $15,000‐250,000
– you can order their values (the definition of the “ordinal” type):
$210,000 > $125,000
98.6 < 101.3
Types of attributes
• What about this one?
attribute possible values
Product Type {0, 1, 2, 3}
• If numbers are used as IDs or names of categories, the corresponding attribute is actually nominal.
• Note that it doesn't make sense to order the values of such attributes.
– example: product type 2 > product type 1 doesn't have any meaning
• Also note that some nominal values can be ordinal:
– hot > mild > cool
– young < old
– freshman < sophomore < junior < senior
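Where a nominal attribute's values do have a meaningful order, as with class year above, the order can be made explicit in code; a small pure-Python sketch (the ORDER list and less_than helper are illustrative):

```python
# Unordered nominal values support only equality tests; values with a
# meaningful order can be compared through an explicit ranking.
ORDER = ["freshman", "sophomore", "junior", "senior"]
rank = {v: i for i, v in enumerate(ORDER)}

def less_than(a, b):
    """True if a comes before b in the declared ordering."""
    return rank[a] < rank[b]

print(less_than("freshman", "junior"))   # True
print(less_than("senior", "sophomore"))  # False
```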
Ordinal quantities
Impose order on values
But no distance between values defined
Example: attribute “temperature” in weather data
Values: “hot” > “mild” > “cool”
Note: addition and subtraction don’t make sense
Example rule: temperature < hot → play = yes
Distinction between nominal and ordinal not always clear (e.g. attribute “outlook” – is there an ordering?)
If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft

If age = pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft
Interval quantities
Interval quantities are not only ordered but measured in fixed numerical units
Example: attribute “year”
Difference of two values makes sense
Sum or product doesn’t make sense
Ratio quantities
Ratio quantities are those for which the measurement scheme defines a zero point
Example: attribute “distance”
Distance between an object and itself is zero
Ratio quantities are treated as real numbers
All mathematical operations are allowed
Attribute types used in practice
Most schemes accommodate just two levels of measurement:
nominal and numeric, by which we typically only mean ordinal
Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”
Ordinal attributes are also called “numeric”, or “continuous”
Preparing the input
Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, …)
Differences: styles of record keeping, conventions, time periods, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Missing values
Frequently indicated by out-of-range entries
E.g. -999, “?”
Types: unknown, unrecorded, irrelevant
Reasons: malfunctioning equipment
changes in experimental design
collation of different datasets
measurement not possible
user refusal to answer survey question
Missing value may have significance in itself (e.g. missing test in a medical examination)
Most schemes assume that is not the case: “missing” may need to be coded as additional value
Inaccurate values
Reason: data has not been collected for the purpose of mining
Result: errors and omissions that don’t affect the original purpose of the data but are critical to mining
E.g. age of customer in banking data
Typographical errors in nominal attributes: values need to be checked for consistency
Typographical, measurement, and rounding errors in numeric attributes: outliers need to be identified
What facility of Weka did we learn in lab that might be useful here?
Errors may be deliberate
E.g. wrong zip codes
Other problems: duplicates, stale data
Noise
• Noisy data is meaningless data
• The term has often been used as a synonym for corrupt data
• Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines
– unstructured text, for example
• Addressing these issues requires a process of data cleaning
• Also called pre-processing or (sometimes) data wrangling
Getting to know the data
Simple visualization tools are very useful
Nominal attributes: histograms
Q: Is the distribution consistent with background knowledge?
Numeric attributes: graphs
Q: Any obvious outliers?
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!
More complex data viz tools represent an entire subdiscipline of Computer Science
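The simple checks above can be sketched with nothing more than the standard library (the toy outlook/temperature lists and the two-standard-deviation outlier rule are illustrative choices, not from the slides):

```python
from collections import Counter
import statistics

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "sunny"]
temperature = [64, 68, 70, 71, 72, 75, 250]   # 250 looks like a data-entry error

# Nominal attribute: a quick text histogram of value counts
for value, count in Counter(outlook).most_common():
    print(f"{value:9s} {'*' * count}")

# Numeric attribute: flag values more than two standard deviations from the mean
mean, sd = statistics.mean(temperature), statistics.pstdev(temperature)
outliers = [t for t in temperature if abs(t - mean) > 2 * sd]
print("possible outliers:", outliers)
```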
The ARFF format
%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
Simple Discretization Methods
• What if we don't know which subranges make sense?
• Equal‐width binning divides the range of possible values into N subranges of the same size.
– bin width = (max value – min value) / N
– example: if the observed values are all between 0‐100, we could create 5 bins as follows:
width = (100 – 0)/5 = 20
bins: [0‐20], (20‐40], (40‐60], (60‐80], (80‐100]
– problems with this equal‐width approach?
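A minimal pure-Python sketch of equal-width binning (the function name and the choice to clamp the maximum value into the last bin are ours); run on the ten example values from the next slide, it also illustrates the usual problem with the equal-width approach: the bins get very uneven counts:

```python
def equal_width_bins(values, n_bins):
    """Divide the observed range into n_bins subranges of equal width
    and report which bin each value falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]

    def bin_index(v):
        # Clamp the maximum value into the last bin
        return min(int((v - lo) / width), n_bins - 1)

    return edges, [bin_index(v) for v in values]

values = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
edges, assigned = equal_width_bins(values, 5)
print(edges)     # [5.0, 23.0, 41.0, 59.0, 77.0, 95.0]
print(assigned)  # [0, 0, 0, 1, 3, 4, 4, 4, 4, 4] -- bin 2 is empty, bin 4 holds half the data
```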
Simple Discretization Methods (cont.)
• Equal-frequency or equal-height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
– example: let's say we have 10 training examples with the following values for the attribute that we're discretizing:
5, 7, 12, 35, 65, 82, 84, 88, 90, 95
to create 5 bins, we would divide up the range of values so that each bin holds 2 of the training examples
To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary.
final bins: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf)
– Problems with this approach?
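The halfway-boundary rule can be checked in code on the same ten values; a minimal pure-Python sketch (the helper name is ours):

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Place the same number of training values in each bin; each boundary
    is halfway between the neighbouring values on either side."""
    per_bin = len(sorted_values) // n_bins
    boundaries = []
    for i in range(1, n_bins):
        left = sorted_values[i * per_bin - 1]   # last value of the bin before the cut
        right = sorted_values[i * per_bin]      # first value of the bin after the cut
        boundaries.append((left + right) / 2)
    return boundaries

values = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_bins(values, 5))  # [9.5, 50.0, 83.0, 89.0]
```

These four boundaries reproduce the final bins shown above: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf).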
Other Discretization Methods
• Ideally, we'd like to come up with bins that capture distinctions that will be useful in data mining.
– example: if we're discretizing body temperature, we'd like the discretization method to learn that 98.6 F is an important boundary value
– more generally, we want to capture distinctions that will help us to learn to predict/estimate the class of an example
Other Discretization Methods
• Both equal-width and equal-frequency binning are considered unsupervised methods, because they don't take into account the class values of the training examples
• There are supervised methods for discretization that attempt to take the class values into account
– Minimum bucket size
Discretization in Weka
• In Weka, you can discretize an attribute by applying the appropriate filter to it
• After loading in the dataset in the Preprocess tab, click the Choose button in the Filter portion of the tab
• For equal-width or equal-height, you choose the Discretize option in the filters/unsupervised/attribute folder
– by default, it uses equal-width binning
– to use equal-frequency binning instead, click on the name of the filter and set the useEqualFrequency parameter to True
• For supervised discretization, choose the Discretize option in the filters/supervised/attribute folder
Nominal Attributes with Numeric Values
• Some attributes that use numeric values may actually be nominal attributes
– the attribute has a small number of possible values
– there is no ordering to the values, and you would never perform mathematical operations on them
– example: an attribute that uses numeric codes for medical diagnoses
• 1 = Strep Throat, 2 = Cold, 3 = Allergy
Nominal Attributes with Numeric Values
• If you load a comma‐separated‐value file containing such an attribute, Weka will assume that it is numeric
• To force Weka to treat an attribute with numeric values as nominal, use the NumericToNominal option in the filters/unsupervised/attribute folder
– click on the name of the filter, and enter the number(s) of the attributes you want to convert
• Or edit the ARFF file manually…
Handling Missing Values
• Options:
– Ignore them
• PRISM and ID3 won’t work at all
• Naïve Bayes handles them fine
• J48 and nearest neighbor use tricks to get around
– Remove all instances with missing attributes
• Unsupervised RemoveWithValues attribute filter in Weka
– Replace missing values with the most common value for that attribute
• Unsupervised ReplaceMissingValues attribute filter in Weka
• Only works with nominal values
• Issues?
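A sketch of the "replace with the most common value" option in plain Python (the replace_missing helper and the "?" marker are illustrative; Weka's ReplaceMissingValues filter is the real tool):

```python
from collections import Counter

def replace_missing(column, missing="?"):
    """Replace missing entries in a nominal column with the most common
    observed value in that column."""
    observed = [v for v in column if v != missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v == missing else v for v in column]

outlook = ["sunny", "?", "rainy", "sunny", "overcast", "?"]
print(replace_missing(outlook))
# ['sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny']
```

One issue this makes visible: every missing entry is filled with the same value, which can inflate the apparent frequency of the majority category.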
Handling Noisy Data
• Noise:
– random error or variance in a measured attribute
– outlier values
– more generally: non‐predictive values
• Combined computer and human inspection
– detect suspicious values and check by human
– data visualization the key tool
• Clustering
– detect and remove outliers
– also employs data viz
• Regression
– smooth by fitting the data into regression functions
• Binning
– employs techniques similar to discretizing
Simple Binning Method
• Sorted attribute values:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal‐depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin averages:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
• Note how smoothing mitigates a noisy/outlier value
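The two smoothing schemes above can be sketched in a few lines of Python (the helper names are ours); both reproduce the bin values shown on the slide:

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin average."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins hold sorted values
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```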
Data Reduction
• Problematic attributes include:
– irrelevant attributes: ones that don't help to predict the class
• despite their irrelevance, the algorithm may erroneously include them in the model
– attributes that cause overfitting
• example: a unique identifier such as Patient ID
– redundant attributes: those that offer basically the same information as another attribute
• example: in many problems, date-of-birth and age provide the same information
• some algorithms may end up giving the information from these attributes too much weight
Data Reduction
• We can remove an attribute manually in Weka by clicking the checkbox next to the attribute in the Preprocess tab and then clicking the Remove button
– How to determine which to remove?
• Experimentation
• Correlation analysis (filters in Weka)
• Undoing preprocess actions:
– In the Preprocess tab, the Undo button allows you to undo actions that you perform, including:
• applying a filter to a dataset
• manually removing one or more attributes
– If you apply two filters without using Undo in between the two, the second filter will be applied to the results of the first filter
– Undo can be pressed multiple times to undo a sequence of actions
Dividing Up the Data File
• To allow us to validate the model(s) learned in data mining, we'll divide the examples into two files:
– n% for training
– (100 – n)% for testing: these should not be touched until you have finalized your model or models
– possible splits:
• 67/33
• 80/20
• 90/10
• Alternative to ten‐fold cross validation when you have a sufficiently large dataset
Dividing Up the Data File
• You can use Weka to split the dataset for you after you perform whatever reformatting/editing is needed
• If you discretize one or more attributes, you need to do so before you divide up the data file
– otherwise, the training and test sets will be incompatible
Dividing Up the Data File (cont.)
• Here's one way to do it in Weka:
1) shuffle the examples by choosing the Randomize filter from the filters/unsupervised/instance folder
2) save the entire file of shuffled examples in Arff format.
3) use the RemovePercentage filter from the same folder to remove some percentage of the examples
• whatever percentage you're using for the training set
• click on the name of the filter to set the percentage
4) save the remaining examples in a new file
• this will be our test data
5) load the full file of shuffled examples back into Weka
6) use RemovePercentage again with the same percentage as before, but set invertSelection to True
7) save the remaining examples in a new file
• this will be our training data
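Outside Weka, the same shuffle-then-split recipe is a few lines of Python (the function name and the fixed seed are illustrative choices):

```python
import random

def split_dataset(examples, train_fraction=0.67, seed=42):
    """Shuffle, then split into train/test sets: a plain-Python version of
    the Randomize + RemovePercentage recipe described above."""
    shuffled = examples[:]                 # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = split_dataset(data, train_fraction=0.8)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split reproducible, which matters when you need the same train/test division across experiments.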