Data Mining Input: Concepts, Instances, and Attributes
Source: cs.furman.edu/~ktreu/csc272/lectures/Chapter2.pdf

Transcript
9/1/2017
Data Mining Input: Concepts, Instances, and Attributes
Chapter 2 of Data Mining
Terminology
Components of the input:
Concepts: kinds of things that can be learned
Goal: intelligible and operational concept description
E.g.: “Under what conditions should we play?”
This concept is located somewhere in the input data
Instances: the individual, independent examples of a concept
Note: more complicated forms of input are possible
Attributes: measuring aspects of an instance
We will focus on nominal and numeric attributes
What is a concept?
Styles of learning:
Classification learning: understanding/predicting a discrete class
Association learning: detecting associations between features
Clustering: grouping similar instances into clusters
Numeric estimation: understanding/predicting a numeric quantity
Concept: the thing to be learned
Concept description: the output of the learning scheme
Classification learning
Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc.
Can you think of others?
Classification learning is supervised: the algorithm is provided with actual outcomes
Outcome is called the class attribute of the example
Measure success on fresh data for which class labels are known (test data, as opposed to training data)
In practice, success is often measured subjectively: how acceptable the learned description is to a human user
Association learning
Can be applied if no class is specified and any kind of structure is considered “interesting”
Difference from classification learning: Unsupervised
I.e., not told what to learn
Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
Hence: far more association rules than classification rules
Thus: constraints are necessary, e.g. minimum coverage and minimum accuracy
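To make these two constraints concrete, here is a minimal pure-Python sketch of computing a rule's coverage and accuracy (the toy dataset and the rule_stats helper are illustrative, not from the slides):

```python
# Toy weather-style dataset: each instance is a dict of attribute values.
data = [
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "sunny",    "humidity": "high",   "play": "no"},
    {"outlook": "overcast", "humidity": "high",   "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "play": "yes"},
    {"outlook": "sunny",    "humidity": "normal", "play": "yes"},
]

def rule_stats(instances, antecedent, consequent):
    """Coverage: number of instances matching the rule's antecedent.
    Accuracy: fraction of those that also match the consequent."""
    covered = [d for d in instances
               if all(d[a] == v for a, v in antecedent.items())]
    correct = [d for d in covered
               if all(d[a] == v for a, v in consequent.items())]
    return len(covered), len(correct) / len(covered)

# Rule: outlook = sunny and humidity = high => play = no
coverage, accuracy = rule_stats(data,
                                {"outlook": "sunny", "humidity": "high"},
                                {"play": "no"})
print(coverage, accuracy)  # 2 1.0
```

Requiring a minimum coverage and minimum accuracy filters the otherwise enormous space of candidate association rules.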
Clustering
Finding groups of items that are similar
Clustering is unsupervised
The class of an example is not known
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
Generating a flat file
Process of flattening is called “denormalization”
Several relations are joined together to make one
Possible with any finite set of finite relations
More on this in CSC-341
Problematic: relationships without pre-specified number of objects
“sister of” contains two objects
concept of nuclear-family may be unknown
combinatorial explosion in the flat file
Denormalization may produce spurious regularities that reflect structure of database
Example: “supplier” predicts “supplier address”
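The supplier example can be reproduced with the standard library's sqlite3 module (the table names and rows are invented for illustration): joining the two relations repeats the supplier's address on every matching row, so "supplier" trivially predicts "supplier address":

```python
import sqlite3

# Two normalized relations: orders reference suppliers by key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE supplier (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY, supplier_id INTEGER, item TEXT);
    INSERT INTO supplier VALUES (1, 'Acme', '12 Elm St'), (2, 'Globex', '9 Oak Ave');
    INSERT INTO orders VALUES (10, 1, 'widget'), (11, 1, 'gear'), (12, 2, 'bolt');
""")

# Denormalize: join the relations into a single flat table of instances.
flat = con.execute("""
    SELECT orders.item, supplier.name, supplier.address
    FROM orders JOIN supplier ON orders.supplier_id = supplier.id
    ORDER BY orders.id
""").fetchall()
print(flat)
# Every 'Acme' row carries '12 Elm St': a regularity that merely
# reflects the database structure, not anything worth mining.
```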
Multi‐instance Concepts
Each individual example comprises a set of instances
multiple instances may relate to the same example
individual instances are not independent
All instances are described by the same attributes
One or more instances within an example may be responsible for its classification
Goal of learning is still to produce a concept description
Examples:
multi-day game activity (the weather data)
classification of computer users as experts or novices
response of users to multiple credit card promotions
performance of a student over multiple classes
What’s in an attribute?
Each instance is described by a fixed predefined set of features, its “attributes”
But: the number of attributes may vary in practice
Example: table of transportation vehicles
Possible solution: “irrelevant value” flag
Related problem: existence of an attribute may depend on value of another one
Example: “spouse name” depends on “married?”
Possible solution: methods of data reduction
Possible attribute types (“levels of measurement”): Nominal, ordinal, interval and ratio
Simplifies to nominal and numeric
Types of attributes
• Nominal attributes have values that are "names" of categories.
– there is a small set of possible values

attribute     possible values
Fever {Yes, No}
Diagnosis {Allergy, Cold, Strep Throat}
Outlook {sunny, overcast, raining}
• In classification learning, the output attribute is always nominal.
• “Nominal” comes from the Latin word for name.
• No relation is implied among nominal values
• No ordering or distance measure
• Can only test for equality
• Numeric attributes have values that come from a range of numbers.

attribute     possible values
Body Temp any value in 96.0‐106.0
Salary any value in $15,000‐250,000
– you can order their values (the definition of the “ordinal” type):
$210,000 > $125,000
98.6 < 101.3
Types of attributes
• What about this one?
attribute possible values
Product Type {0, 1, 2, 3}
• If numbers are used as IDs or names of categories, the corresponding attribute is actually nominal.
• Note that it doesn't make sense to order the values of such attributes.
– example: product type 2 > product type 1 doesn't have any meaning
• Also note that some nominal values can be ordinal:
– hot > mild > cool
– young < old
– freshman < sophomore < junior < senior
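Where a nominal attribute's values do have a meaningful order, as with class year above, the order can be made explicit in code; a small pure-Python sketch (the ORDER list and less_than helper are illustrative):

```python
# Unordered nominal values support only equality tests; values with a
# meaningful order can be compared through an explicit ranking.
ORDER = ["freshman", "sophomore", "junior", "senior"]
rank = {v: i for i, v in enumerate(ORDER)}

def less_than(a, b):
    """True if a comes before b in the declared ordering."""
    return rank[a] < rank[b]

print(less_than("freshman", "junior"))   # True
print(less_than("senior", "sophomore"))  # False
```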
Ordinal quantities
Impose order on values
But no distance between values defined
Example: attribute “temperature” in weather data
Values: “hot” > “mild” > “cool”
Note: addition and subtraction don’t make sense
Example rule: temperature < hot → play = yes
Distinction between nominal and ordinal not always clear (e.g. attribute “outlook” – is there an ordering?)
If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft

If age = pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft
Interval quantities
Interval quantities are not only ordered but measured in fixed numerical units
Example: attribute “year”
Difference of two values makes sense
Sum or product doesn’t make sense
Ratio quantities
Ratio quantities are those for which the measurement scheme defines a zero point
Example: attribute “distance”
Distance between an object and itself is zero
Ratio quantities are treated as real numbers
All mathematical operations are allowed
Attribute types used in practice
Most schemes accommodate just two levels of measurement:
nominal and numeric, by which we typically only mean ordinal
Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”
Ordinal attributes are also called “numeric”, or “continuous”
Preparing the input
Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, …)
Differences: styles of record keeping, conventions, time periods, primary keys, errors
Data must be assembled, integrated, cleaned up
“Data warehouse”: consistent point of access
External data may be required (“overlay data”)
Missing values
Frequently indicated by out-of-range entries
E.g. -999, “?”
Types: unknown, unrecorded, irrelevant
Reasons: malfunctioning equipment
changes in experimental design
collation of different datasets
measurement not possible
user refusal to answer survey question
Missing value may have significance in itself (e.g. missing test in a medical examination)
Most schemes assume that is not the case: “missing” may need to be coded as additional value
Inaccurate values
Reason: data has not been collected for the purpose of mining
Result: errors and omissions that don’t affect the original purpose of the data but are critical to mining
E.g. age of customer in banking data
Typographical errors in nominal attributes: values need to be checked for consistency
Typographical, measurement, and rounding errors in numeric attributes: outliers need to be identified
What facility of Weka did we learn in lab that might be useful here?
Errors may be deliberate
E.g. wrong zip codes
Other problems: duplicates, stale data
Noise
• Noisy data is meaningless data
• The term has often been used as a synonym for corrupt data
• Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines
– unstructured text, for example
• Addressing these issues requires a process of data cleaning
• Also called pre-processing or (sometimes) data wrangling
Getting to know the data
Simple visualization tools are very useful
Nominal attributes: histograms
Q: Is the distribution consistent with background knowledge?
Numeric attributes: graphs
Q: Any obvious outliers?
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!
More complex data viz tools represent an entire subdiscipline of Computer Science
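The simple checks above can be sketched with nothing more than the standard library (the toy outlook/temperature lists and the two-standard-deviation outlier rule are illustrative choices, not from the slides):

```python
from collections import Counter
import statistics

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "sunny"]
temperature = [64, 68, 70, 71, 72, 75, 250]   # 250 looks like a data-entry error

# Nominal attribute: a quick text histogram of value counts
for value, count in Counter(outlook).most_common():
    print(f"{value:9s} {'*' * count}")

# Numeric attribute: flag values more than two standard deviations from the mean
mean, sd = statistics.mean(temperature), statistics.pstdev(temperature)
outliers = [t for t in temperature if abs(t - mean) > 2 * sd]
print("possible outliers:", outliers)
```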
The ARFF format
%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
Simple Discretization Methods
• What if we don't know which subranges make sense?
• Equal‐width binning divides the range of possible values into N subranges of the same size.
– bin width = (max value – min value) / N
– example: if the observed values are all between 0‐100, we could create 5 bins as follows:
width = (100 – 0)/5 = 20
bins: [0‐20], (20‐40], (40‐60], (60‐80], (80‐100]
– problems with this equal‐width approach?
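A minimal pure-Python sketch of equal-width binning (the function name and the choice to clamp the maximum value into the last bin are ours); run on the ten example values from the next slide, it also illustrates the usual problem with the equal-width approach: the bins get very uneven counts:

```python
def equal_width_bins(values, n_bins):
    """Divide the observed range into n_bins subranges of equal width
    and report which bin each value falls into."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]

    def bin_index(v):
        # Clamp the maximum value into the last bin
        return min(int((v - lo) / width), n_bins - 1)

    return edges, [bin_index(v) for v in values]

values = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
edges, assigned = equal_width_bins(values, 5)
print(edges)     # [5.0, 23.0, 41.0, 59.0, 77.0, 95.0]
print(assigned)  # [0, 0, 0, 1, 3, 4, 4, 4, 4, 4] -- bin 2 is empty, bin 4 holds half the data
```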
Simple Discretization Methods (cont.)
• Equal-frequency or equal-height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
– example: let's say we have 10 training examples with the following values for the attribute that we're discretizing:
5, 7, 12, 35, 65, 82, 84, 88, 90, 95
to create 5 bins, we would divide up the range of values so that each bin holds 2 of the training examples
To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary.
final bins: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf)
– Problems with this approach?
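The halfway-boundary rule can be checked in code on the same ten values; a minimal pure-Python sketch (the helper name is ours):

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Place the same number of training values in each bin; each boundary
    is halfway between the neighbouring values on either side."""
    per_bin = len(sorted_values) // n_bins
    boundaries = []
    for i in range(1, n_bins):
        left = sorted_values[i * per_bin - 1]   # last value of the bin before the cut
        right = sorted_values[i * per_bin]      # first value of the bin after the cut
        boundaries.append((left + right) / 2)
    return boundaries

values = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_bins(values, 5))  # [9.5, 50.0, 83.0, 89.0]
```

These four boundaries reproduce the final bins shown above: (-inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf).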
Other Discretization Methods
• Ideally, we'd like to come up with bins that capture distinctions that will be useful in data mining.
– example: if we're discretizing body temperature, we'd like the discretization method to learn that 98.6 F is an important boundary value
– more generally, we want to capture distinctions that will help us to learn to predict/estimate the class of an example
Other Discretization Methods
• Both equal-width and equal-frequency binning are considered unsupervised methods, because they don't take into account the class values of the training examples
• There are supervised methods for discretization that attempt to take the class values into account
– Minimum bucket size
Discretization in Weka
• In Weka, you can discretize an attribute by applying the appropriate filter to it
• After loading in the dataset in the Preprocess tab, click the Choose button in the Filter portion of the tab
• For equal-width or equal-height, you choose the Discretize option in the filters/unsupervised/attribute folder
– by default, it uses equal-width binning
– to use equal-frequency binning instead, click on the name of the filter and set the useEqualFrequency parameter to True
• For supervised discretization, choose the Discretize option in the filters/supervised/attribute folder
Nominal Attributes with Numeric Values
• Some attributes that use numeric values may actually be nominal attributes
– the attribute has a small number of possible values
– there is no ordering to the values, and you would never perform mathematical operations on them
– example: an attribute that uses numeric codes for medical diagnoses
• 1 = Strep Throat, 2 = Cold, 3 = Allergy
Nominal Attributes with Numeric Values
• If you load a comma‐separated‐value file containing such an attribute, Weka will assume that it is numeric
• To force Weka to treat an attribute with numeric values as nominal, use the NumericToNominal option in the filters/unsupervised/attribute folder
– click on the name of the filter, and enter the number(s) of the attributes you want to convert
• Or edit the ARFF file manually…
Handling Missing Values
• Options:
– Ignore them
• PRISM and ID3 won’t work at all
• Naïve Bayes handles them fine
• J48 and nearest neighbor use tricks to get around
– Remove all instances with missing attributes
• Unsupervised RemoveWithValues attribute filter in Weka
– Replace missing values with the most common value for that attribute
• Unsupervised ReplaceMissingValues attribute filter in Weka
• Only works with nominal values
• Issues?
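A sketch of the "replace with the most common value" option in plain Python (the replace_missing helper and the "?" marker are illustrative; Weka's ReplaceMissingValues filter is the real tool):

```python
from collections import Counter

def replace_missing(column, missing="?"):
    """Replace missing entries in a nominal column with the most common
    observed value in that column."""
    observed = [v for v in column if v != missing]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v == missing else v for v in column]

outlook = ["sunny", "?", "rainy", "sunny", "overcast", "?"]
print(replace_missing(outlook))
# ['sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny']
```

One issue this makes visible: every missing entry is filled with the same value, which can inflate the apparent frequency of the majority category.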
Handling Noisy Data
• Noise:
– random error or variance in a measured attribute
– outlier values
– more generally: non‐predictive values
• Combined computer and human inspection
– detect suspicious values and check by human
– data visualization the key tool
• Clustering
– detect and remove outliers
– also employs data viz
• Regression
– smooth by fitting the data into regression functions
• Binning
– employs techniques similar to discretizing
Simple Binning Method
• Sorted attribute values:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal‐depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin averages:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
• Note how smoothing mitigates a noisy/outlier value
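The two smoothing schemes above can be sketched in a few lines of Python (the helper names are ours); both reproduce the bin values shown on the slide:

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin average."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins hold sorted values
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```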
Data Reduction
• Problematic attributes include:
– irrelevant attributes: ones that don't help to predict the class
• despite their irrelevance, the algorithm may erroneously include them in the model
– attributes that cause overfitting
• example: a unique identifier such as Patient ID
– redundant attributes: those that offer basically the same information as another attribute
• example: in many problems, date-of-birth and age provide the same information
• some algorithms may end up giving the information from these attributes too much weight
Data Reduction
• We can remove an attribute manually in Weka by clicking the checkbox next to the attribute in the Preprocess tab and then clicking the Remove button
– How to determine which to remove?
• Experimentation
• Correlation analysis (filters in Weka)
• Undoing preprocess actions:
– In the Preprocess tab, the Undo button allows you to undo actions that you perform, including:
• applying a filter to a dataset
• manually removing one or more attributes
– If you apply two filters without using Undo in between the two, the second filter will be applied to the results of the first filter
– Undo can be pressed multiple times to undo a sequence of actions
Dividing Up the Data File
• To allow us to validate the model(s) learned in data mining, we'll divide the examples into two files:
– n% for training
– (100 – n)% for testing: these should not be touched until you have finalized your model or models
– possible splits:
• 67/33
• 80/20
• 90/10
• Alternative to ten‐fold cross validation when you have a sufficiently large dataset
Dividing Up the Data File
• You can use Weka to split the dataset for you after you perform whatever reformatting/editing is needed
• If you discretize one or more attributes, you need to do so before you divide up the data file
– otherwise, the training and test sets will be incompatible
Dividing Up the Data File (cont.)
• Here's one way to do it in Weka:
1) shuffle the examples by choosing the Randomize filter from the filters/unsupervised/instance folder
2) save the entire file of shuffled examples in Arff format.
3) use the RemovePercentage filter from the same folder to remove some percentage of the examples
• whatever percentage you're using for the training set
• click on the name of the filter to set the percentage
4) save the remaining examples in a new file
• this will be our test data
5) load the full file of shuffled examples back into Weka
6) use RemovePercentage again with the same percentage as before, but set invertSelection to True
7) save the remaining examples in a new file
• this will be our training data
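Outside Weka, the same shuffle-then-split recipe is a few lines of Python (the function name and the fixed seed are illustrative choices):

```python
import random

def split_dataset(examples, train_fraction=0.67, seed=42):
    """Shuffle, then split into train/test sets: a plain-Python version of
    the Randomize + RemovePercentage recipe described above."""
    shuffled = examples[:]                 # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = split_dataset(data, train_fraction=0.8)
print(len(train), len(test))  # 80 20
```

Fixing the seed makes the split reproducible, which matters when you need the same train/test division across experiments.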