Dr. M. Sulaiman Khan ([email protected]) Dept. of Computer Science University of Liverpool 2010


COMP207: Data Mining

General Data Mining Issues


Machine Learning?
Input to Data Mining Algorithms
    Data types
    Missing values
    Noisy values
    Inconsistent values
    Redundant values
    Number of values
Over-fitting / Under-fitting
Scalability
Human Interaction
Ethical Data Mining

Today's Topics



What do we mean by 'learning' when applied to machines?

Not just committing to memory (= storage)
Can't require consciousness
Learn facts (data), or processes (algorithms)?

“Things learn when they change their behaviour in a way that makes them perform better” (Witten)

Ties to future performance, not the act itself
But things change behaviour for reasons other than 'learning'
Can a machine have the intent to perform better?

Machine Learning



The aim of data mining is to learn a model for the data. This could be called a concept of the data, so our outcome will be a concept description.

Eg, the task is to classify emails as spam/not spam. The concept to learn is 'what is spam?'

Input comes as instances. Eg, the individual emails.

Instances have attributes. Eg sender, date, recipient, words in text

Inputs



Use the attributes to determine what it is about an instance that means it should be assigned to a particular class. == Learning!

Obvious input structure: Table of instances (rows) and attributes (columns)

Inputs



IRIS DATA   Sepal Length   Sepal Width   Petal Length   Petal Width
Flower1     5.1            3.5           1.4            0.2
Flower2     4.9            3.0           1.4            0.2
Flower3     4.7            3.2           1.3            0.2
FlowerN     5.0            3.6           1.4            0.2

@relation Iris

@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric

@data
5.1, 3.5, 1.4, 0.2
4.9, 3.0, 1.4, 0.2
4.7, 3.2, 1.3, 0.2
5.0, 3.6, 1.4, 0.2
...

But what about non-numeric data?

WEKA's ARFF Format



Nominal: Prespecified, finite number of values
    eg: {cat, fish, dog, squirrel}
    Includes boolean {true, false} and all enumerations.

Ordinal: Orderable, but no concept of distance
    eg: hot > warm > cool > cold
    Domain specific ordering, but no notion of how much hotter warm is compared to cool.

Data Types



Interval: Ordered, fixed unit
    eg: 1990 < 1995 < 2000 < 2005
    Difference between values makes sense (1995 is 5 years after 1990)
    Sum does not make sense (1990 + 1995 = year 3985??)

Ratio: Ordered, fixed unit, relative to a zero point
    eg: 1m, 2m, 3m, 5m
    Difference makes sense (3m is 1m greater than 2m)
    Sum makes sense (1m + 2m = 3m)

Data Types
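The four levels of measurement differ in which operations are meaningful. A minimal Python sketch using the slide examples; the function names and the `TEMPERATURE_ORDER` list are illustrative assumptions, not part of any library.

```python
# Nominal: only equality and membership tests are meaningful.
animals = {"cat", "fish", "dog", "squirrel"}
print("cat" in animals, "cat" == "dog")  # True False

# Ordinal: an explicit, domain-specific ordering; no notion of distance.
TEMPERATURE_ORDER = ["cold", "cool", "warm", "hot"]  # lowest to highest

def ordinal_less_than(a, b, order=TEMPERATURE_ORDER):
    """Compare two ordinal values by their position in the ordering."""
    return order.index(a) < order.index(b)

# Interval: differences are meaningful, sums are not.
def interval_difference(a, b):
    """Distance from a to b on an interval scale (e.g. years)."""
    return b - a

# Ratio: a true zero point, so sums and ratios are meaningful as well.
def ratio(a, b):
    """How many times larger a is than b on a ratio scale (e.g. metres)."""
    return a / b

print(ordinal_less_than("cool", "warm"))   # True
print(interval_difference(1990, 1995))     # 5
print(1.0 + 2.0, ratio(3.0, 1.0))          # 3.0 3.0
```

Note that nothing in Python stops you adding two years together; keeping track of which scale an attribute is on is the job of the person preparing the data.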



Nominal:
    @attribute name {option1, option2, ... optionN}

Numeric:
    @attribute name numeric    -- real values

Other:
    @attribute name string     -- text fields
    @attribute name date       -- date fields (ISO-8601 format)

ARFF Data Types
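To make the file layout concrete, here is a rough Python sketch of a reader for a small subset of ARFF: `@relation`, `@attribute` with numeric or nominal types, and comma-separated `@data` rows, with `?` read as a missing value. It is not WEKA's parser and ignores quoting, sparse data and date formats. The `species` attribute and its labels in the sample are invented for the example.

```python
def parse_arff(text):
    """Parse a small subset of ARFF: numeric and nominal attributes only."""
    relation = None
    attributes = []   # (name, kind); kind is a type name or a list of labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for value, (_, kind) in zip(values, attributes):
                if value == "?":
                    row.append(None)            # missing value
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)           # nominal label kept as text
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """@relation Iris
@attribute sepal_length numeric
@attribute species {setosa, versicolor}
@data
5.1, setosa
?, versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)   # Iris
print(rows)       # [[5.1, 'setosa'], [None, 'versicolor']]
```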



The following issues will come up over and over again, but different algorithms have different requirements.

What happens if we don't know the value for a particular attribute in an instance?

For example, the data was never stored, was lost, or could not be represented.

Maybe that data was important!

ARFF records missing values with a ? in the table.

How should we process missing values?

Data Issues: Missing Values



Possible 'solutions' for dealing with missing values:

Ignore the instance completely (eg class missing in training data set)
    Not a very useful solution if the instance is in the test data to be classified!
Fill in values by hand
    Could be very slow, and likely to be impossible
Global 'missingValue' constant
    Possible for enumerations, but what about numeric data?
Replace with attribute mean
Replace with class's attribute mean
Train new classifier to predict missing value!
Just leave as missing and require algorithm to apply appropriate technique

Missing Values
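Two of the options listed, replacing with the attribute mean and replacing with the class's attribute mean, can be sketched in a few lines of Python. Missing values are represented as `None`; the function names and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Replace None with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    by_class = defaultdict(list)
    for v, c in zip(values, classes):
        if v is not None:
            by_class[c].append(v)
    fills = {c: mean(vs) for c, vs in by_class.items()}
    return [fills[c] if v is None else v for v, c in zip(values, classes)]

petal_length = [1.0, None, 4.0, 4.0, None]
species = ["setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(impute_mean(petal_length))                 # [1.0, 3.0, 4.0, 4.0, 3.0]
print(impute_class_mean(petal_length, species))  # [1.0, 1.0, 4.0, 4.0, 4.0]
```

The class mean needs the class label, so it is only available for training data; the overall mean ignores the class and here fills the setosa gap with a value that no setosa instance is close to.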



By 'noisy data' we mean random errors scattered in the data.

For example, due to inaccurate recording or data corruption.

Some noise will be very obvious:
    data has incorrect type (string in numeric attribute)
    data does not match enumeration (maybe in yes/no field)
    data is very dissimilar to all other entries (10 in an attribute otherwise 0..1)

Some incorrect values won't be obvious at all. Eg typing 0.52 at data entry instead of 0.25.

Noisy Values



Some possible solutions:

Manual inspection and removal
Use clustering on the data to find instances or attributes that lie outside the main body (outliers) and remove them
Use regression to determine function, then remove those that lie far from the predicted value
Ignore all values that occur below a certain frequency threshold
Apply smoothing function over known-to-be-noisy data

Once a noisy value is removed, missing value techniques can be applied to the gap it leaves. If noise is not removed, it may adversely affect the accuracy of the model.

Noisy Values
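The regression option can be sketched as follows: fit a straight line by ordinary least squares, then flag the points whose residual is more than `k` standard deviations from the fit. The straight-line model, the threshold `k = 2` and the sample data are all assumptions chosen for illustration; real data may need a different function and a more robust measure of spread.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def flag_noisy(xs, ys, k=2.0):
    """Return the indices whose residual exceeds k standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40                   # simulated recording error (true value is 11)

print(flag_noisy(xs, ys))    # [5]
```

The noisy point also pulls the fitted line towards itself, so with several bad values the threshold becomes less reliable.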



Some values may be recorded in different ways.
For example 'coke', 'coca cola', 'coca-cola', 'Coca Cola' etc.

In this case, the data should be normalised to a single form. Can be treated as a special case of noise.

Some values may be recorded inaccurately on purpose!

Email address: r.d.nospam.sanderson@...

Spike in early census data for births on 11/11/1911. Had to put in some value, so defaulted to 1s everywhere. Ooops!

(Possibly urban legend?)

Inconsistent Values
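Normalising to a single form usually combines mechanical clean-up (case, hyphens, spacing) with a lookup table for true synonyms. A small Python sketch; the `ALIASES` table is hypothetical and would have to be built for each data set.

```python
import re

# Hypothetical alias table: maps cleaned-up variants to one canonical form.
ALIASES = {
    "coke": "coca cola",
    "coca cola": "coca cola",
}

def normalise(value):
    """Lower-case, turn hyphens and underscores into spaces, collapse
    whitespace, then look the result up in the alias table."""
    cleaned = re.sub(r"[-_\s]+", " ", value.strip().lower())
    return ALIASES.get(cleaned, cleaned)

for raw in ["coke", "coca cola", "coca-cola", "Coca Cola"]:
    print(raw, "->", normalise(raw))   # each maps to: coca cola
```

Values that are not in the table are only cleaned, not merged, so unknown variants still need manual review.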



Just because the base data includes an attribute doesn't make it worth giving to the data mining task.

For example, denormalise a typical commercial database and you might have:

ProductId, ProductName, ProductPrice, SupplierId, SupplierAddress...

SupplierAddress is dependent on SupplierId (remember SQL normalisation rules?) so they will always appear together.

A 100% confident, 100% support association rule is not very interesting!

Redundant Values
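One simple way to spot this kind of redundancy before mining is to check whether one attribute fully determines another in the data. A Python sketch; the `determines` helper and the sample rows are invented for illustration.

```python
def determines(rows, a, b):
    """True if every value of attribute a maps to exactly one value of b."""
    seen = {}
    for row in rows:
        # setdefault stores the first b seen for this a, then returns it
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True

rows = [
    {"ProductId": 1, "SupplierId": 10, "SupplierAddress": "1 Dock Road"},
    {"ProductId": 2, "SupplierId": 10, "SupplierAddress": "1 Dock Road"},
    {"ProductId": 3, "SupplierId": 20, "SupplierAddress": "5 Mill Lane"},
]

print(determines(rows, "SupplierId", "SupplierAddress"))  # True
print(determines(rows, "SupplierAddress", "ProductId"))   # False
```

If `determines` holds for a pair, one of the two attributes can usually be dropped. On a small sample the dependency may hold by chance, so the result is a hint rather than proof.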



Is there any harm in putting in redundant values? Yes for association rule mining, and ... yes for other data mining tasks too.

Can treat text as thousands of numeric attributes: term/frequency from our inverted indexes.

But not all of those terms are useful for determining (for example) if an email is spam. 'the' does not contribute to spam detection.

The number of attributes in the table will affect the time it takes the data mining process to run. It is often the case that we want to run it many times, so getting rid of unnecessary attributes is important.

Number of Attributes



Reducing the number of attributes is called 'dimensionality reduction'.

We'll look at techniques for this later in the course, but some simplistic versions:

Apply upper and lower thresholds of frequency
Noise removal functions
Remove redundant attributes
Remove attributes below a threshold of contribution to classification
    (Eg if attribute is evenly distributed, adds no knowledge)

Number of Attributes/Values
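For text treated as term attributes, the frequency thresholds can be sketched as below: keep only the terms that occur in at least `min_df` documents and in no more than a fraction `max_df_ratio` of them. The parameter names, default values and the four sample documents are assumptions for the example.

```python
from collections import Counter

def select_terms(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies between the two thresholds.

    documents: list of token lists. min_df is an absolute count;
    max_df_ratio is a fraction of the number of documents.
    """
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))          # count each term once per document
    max_df = max_df_ratio * len(documents)
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)

docs = [
    "win the prize now".split(),
    "the meeting is at noon".split(),
    "claim the prize today".split(),
    "the report is attached".split(),
]

print(select_terms(docs))  # ['is', 'prize']
```

Here 'the' is dropped because it appears in every document, and the terms that appear only once are dropped by the lower threshold.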



Learning a concept must stop at the appropriate time.

For example, could express the concept of 'Is Spam?' as a list of spam emails. Any email identical to those is spam.

Accuracy: 100% on the training data, but 0% on any new spam that is not identical to a stored email.

Ooops! This is called Over-Fitting. The concept has been tailored too closely to the training data.

Story: US Military trained a neural network to distinguish tanks vs rocks.

It would shoot the US tanks they trained it on very consistently and never shot any rocks ... or enemy tanks. [probably fiction, but amusing]

Over-Fitting / Under-Fitting
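The 'list of spam emails' concept is easy to write down in Python, which makes the failure visible. The class name and the sample emails are invented for this sketch.

```python
class MemorisingSpamFilter:
    """Over-fitted 'concept': an email is spam only if it was seen as spam."""

    def fit(self, emails, labels):
        self.known_spam = {e for e, is_spam in zip(emails, labels) if is_spam}
        return self

    def predict(self, email):
        return email in self.known_spam

def accuracy(model, emails, labels):
    """Fraction of emails whose predicted label matches the true label."""
    hits = sum(model.predict(e) == y for e, y in zip(emails, labels))
    return hits / len(emails)

train_emails = ["cheap pills", "lunch at one?", "win cash now"]
train_labels = [True, False, True]
new_spam = ["cheap pillz", "win ca$h now"]      # unseen variants of spam

model = MemorisingSpamFilter().fit(train_emails, train_labels)
print(accuracy(model, train_emails, train_labels))   # 1.0
print(accuracy(model, new_spam, [True, True]))       # 0.0
```

Measuring accuracy on data that was not used for training is what exposes the problem; the training score alone looks perfect.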



Extreme case of over-fitting:

Algorithm tries to learn a set of rules to determine class.

Rule1: attr1=val1/1 and attr2=val2/1 and attr3=val3/1 = class1
Rule2: attr1=val1/2 and attr2=val2/2 and attr3=val3/2 = class2

Urgh. One rule for each instance is useless.

Need to prevent the learning from becoming too specific to the training set, but also don't want it to be too broad. Complicated!




Extreme case of under-fitting:

Always pick the most frequent class, ignore the data completely.

Eg: if one class makes up 99% of the data, then a 'classifier' that always picks this class will be correct 99% of the time!

But probably the aim of the exercise is to determine the 1%, not the 99%... making it accurate 0% of the time when you need it.
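The 99% example can be reproduced directly. In this Python sketch the class labels 'ok' and 'fraud' are invented; the point is that overall accuracy and accuracy on the rare class tell very different stories.

```python
from collections import Counter

def majority_class(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]

labels = ["ok"] * 99 + ["fraud"]          # 99% / 1% class split
guess = majority_class(labels)
predictions = [guess] * len(labels)       # ignore the data completely

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Accuracy restricted to the instances that really are the rare class.
rare = [p for p, y in zip(predictions, labels) if y == "fraud"]
recall_rare = sum(p == "fraud" for p in rare) / len(rare)

print(guess, accuracy, recall_rare)  # ok 0.99 0.0
```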




We may be able to reduce the number of attributes, but most of the time we're not interested in small 'toy' databases, but huge ones.

When there are millions of instances, and thousands of attributes, that's a LOT of data to try to find a model for.

Very important that data mining algorithms scale well.
    Can't keep all data in memory
    Might not be able to keep all results in memory either
    Might have access to distributed processing?
    Might be able to train on a sample of the data?

Scalability
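For the 'train on a sample' option, one well-known technique (not named on the slide) is reservoir sampling, which draws a uniform random sample of `k` instances from a stream without holding the whole data set in memory. A Python sketch; the function name, `k` and the seed are choices made for this example.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable whose
    length is not known in advance (fewer if the stream is shorter)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < k:
                sample[j] = item           # happens with probability k / (i + 1)
    return sample

# Instances are read one at a time, so only k of them are held in memory.
instances = (n for n in range(100_000))
subset = reservoir_sample(instances, k=5, seed=42)
print(subset)
```

Whether a model trained on the sample is good enough still has to be checked against the full data or a held-out part of it.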



Problem Exists Between Keyboard And Chair.

Data Mining experts are probably not experts in the domain of the data.
    Need to work together to find out what is needed, and formulate queries
    Need to work together to interpret and evaluate results
Visualisation of results may be problematic
Integrating into the normal workflow may be problematic
How to apply the results appropriately may not be clear (eg Barbie + Chocolate?)

Human Interaction



Just because we can doesn't mean we should.

Should we include marital status, gender, race, religion or other attributes about a person in a data mining experiment? Discrimination?

But sometimes those attributes are appropriate and important ... medical diagnosis, for example.

What about attributes that are dependent on 'sensitive' attributes? Neighbourhoods have different average incomes... discriminating against the poor by using location?

Privacy issues?
Data Mining across time?
Government sponsored data mining?

Ethical Data Mining

