Copyright © Curt Hill 2003-2013 Data Mining A Brief Overview.
Post on 31-Dec-2015
Copyright © Curt Hill 2003-2013
Data Mining
A Brief Overview
The Problem
• Huge volumes of data overwhelm traditional methods of data analysis such as:
  – Spreadsheets
  – Ad hoc queries
  – Multidimensional analysis tools
  – Statistical analysis packages
What is Data Mining?
• Exploratory data analysis based on a data warehouse
  – Knowledge Discovery in Databases (KDD)
• Data Mining extracts previously unknown and potentially useful information
  – Rules, constraints, correlations, patterns, signatures and irregularities
• The goal is to automate the methods for finding these in the data
Data Warehouse
• A database usually separated from the operational database(s)
• Used as a base for decision support systems
  – Upper and middle management
  – Not used for day to day management but for spotting trends and making path decisions
• Typically very large and composed of recent copies from the operational database(s)
• Data Mining is one of the applications that could use a data warehouse
Goals of Data Mining
• Prediction of future behaviors
  – Seasonal or non-seasonal trends
  – How will consumers respond to discounts?
  – Allows the enterprise to be ready
• Identification of item, event or activity
  – Intruders may be identified by the files they access or programs they use
Goals Again
• Classification of categories of users or products
  – Shoppers may be categorized as:
    • Discount seeking
    • Rush
    • Regular
    • Attached to certain brand names
  – The store may be made more friendly to such shoppers
• Optimize the use of time, space, materials and money
Knowledge Discovery
• There are several types of discoverable knowledge
  – Association Rules
  – Classification hierarchies
  – Sequential patterns
  – Time series patterns
  – Clustering
• Each of these needs more information
Association Rules
• What we are looking for is knowledge of associations that are not obvious
• This has gained traction in market basket research
  – Very profitable information
• If an MRI has characteristics a and b then it often has c
  – This is an association rule
Market Basket Model
• Premise: the items in a checkout transaction are not random
• Thus we analyze customer transactions for patterns or association rules
• These patterns may guide decisions on
  – Sale items
  – Shelf arrangement or product placement
Retail Example
• A young father goes to the store to buy disposable diapers
• On his way through the store he sees a Sports Illustrated and buys it
• In general, people do not impulse buy disposable diapers, but while buying these, they may buy something else on impulse
• Can we examine retail transaction records and perceive the connection?
Association Rule
• Is of the form: X => Y
  – Where both X and Y could be sets of items
• The support of this rule is the percent of total transactions that have both X and Y
• The confidence of this rule is the number of transactions that have both X and Y divided by the number of transactions that have X
• High support and high confidence indicate a rule on which business decisions may be based
  – Put the magazine rack on the route to the diapers
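A minimal sketch of the two measures, assuming the usual definitions: support is the fraction of all transactions containing both X and Y, and confidence is that count divided by the count of transactions containing X. The function names and the sample baskets are made up for illustration.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def support(transactions, x, y):
    """Fraction of all transactions that contain both X and Y."""
    both = set(x) | set(y)
    return count_containing(transactions, both) / len(transactions)


def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    both = set(x) | set(y)
    return count_containing(transactions, both) / count_containing(transactions, x)


# Hypothetical checkout baskets, invented for this example.
baskets = [
    {"diapers", "magazine"},
    {"diapers", "magazine", "milk"},
    {"diapers"},
    {"milk", "bread"},
    {"magazine"},
]

print(support(baskets, {"diapers"}, {"magazine"}))
print(confidence(baskets, {"diapers"}, {"magazine"}))
```

Here two of the five baskets hold both items, and two of the three diaper baskets also hold a magazine. Note that `confidence` divides by zero if X never occurs, so a real tool would guard against that.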
Agriculture Example
• Landsat satellites are in polar orbits
• They record data on all land every 18 days
• A pixel is approximately 31 yards on a side
• Seven bands from near infrared to ultraviolet are recorded for each pixel
• Each band produces a 1-byte value
• Can you get this data in a spreadsheet?
Agricultural Rule
• In middle summer a near infrared value in the range 48 to 255 and red in the range 0 to 31 suggest that the yield will be 128 to 255 bushels per acre
• If the support and confidence are high this suggests that the farmer should apply nitrogen to the areas where near infrared was below 48 and red was above 31
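The rule can be read as a simple test on two band values per pixel. The sketch below assumes each pixel is stored as a (near infrared, red) pair of 1-byte values keyed by a (row, column) position. It also assumes the nitrogen areas are the pixels falling outside both ranges, which is one reading of the slide. The function names and the data layout are hypothetical.

```python
def predicts_high_yield(near_infrared, red):
    """True when the pixel falls inside both ranges named by the rule."""
    return 48 <= near_infrared <= 255 and 0 <= red <= 31


def nitrogen_candidates(pixels):
    """Positions whose near infrared is below 48 and whose red is above 31."""
    return [pos for pos, (nir, red) in pixels.items() if nir < 48 and red > 31]


# Hypothetical field: (row, column) -> (near infrared, red)
field = {
    (0, 0): (100, 10),
    (0, 1): (40, 50),
    (1, 0): (40, 10),
}

print(predicts_high_yield(*field[(0, 0)]))
print(nitrogen_candidates(field))
```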
Computational Difficulties
• Consider how many tickets a supermarket or department store might generate
• In general, most of these tickets have more than two or three items
• The store carries thousands of items
• Discovering these association rules becomes computationally taxing
• One good reason to keep this off of the operational database
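To get a feel for the size of the search, the number of possible itemsets of a given size can be counted directly. This is only a rough illustration; the figure of 5,000 items is an assumption, not a number from the slides.

```python
from math import comb


def candidate_itemsets(num_items, size):
    """Number of distinct itemsets of the given size drawn from num_items products."""
    return comb(num_items, size)


# A hypothetical store carrying 5,000 items.
for size in (2, 3, 4):
    print(size, candidate_itemsets(5000, size))
```

The count grows very quickly with the itemset size, which is why algorithms try to discard most candidates without ever counting them.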
Algorithm Properties
• There are a number of algorithms for finding these rules
• These typically exploit two properties:
  – Downward closure
    • Every subset of a large itemset should also have large support
    • Removing a few items does not hurt
  – Antimonotonicity
    • Every superset of a small itemset should have small support
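One well-known algorithm that relies on these properties is Apriori. The slides do not name a specific algorithm, so the following is only a sketch of an Apriori-style level-wise search, with an invented function name and sample data. It builds candidate itemsets of size k only from large itemsets of size k-1, and drops any candidate that has a small subset before counting it.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is at least min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: single items that are large on their own.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join: merge large (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate with any small (k-1)-subset cannot be large.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical baskets, invented for this example.
sample = [
    {"diapers", "magazine", "milk"},
    {"diapers", "magazine"},
    {"diapers", "bread"},
    {"milk", "bread"},
]
result = frequent_itemsets(sample, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The sketch rescans every basket for each candidate, which is fine for a handful of baskets but not for real ticket volumes.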
Classification
• Classifying data into predetermined groups
• Then we can deal with the groups in different ways
• AKA supervised learning
  – Developed in the field of Artificial Intelligence
• The process of clustering is attempting to classify data in groups that are not predetermined
Models
• The two typical models are decision trees and a set of rules
• We look at the data to build the model and then use the model for new data
• Consider in the next slide a decision tree for granting a credit card to an applicant
Example: Decision Tree
Married
  Yes: Salary
    <25K: Poor
    25K to 75K: Fair
    >75K: Good
  No: Balance
    <5K: Poor
    >5K: Age
      <25: Fair
      >25: Good
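One reading of the tree on this slide can be written as nested tests. It assumes that married applicants are rated by salary, with the lowest band Poor and the highest Good, and that unmarried applicants are rated by balance and then age. How the exact boundary values (25K, 75K, 5K, age 25) are handled is also an assumption, since the slide does not say. The function name is hypothetical.

```python
def credit_risk(married, salary, balance, age):
    """Walk the decision tree and return 'Poor', 'Fair' or 'Good'."""
    if married:
        # Married branch: decide on salary alone.
        if salary < 25_000:
            return "Poor"
        if salary > 75_000:
            return "Good"
        return "Fair"
    # Unmarried branch: decide on balance, then on age.
    if balance < 5_000:
        return "Poor"
    return "Fair" if age < 25 else "Good"


print(credit_risk(married=True, salary=50_000, balance=0, age=40))
print(credit_risk(married=False, salary=0, balance=10_000, age=22))
```

Once the tree has been built from existing data, classifying a new applicant costs only a few comparisons.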
Clustering
• AKA unsupervised learning
• Classify the data into groups that you are not aware of to begin with
• A distance function must be supplied that describes the distance between two points
  – The points are often not purely numeric
  – They are often not in 2 dimensions or even 3 which makes things interesting
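As an illustration of a distance function for points that are not purely numeric, here is a Gower-style distance. This technique is not named in the slides and is only one possible choice. Numeric fields contribute their absolute difference scaled by an assumed value range, other fields contribute 0 when equal and 1 when different, and the result is the average over all fields. The record layout and field names are invented.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance between two records with the same keys.

    numeric_ranges maps each numeric field to its (low, high) value range;
    every other field is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            low, high = numeric_ranges[key]
            total += abs(a[key] - b[key]) / (high - low)
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical shopper records and value ranges.
ranges = {"age": (20, 60), "income": (0, 100_000)}
shopper_a = {"age": 30, "income": 40_000, "brand": "A"}
shopper_b = {"age": 50, "income": 40_000, "brand": "B"}

print(mixed_distance(shopper_a, shopper_b, ranges))
```

The same function works unchanged for records with many more fields, which matches the point that the data are rarely in two or three dimensions.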
Applications
• Marketing
  – Determine advertising, store placement, segmentation of customers
• Finance
  – Analysis of performance of securities
• Manufacturing
  – Optimizing resources, designing the manufacturing process
• Health Care
  – Discovery of items in X-Ray and MRI images
Example
• Certain diseases switch on genes characteristic of that disease
• Drugs often switch off a gene
• In 2011 a database of genes and what affected them was mined
• The result was that mice with small cell lung cancer were treated with an antidepressant, imipramine
  – The tumors were reduced
Telco Example
• A local telephone company mines its connection data for possible marketing opportunities
• A phone very busy in the 3PM to 6PM range suggests a teenager
  – Pitch a teen phone
• A phone busy in the 9AM to 5PM range suggests a home business
  – Pitch a business line
Social Media
• Publicly viewable social media presents a very large quantity of data
• However it is:
  – Noisy
  – Unstructured
  – Dynamic
• It is of great interest in political campaigns, marketing, health care
  – This is where people express things first
Finally
• Much of the analysis done in data mining has been done for centuries
  – What is different now is the amount and types of captured data
• There are a number of commercial tools for mining
• Many large companies have substantial investment and return on their mining activities