Part I: Introductory Materials Introduction to Data Mining Dr. Nagiza F. Samatova Department of Computer Science North Carolina State University and Computer Science and Mathematics Division Oak Ridge National Laboratory
Feb 25, 2016
What is common among all of them?
Who are the data producers? What data?
Application → Data
• Application Category: Finance
  – Producer: Wall Street
  – Data: stocks, stock prices, stock purchases, …
• Application Category: Academia
  – Producer: NCSU
  – Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.)
Application Categories
• Finance (e.g., banks)
• Entertainment (e.g., games)
• Science (e.g., weather forecasting)
• Medicine (e.g., disease diagnostics)
• Cybersecurity (e.g., terrorists, identity theft)
• Commerce (e.g., e-Commerce)
• …
What questions to ask about the data?
Data → Questions
• Academia: NCSU: Admission data
  1. Is there any correlation between the students' GRE scores and their successful completion of a PhD program?
  2. What are the groups of students that share common academic performance?
  3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?
  4. If a student majors in Physics, what other major is he/she likely to take as a double major?
Questions by Types?
• Correlation, similarity, comparison, …
• Association, causality, co-occurrence, …
• Grouping, clustering, …
• Categorization, classification, …
• Frequency or rarity of occurrence, …
• Anomalous or normal objects, events, behaviors, …
• Forecasting: future classes, future activity, …
• …
What information do we need to answer?
Questions → Data Objects and Object Features
• Academia: NCSU: Admission data
  – Objects: Students
  – Object's Features = Variables = Attributes = Dimensions & Types
    • Name: String (e.g., Name=Neil Shah)
    • GPA: Numeric (e.g., GPA=5.0)
    • Recommendation: Text (e.g., "… the top 2% in my career…")
    • Etc.
How to compare two objects?
Data Object → Object Pairs
• Academia: NCSU: Admission data
  – Objects: Students
  – Based on a single feature:
    • Similar GPA
    • The same first letter in the last name
  – Based on a set of features:
    • Similar academic records (GPA, GRE, etc.)
    • Similar demographic records
  – Can you compute a numerical value for your similarity measure used for comparison? Why or why not?
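One way to explore the last question: a similarity measure becomes numeric once each feature is mapped to a number. The sketch below is only an illustration under stated assumptions. The two student records, the 0-5 GPA scale, the 260-340 GRE range, and the choice to average per-feature similarities are all hypothetical and not prescribed by the slides.

```python
# Hypothetical student records; names and values are illustrative only.
alice = {"last_name": "Adams", "gpa": 3.8, "gre": 325}
bob = {"last_name": "Avery", "gpa": 3.5, "gre": 310}


def gpa_similarity(a, b, gpa_range=5.0):
    """Single numeric feature: 1.0 for identical GPAs, 0.0 for GPAs a full scale apart.

    gpa_range is an assumed 0-5 scale; adjust it for other grading systems.
    """
    return 1.0 - abs(a["gpa"] - b["gpa"]) / gpa_range


def same_initial(a, b):
    """Single categorical feature: 1.0 if the last names share a first letter, else 0.0."""
    return 1.0 if a["last_name"][0].lower() == b["last_name"][0].lower() else 0.0


def academic_similarity(a, b, gre_range=80.0):
    """Set of features: average of the GPA and GRE similarities.

    gre_range assumes total GRE scores between 260 and 340.
    """
    gre_sim = 1.0 - abs(a["gre"] - b["gre"]) / gre_range
    return (gpa_similarity(a, b) + gre_sim) / 2.0
```

Numeric features map naturally onto a difference; the categorical feature only supports a match/no-match value, which is one reason the answer to "why or why not" depends on the feature type.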
How to represent data mathematically?
Data Object & its Features → Data Model
• What mathematical objects have you studied?
  – Scalars
  – Points
  – Vectors
  – Vector spaces
  – Matrices
  – Sets
  – Graphs, networks (maybe)
  – Tensors (maybe)
  – Time series (maybe)
  – Topological manifolds (maybe)
  – …
Data object as vector with components…
City=(Latitude, Longitude): a 2-dimensional object
• Raleigh=(35.46, 78.39)
• Boston=(42.21, 71.5)
Vector components:
• Features, or
• Attributes, or
• Dimensions
Proximity(Raleigh, Boston)=?
• Geodesic distance
• Euclidean distance
• Length of the interstate route
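Two of the proximity options listed above can be compared with a short sketch. It assumes the slide's coordinates are decimal degrees and negates the longitudes because both cities lie west of the prime meridian; both choices are assumptions. The geodesic is approximated by the haversine great-circle formula on a spherical Earth of radius 6371 km.

```python
import math

# Coordinates as printed on the slide, read here as decimal degrees (an assumption).
# Longitudes are negated because both cities are in the western hemisphere.
raleigh = (35.46, -78.39)
boston = (42.21, -71.5)


def euclidean_degrees(p, q):
    """Straight-line distance in the (latitude, longitude) plane, in degrees."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def geodesic_km(p, q, radius_km=6371.0):
    """Great-circle distance via the haversine formula on a spherical Earth."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


print(euclidean_degrees(raleigh, boston), geodesic_km(raleigh, boston))
```

The two measures return values in different units (degrees versus kilometers), so the choice of proximity function is part of the data model and not an afterthought.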
A set of data objects as vector spaces
[Figure: 3-dimensional vector space with axes Latitude, Longitude, Altitude; points Raleigh and Moscow]
Mining such data ~ studying vector spaces
Multi-dimensional vectors…
Student=(Name, GPA, Weight, Height, Income in K, …): multi-dimensional
• S1=(John Smith, 5.0, 180, 6.0, 200)
• S2=(Jane Doe, 3.0, 140, 5.4, 70)
Vector components:
• Features, or
• Attributes, or
• Dimensions
Proximity(S1, S2)=?
• How to compare when vector components are of heterogeneous types or on different scales?
• How to show the results of the comparison?
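The scale problem raised above can be seen numerically. The sketch below takes the numeric components of S1 and S2 from the slide and compares the raw Euclidean distance with the distance after min-max scaling. The per-feature ranges are illustrative assumptions, not values from the slides, and min-max scaling is only one of several possible normalizations.

```python
import math

# Numeric components of the slide's students: (GPA, Weight, Height, Income in K).
# The name is a string and is left out of the numeric comparison.
s1 = (5.0, 180.0, 6.0, 200.0)
s2 = (3.0, 140.0, 5.4, 70.0)

# Assumed (min, max) range per feature; these bounds are illustrative only.
ranges = [(0.0, 5.0), (80.0, 300.0), (4.0, 7.0), (0.0, 500.0)]


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def min_max_scale(vector, ranges):
    """Map each component to [0, 1] so that no single unit dominates the distance."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(vector, ranges))


raw_distance = euclidean(s1, s2)
scaled_distance = euclidean(min_max_scale(s1, ranges), min_max_scale(s2, ranges))
print(raw_distance, scaled_distance)
```

In the raw vectors the income difference (130) swamps the GPA difference (2.0); after scaling, each feature contributes on a comparable footing.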
as matrices…
Example: A collection of text documents on the Web
Original Documents
• D1: Child Safety at Home
• D2: Infant & Toddler First Aid
• D3: Your Baby's Health and Safety: From Infant to Toddler
Parsed Documents
• D1: Child Safety Home
• D2: Infant Toddler
• D3: Bab Health Safety Infant Toddler
t-d term-document matrix (Terms=Features=Dimensions)
Term          D1  D2  D3
T1: Bab        0   0   1
T2: Child      1   0   0
T3: Health     0   0   1
T4: Home       1   0   0
T5: Infant     0   1   1
T6: Safety     1   0   1
T7: Toddler    0   1   1
Mining such data ~ studying matrices
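The term-document matrix above can be rebuilt from the parsed documents in a few lines. This is a minimal sketch that starts from the already-parsed word lists as shown on the slide; the parsing step itself (stop-word removal, stemming) is not reproduced here.

```python
# Parsed documents exactly as listed on the slide.
parsed_docs = {
    "D1": ["Child", "Safety", "Home"],
    "D2": ["Infant", "Toddler"],
    "D3": ["Bab", "Health", "Safety", "Infant", "Toddler"],
}

# Alphabetical order of the distinct terms reproduces T1..T7.
terms = sorted({term for words in parsed_docs.values() for term in words})
doc_ids = sorted(parsed_docs)

# Binary term-document matrix: rows are terms, columns are documents.
td_matrix = [[1 if term in parsed_docs[d] else 0 for d in doc_ids] for term in terms]

for i, (term, row) in enumerate(zip(terms, td_matrix), start=1):
    print(f"T{i}: {term:<8}", *row)
```

Each column is the vector representation of one document, so the collection is a set of 7-dimensional vectors stored as a matrix.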
or as trees
[Figure: the t-d term-document matrix (terms T1-T7 by documents D1-D3) next to a tree with labels: document, terms, D2, D3]
Example groups of terms:
• president government party election political elected national districts held district independence vice minister parties
• population area climate city miles province land topography total season 1999 square rate
• economy million products 1996 growth copra economic 1997 food scale exports rice fish
Is D2 similar to D3?
What if there are 10,000 terms?
Mining such data ~ studying trees
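The question "Is D2 similar to D3?" can be given a numeric answer by comparing the documents' columns of the term-document matrix. The sketch below uses cosine and Jaccard similarity; both are common choices, but the slides do not name a specific measure, so treat the selection as an assumption.

```python
import math

# Columns of the slide's term-document matrix, ordered T1..T7.
d1 = [0, 1, 0, 1, 0, 1, 0]
d2 = [0, 0, 0, 0, 1, 0, 1]
d3 = [1, 0, 1, 0, 1, 1, 1]


def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(u, v):
    """Shared terms divided by terms present in either document."""
    shared = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return shared / either if either else 0.0


print(cosine(d2, d3), jaccard(d2, d3), cosine(d1, d2))
```

D2 and D3 share the terms Infant and Toddler, while D1 and D2 share none. Both functions make a single pass over the terms, so one comparison with 10,000 terms is still cheap; the cost grows with the number of document pairs.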
Or as networks, or graphs w/ nodes & links
[Figure: graph of documents, labeled with the same three term groups listed on the trees slide]
• Nodes = Documents
• Links = Document similarity (e.g., if a document references another document)
Mining such data ~ studying graphs, or graph mining
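A document graph of this kind can be sketched by linking every pair of documents whose similarity reaches a threshold. The example reuses the three parsed documents from the term-document example. Jaccard overlap of term sets stands in for "document similarity" here, and the threshold of 0.3 is arbitrary; both are assumptions, since the slide only gives reference links as one possible criterion.

```python
from itertools import combinations

# Term sets of the parsed documents from the term-document example.
docs = {
    "D1": {"Child", "Safety", "Home"},
    "D2": {"Infant", "Toddler"},
    "D3": {"Bab", "Health", "Safety", "Infant", "Toddler"},
}


def jaccard(a, b):
    """Shared terms divided by all distinct terms of the two documents."""
    return len(a & b) / len(a | b)


def similarity_graph(docs, threshold):
    """Adjacency sets: nodes are documents, links join pairs at or above the threshold."""
    graph = {name: set() for name in docs}
    for u, v in combinations(docs, 2):
        if jaccard(docs[u], docs[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


graph = similarity_graph(docs, threshold=0.3)  # the threshold is an arbitrary assumption
print(graph)
```

Lowering the threshold adds links (at 0.1, D1 and D3 also connect through the shared term Safety), so the graph itself depends on a modeling decision.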
What apps naturally deal w/ graphs?
• Semantic Web
• Social Networks
• World Wide Web
• Drug Design, Chemical compounds
• Computer networks
• Sensor networks
Credit: Images are from Google images via search of keywords
What questions to ask about graph data?
Graph Data → Graph Mining Questions
• Academia: NCSU: Admission data
  1. Nodes=students; links=similar academics/demographics
  2. How many distinct academically performing groups of students are admitted to NCSU?
  3. Which academic group is the largest?
  4. Given a new student applicant, can we predict which academic group the student will likely belong to?
  5. Do groups of students with similar demographics usually share similar academic performance?
  6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?
  7. …
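Questions 2 and 3 above (how many groups, and which is the largest) have a simple first approximation on a similarity graph: treat each connected component as one group. The sketch below uses a made-up six-student graph; the node names and links are hypothetical, and connected components are only one of many ways to define a group.

```python
from collections import deque

# Hypothetical similarity graph: nodes are students, links join similar students.
graph = {
    "s1": {"s2", "s3"},
    "s2": {"s1"},
    "s3": {"s1"},
    "s4": {"s5"},
    "s5": {"s4"},
    "s6": set(),
}


def connected_components(graph):
    """Breadth-first search; each connected component is one candidate group."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:
            node = queue.popleft()
            group.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


groups = connected_components(graph)
largest = max(groups, key=len)
print(len(groups), largest)
```

Denser real-world graphs usually need a clustering or community-detection method instead, because a single weak link can merge two otherwise distinct groups into one component.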
Recap: Data Mining and Graph Mining
Application → Data → Questions → Data Objects + Features → Mathematical Data Representation (Data Model)
Data models: Vectors, Matrices, Graphs, Time series, Tensors, Sets, Manifolds
• Not one hat fits all
• More than one model is needed
• Models are related
How much data?
Domains: Astrophysics, Cosmology, Climate, Biology, Ecology, Web
Data volumes shown: 30 TB/day, 20-40 TB/simulation, 1 PB/year, 850 TB
1 TB (TeraByte) = 10^12 Bytes
1 PB (PetaByte) = 10^15 Bytes
My laptop: 60 GB; 1 GB (GigaByte) = 10^9 Bytes
It is not just the Size – but the Complexity
[Figure: petabytes of data that are noisy and high-dimensional, with non-linear correlations and '+' and '-' feedbacks]
Data Describes Complex Patterns/Phenomena
How to untangle the riddles of the complexity?
Complex regulation Single gene
~30k genes
50 trans elements control single gene expression
Challenge: How to “connect the dots” to answer important science/business questions?
Analytical tools that find the “dots” from data significantly reduce data.
Connecting the Dots
Finding the Dots, Connecting the Dots, Understanding the Dots
Sheer Volume of Data
• Climate
  – Now: 20-40 Terabytes/year
  – 5 years: 5-10 Petabytes/year
• Fusion
  – Now: 100 Megabytes/15 min
  – 5 years: 1000 Megabytes/2 min
Advanced Math + Algorithms
• Huge dimensional space
• Combinatorial challenge
• Complicated by noisy data
• Requires high-performance computers
Providing Predictive Understanding
• Produce bioenergy
• Stabilize CO2
• Clean toxic waste
Why Would Data Mining Matter? Enables solving many large-scale data problems
Finding the Dots, Connecting the Dots, Understanding the Dots
Science Questions
• How to effectively produce bioenergy?
• How to stabilize carbon dioxide?
• How to convert toxic into non-toxic waste?
• …
How to Move and Access the Data? Technology trends are a rate limiting factor
[Figure: CPU, Disk, Network Trend; axes: MIPS/$M, GB/$M, kB/s. Doubling: CPU every 1.2 years; Disk every 1.4 years; WAN every 0.7 years. Src: Richard Mount, SLAC]
Most of these data will NEVER be touched!
[Figure: Latency and Speed – Storage Performance; Retrieval Rate (Mbytes/s) vs. log10(Object Size Bytes) for Memory, Disk, Tape. J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000]
• Naturally distributed but effectively immovable
• Streaming/Dynamic but not re-computable
Data doubles every 9 months; CPU every 18 months.
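The consequence of the two doubling periods above can be worked out directly: a quantity that doubles every d months grows by a factor of 2^(t/d) after t months. The sketch below is a plain arithmetic illustration; the function names are hypothetical, and the 9- and 18-month periods are the ones quoted on the slide.

```python
def growth_factor(months, doubling_months):
    """Growth after `months` for a quantity that doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


def data_to_cpu_gap(months, data_doubling=9, cpu_doubling=18):
    """How many times faster data volume has grown than CPU capability."""
    return growth_factor(months, data_doubling) / growth_factor(months, cpu_doubling)


for months in (18, 36, 72):
    print(months, growth_factor(months, 9), growth_factor(months, 18), data_to_cpu_gap(months))
```

After three years the data has grown 16-fold and CPU capability 4-fold, so the gap itself doubles every 18 months.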
How to Make Sense of Data? Know Your Limits & Be Smart
Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.
To see 1 percent of a petabyte at 10 megabytes per second takes: 35 8-hour days!
[Figure: scalability of analysis in full context; axes: more data (Megabytes, Gigabytes, Terabytes, Petabytes) vs. more analysis; annotation: Human Bandwidth Overload?]
• Ultrascale Computations: Must be smart about which probe combinations to see!
• Physical Experiments: Must be smart about probe placement!
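The figure of 35 eight-hour days can be checked with a few lines of arithmetic, using 1 PB = 10^15 bytes and 1 MB = 10^6 bytes. The variable names below are illustrative.

```python
PETABYTE = 10 ** 15   # bytes
MEGABYTE = 10 ** 6    # bytes

bytes_to_view = PETABYTE // 100      # 1 percent of a petabyte = 10^13 bytes
rate = 10 * MEGABYTE                 # 10 megabytes per second
seconds = bytes_to_view / rate       # 10^6 seconds
hours = seconds / 3600               # about 278 hours
eight_hour_days = hours / 8          # about 34.7, which rounds to 35

print(seconds, hours, eight_hour_days)
```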
What Analysis Algorithms to Use? Even a simple big O analysis can elucidate simplicity.
Algorithmic Complexity:
• Calculate means: O(n)
• Calculate FFT: O(n log(n))
• Calculate SVD: O(r • c)
• Clustering algorithms: O(n^2)
Run time by data size and algorithm complexity. For illustration, the chart assumes 10^-12 sec. (1 Tflop/sec) calculation time per data point.
Data size n   n             n log(n)      n^2
100B          10^-10 sec.   10^-10 sec.   10^-8 sec.
10KB          10^-8 sec.    10^-8 sec.    10^-4 sec.
1MB           10^-6 sec.    10^-5 sec.    1 sec.
100MB         10^-4 sec.    10^-3 sec.    3 hrs
10GB          10^-2 sec.    0.1 sec.      3 yrs.
Analysis algorithms fail for a few gigabytes.
If n=10GB, then what is O(n) or O(n^2) on a teraflop computer?
1GB = 10^9 bytes; 1Tflop = 10^12 op/sec
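The closing question can be answered with a back-of-the-envelope estimate: divide the operation count by 10^12 operations per second. The sketch below ignores constant factors, treats each byte as one data point, and uses a base-10 logarithm, which appears consistent with the 0.1 sec. chart entry for n log(n); all three simplifications are assumptions.

```python
import math

FLOPS = 10 ** 12                      # 1 Tflop/sec: 10^-12 sec per operation
SECONDS_PER_YEAR = 365 * 24 * 3600


def runtime_seconds(n, complexity):
    """Estimated run time when the operation count equals the complexity function."""
    operations = {
        "n": n,
        "n log n": n * math.log10(n),   # base-10 log is an assumption
        "n^2": n ** 2,
    }[complexity]
    return operations / FLOPS


n = 10 ** 10  # 10 GB, treating each byte as one data point
linear = runtime_seconds(n, "n")
loglinear = runtime_seconds(n, "n log n")
quadratic_years = runtime_seconds(n, "n^2") / SECONDS_PER_YEAR

print(linear, loglinear, quadratic_years)
```

A linear pass over 10 GB takes about a hundredth of a second, while a quadratic algorithm on the same data needs roughly three years, which is why O(n^2) methods stop being practical at a few gigabytes.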