Extraction of high-level features from scientific data sets Eui-Hong (Sam) Han Department of Computer Science and Engineering University of Minnesota Research.

Extraction of high-level features from scientific data sets

Eui-Hong (Sam) Han

Department of Computer Science and Engineering

University of Minnesota

Research Supported by NSF, DOE, Army Research Office, AHPCRC/ARL

http://www.cs.umn.edu/~han

Joint Work with George Karypis, Ravi Janardan, Vipin Kumar, M. Pino Martin, Ivan Marusic, and Graham Candler

Scientific Data Sets

Large amounts of raw data are available from scientific domains: direct numerical simulations, NASA satellite observations/climate data, genomics, astronomy.

How do we apply existing data mining techniques to these data sets?

Direct Numerical Simulation

El Nino Effects on the Biosphere

C Potter and S. Klooster, NASA Ames Research Center

C4.5 Decision Trees

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Decision tree:

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

Splitting attribute: the splitting attribute at each node is determined based on the Gini index or entropy gain.
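As a rough illustration of how a Gini-based criterion ranks candidate splitting attributes, the sketch below computes the weighted Gini index for splits of the ten-row table on this slide. The function names `gini` and `split_gini` are hypothetical helpers introduced here, and this is not the full C4.5 procedure, only the impurity calculation that the slide mentions.

```python
from collections import Counter

# Rows of the table above: (refund, marital_status, taxable_income_in_K, cheat)
ROWS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, key):
    """Weighted Gini index of the partition of rows induced by key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[-1])
    n = len(rows)
    return sum(len(labels) / n * gini(labels) for labels in groups.values())


if __name__ == "__main__":
    print("no split      :", gini([r[-1] for r in ROWS]))
    print("Refund        :", split_gini(ROWS, lambda r: r[0]))
    print("Marital status:", split_gini(ROWS, lambda r: r[1]))
    print("Income < 80K  :", split_gini(ROWS, lambda r: r[2] < 80))
```

A lower weighted Gini index means purer partitions, so a tree builder using this criterion would prefer the attribute with the smallest value at each node.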

Associations in Transaction Data Sets

Dependency relations among collections of items appearing in transactions.

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk

Frequent item sets: sets of items that appear frequently together in transactions

|{Diaper, Milk}| = 3

|{Diaper, Milk, Beer}| = 2

Association Rules

Application areas: inventory/shelf planning, marketing and promotion

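$\{\text{Diaper}, \text{Milk}\} \Rightarrow \{\text{Beer}\}$

$\text{support} = \dfrac{|\{\text{Diaper}, \text{Milk}, \text{Beer}\}|}{|T|} = \dfrac{2}{5} = 40\%$

$\text{confidence} = \dfrac{|\{\text{Diaper}, \text{Milk}, \text{Beer}\}|}{|\{\text{Diaper}, \text{Milk}\}|} = \dfrac{2}{3} \approx 66\%$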
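The support and confidence of a rule can be checked directly against the five transactions listed on the previous slide. The following is a minimal sketch; `support_count` and `rule_metrics` are hypothetical helper names chosen for this illustration.

```python
# The five transactions from the "Associations in Transaction Data Sets" slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs => rhs as fractions in [0, 1]."""
    both = support_count(lhs | rhs, transactions)
    support = both / len(transactions)
    confidence = both / support_count(lhs, transactions)
    return support, confidence


if __name__ == "__main__":
    s, c = rule_metrics({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS)
    print(f"support={s:.2f} confidence={c:.2f}")
```

The sketch assumes the left-hand side occurs in at least one transaction; otherwise the confidence is undefined and the division would fail.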

Challenges of Applying Data Mining Techniques

How do we construct transactions in the presence of spatial attributes and in the presence of temporal attributes?

What are "interesting" events in the transactions? Examples are high-level objects (e.g., a vortex in a simulation) and high-level features (e.g., an El Nino event in weather data).

How do we find knowledge from the transactions and interesting events?

Feature extraction from simulation data using decision trees

[Figure: 3-D isosurface of "swirl strength"]

[Figure: Velocity normal to the wall on the XY plane (at z=30)]

Which features are important for high upward velocity on the XY plane?

Transaction construction

Given 3D swirl strength data and corresponding velocity data on the XY plane at each simulation time step.

swirl_strength(x,y,z) = 1 iff swirl strength at (x,y,z) > swirl threshold

velocity(x,y) = 1 iff upward velocity at (x,y) > velocity threshold

velocity(x,y) = -1 iff downward velocity at (x,y) > velocity threshold

Both are 0 otherwise.

A transaction corresponds to a grid point on the XY plane at one time step.

Class is the velocity of the grid point.

Attributes correspond to swirl_strength(x,y,z) of the neighbors of the point.

[Figure: a grid point on the XY plane with x, y, and z axes; example neighborhood attribute ss(-1:1,2:3,4:7)]
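The construction above can be sketched in a few lines. This is only an illustration under stated assumptions: the data layout (`swirl[x][y][z]`, `velocity[x][y]`), the sign convention (positive velocity is upward), the parameter `z_plane`, the treatment of out-of-range neighbors as 0, and all function names are hypothetical. Only single-point neighbor offsets are handled; range attributes such as ss(-1:1,2:3,4:7) are not modeled.

```python
def discretize_swirl(value, swirl_threshold):
    """1 if the swirl strength exceeds the threshold, else 0."""
    return 1 if value > swirl_threshold else 0


def velocity_class(v, velocity_threshold):
    """Class label: 1 (strong upward), -1 (strong downward), 0 otherwise.

    Assumes positive v means upward velocity.
    """
    if v > velocity_threshold:
        return 1
    if -v > velocity_threshold:
        return -1
    return 0


def build_transaction(swirl, velocity, x, y, z_plane, offsets,
                      swirl_threshold, velocity_threshold):
    """Build one transaction for grid point (x, y) at a single time step.

    swirl is indexed as swirl[x][y][z], velocity as velocity[x][y].
    offsets is a list of (dx, dy, dz) neighbor offsets relative to (x, y, z_plane).
    Neighbors outside the grid are treated as 0 (an assumption of this sketch).
    """
    nx, ny, nz = len(swirl), len(swirl[0]), len(swirl[0][0])
    attributes = {}
    for dx, dy, dz in offsets:
        i, j, k = x + dx, y + dy, z_plane + dz
        inside = 0 <= i < nx and 0 <= j < ny and 0 <= k < nz
        attributes[(dx, dy, dz)] = (
            discretize_swirl(swirl[i][j][k], swirl_threshold) if inside else 0
        )
    return attributes, velocity_class(velocity[x][y], velocity_threshold)
```

Calling `build_transaction` for every grid point and every time step would yield the transaction table that is passed to the decision-tree learner.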

C4.5 results on the simulation data

Given simulation data of 1000 time points: the first 500 time points were used as the training set, the second 500 time points were used as the testing set, and a 10% sample of class 0 transactions was used.

95% classification accuracy

Recall/precision of 0.83/0.95 for class -1 and 0.67/0.93 for class 1

          Classified as -1  Classified as 0  Classified as 1
Class -1  6038              1220
Class 0   320               125853           807
Class 1                     5129             10545
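The reported accuracy, recall, and precision can be recomputed from the confusion matrix. The sketch below assumes that the two cells missing from the matrix (class -1 classified as 1, and class 1 classified as -1) are zero; that is an assumption of this example, not something the slide states. The function names are hypothetical.

```python
CLASSES = [-1, 0, 1]

# Rows: actual class; columns: predicted class, both in the order of CLASSES.
# The two 0 entries are assumed values for cells missing from the slide.
CONFUSION = [
    [6038, 1220, 0],
    [320, 125853, 807],
    [0, 5129, 10545],
]


def accuracy(matrix):
    """Fraction of all transactions on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, i):
    """Correctly classified members of class i over all actual members of class i."""
    return matrix[i][i] / sum(matrix[i])


def precision(matrix, i):
    """Correctly classified members of class i over everything classified as class i."""
    return matrix[i][i] / sum(row[i] for row in matrix)


if __name__ == "__main__":
    print(f"accuracy: {accuracy(CONFUSION):.2f}")
    for i, label in enumerate(CLASSES):
        print(f"class {label}: recall={recall(CONFUSION, i):.2f} "
              f"precision={precision(CONFUSION, i):.2f}")
```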

Discovered Rules & Features

(F1: ss(0,1,0) = 0 & ss(-1,-2:-3,-4:-7) = 1 & ss(-1:1,-2:-3,8:15) = 1 & ss(1,0,2:3) = 1) => class 1

(F2: ss(0,1,0) = 0 & ss(-1:1,-2:-3,-4:-7) = 0 & ss(1,-1,-2:-3) = 0 & ss(2:3,2:3,-16:-31) = 0 & ss(1:0:-1) = 0) => class 0

(F3: ss(0,1,0) = 0 & …. & ss(-2:-3,2:3,8:15) = 1) => class -1

F1 => class 1

How to use the discovered features?

Finding association rules: (F1, Vortex Type A) => (high energy, F5)

Finding sequential patterns: (F2, Vortex Type A) => (F3, Vortex Type B) => (class 1)

Finding clusters of upward velocity points based on discovered features, vortex types, and other variables.

Finding functional relationships

http://www.cgd.ucar.edu/stats/web.book/index.html

Regression techniques find global and/or contiguous relationships

Association rules find local relationships with sufficient support

Need to find global relationships that have sufficient support

Finding functional relationships using duality transformation

Duality transformation in 2D space:

Point p=(a,b) => line p': y=ax-b

Line l: y=Ax-B => point l'=(A,B)

p on l => l' on p'

l = line between p and q => l' = intersection of p' and q'

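[Figure: three panels, "Original space", "Transformed space", and "Solution in the original space", showing points a, b, c, d; the intersection at (1,-1) in the transformed space corresponds to the line y=x+1 in the original space]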
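The duality rules can be checked with a small worked example. The sketch below maps two points to their dual lines, intersects them, and reads the intersection back as a line in the original space. The function names are hypothetical, and vertical lines (two points with the same x coordinate) are outside this parameterization.

```python
def dual_line(point):
    """Dual of point p=(a, b): the line y = a*x - b, returned as (slope, intercept)."""
    a, b = point
    return a, -b


def dual_intersection(p, q):
    """Intersection (A, B) of the dual lines p' and q'.

    (A, B) is the dual point of the line y = A*x - B through p and q.
    Raises ValueError when p and q share an x coordinate (vertical line).
    """
    (a, b), (c, d) = p, q
    if a == c:
        raise ValueError("vertical line has no dual point in this parameterization")
    A = (b - d) / (a - c)   # solve a*x - b == c*x - d for x
    B = A * a - b           # y value of p' at x = A
    return A, B


if __name__ == "__main__":
    # Two points on y = x + 1; the dual lines meet at (1, -1), i.e. y = 1*x - (-1).
    print(dual_intersection((0, 1), (2, 3)))
```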

Finding functional relationships using duality transformation

Given n points in d dimensions, find all hyperplanes that contain at least k data points.

In the transformed space, given n hyperplanes in d dimensions, find all the intersection points through which at least k hyperplanes pass.

Efficient algorithms to find intersections exist. These intersections correspond to the hyperplanes in the original space.
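For the 2D case, the problem statement above can be illustrated with a brute-force search over all pairs of points. This is a sketch under assumptions, not one of the efficient intersection algorithms the slide refers to: it uses integer coordinates with exact rational arithmetic, skips vertical lines, and the name `lines_with_support` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def lines_with_support(points, k):
    """Find all non-vertical lines y = A*x - B that contain at least k points.

    points must have integer coordinates so that dual intersections can be
    compared exactly. Returns a dict mapping (A, B) to the set of points on
    that line.
    """
    support = defaultdict(set)
    for (a, b), (c, d) in combinations(points, 2):
        if a == c:
            continue  # vertical line: no dual point in this parameterization
        A = Fraction(b - d, a - c)  # x coordinate of the dual intersection
        B = A * a - b               # y coordinate of the dual intersection
        support[(A, B)].update([(a, b), (c, d)])
    return {line: pts for line, pts in support.items() if len(pts) >= k}


if __name__ == "__main__":
    pts = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 5), (4, 0)]
    for (A, B), on_line in lines_with_support(pts, 3).items():
        print(f"y = {A}*x - ({B}) holds {len(on_line)} points")
```

Every pair of points on a common line produces the same dual intersection, so grouping the intersections recovers the supporting set of each line.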

Functional relationships in synthetic data sets

1054 data points and 2000 noise points

Found all the intersections of pairs of dual lines (one per pair of points) in the transformed space

Drew a slope-sensitive grid on the transformed space

Selected grid cells that contain more intersection points than a threshold

Plotted, for each selected grid cell, the average corresponding line on the original point space
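The steps above resemble a voting scheme over the transformed space, which can be sketched as follows. This example substitutes a uniform grid for the slope-sensitive grid, because the slides do not specify how that grid is built; the cell size, the threshold, and the function names are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def dual_intersections(points):
    """Dual-space intersection (A, B) for every pair of points with distinct x."""
    result = []
    for (a, b), (c, d) in combinations(points, 2):
        if a == c:
            continue  # vertical line: skipped in this parameterization
        A = (b - d) / (a - c)
        result.append((A, A * a - b))
    return result


def dominant_lines(points, cell, threshold):
    """Average (A, B) of every uniform grid cell holding more than threshold hits.

    Each returned pair describes a line y = A*x - B in the original space.
    """
    cells = defaultdict(list)
    for A, B in dual_intersections(points):
        cells[(math.floor(A / cell), math.floor(B / cell))].append((A, B))
    lines = []
    for hits in cells.values():
        if len(hits) > threshold:
            mean_a = sum(A for A, _ in hits) / len(hits)
            mean_b = sum(B for _, B in hits) / len(hits)
            lines.append((mean_a, mean_b))
    return lines


if __name__ == "__main__":
    on_line = [(x, 2 * x + 1) for x in range(6)]   # points on y = 2x + 1
    noise = [(0, 10), (5, 0), (2, -7)]
    print(dominant_lines(on_line + noise, cell=0.5, threshold=10))
```

With a uniform grid, steep lines spread their intersections over more cells than shallow ones, which is presumably the motivation for a slope-sensitive grid.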

Functional relationships in Ozone study

Case Studies in Environmental Statistics, by D. Nychka, W. Piegorsch, and L. Cox (http://www.cgd.ucar.edu/stats/web.book/index.html)

Daily maximum ozone measurements in parts per million (ppm), temperature, wind speed, etc., from 04/01/81 to 10/31/91 over the Chicago area

Found the most dominant functional relationship:

wspd = 0.09*ozone - 0.14*temp + 2.9

Functional relationships in Ozone study

Found a less dominant functional relationship:

wspd = 0.5*ozone - 0.4*temp + 3.03

This functional relationship covers only a subset of data points at the lower levels of ozone measurement.

Potential follow-up studies: What is unique about this functional relationship? Are there any unique characteristics of the supporting set?

How to use discovered functional relationships?

Discover decision rules using both functional relationships and original variables: (supporting R1) and (Humidity > 80%) => class high-ozone-level

Discover association rules and sequential patterns with these functional relationships: ((supporting R2), Vortex Type A) => (high upward velocity)

Comparative analysis of supporting sets of R1 and R2.

Research Issues in Finding Functional Relationships

Non-linear relationships can be found by introducing extra variables like x^2, sin(x), exp(x) for every variable x.

Spatial relationships can be found by introducing variables of neighbors.

Temporal relationships can also be found by associating time stamps with variables, e.g.:

$x^{t}_{(0,0)} = 2\,y^{t-1}_{(1,1)} + 3\,z^{t-2}_{(1,1)} + 4.5$
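The idea of introducing extra variables can be sketched as a preprocessing step that runs before the linear relationship search. The variable naming scheme (`x^2`, `sin(x)`, `exp(x)`, `x@t-1`) and the functions `expand` and `add_lags` are hypothetical choices for this illustration.

```python
import math


def expand(record):
    """Add x^2, sin(x), and exp(x) for every variable x in a record (a dict)."""
    out = dict(record)
    for name, x in record.items():
        out[f"{name}^2"] = x * x
        out[f"sin({name})"] = math.sin(x)
        out[f"exp({name})"] = math.exp(x)
    return out


def add_lags(series, lags):
    """Attach time-lagged copies of every variable to each time step.

    series is a list of dicts, one per time step. The first max(lags) steps
    are dropped because their lagged values are not available.
    """
    rows = []
    for t in range(max(lags), len(series)):
        row = dict(series[t])
        for lag in lags:
            for name, x in series[t - lag].items():
                row[f"{name}@t-{lag}"] = x
        rows.append(row)
    return rows
```

A linear relationship found among the expanded variables then corresponds to a non-linear or temporal relationship among the original ones. Spatial neighbors could be added in the same way as lags.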

Research Issues in Finding Functional Relationships

High computational cost of O(n^d) where n is the number of data points and d is the number of variables in the relationships.

Approximation algorithms are needed:

Clustering data points to reduce n.

Focusing methods, where inexact solutions are found using faster algorithms and more accurate relationships are then found by focusing on these inexact solutions.

Iterative methods where the most dominant relationship is found first and less dominant relationships are found in the later iterations
