Database Data Mining: Practical R Enterprise and Oracle Advanced Analytics
Husnu Sensoy, Global Maksimum Data & Information Technologies
October 2, 2012
Introduction Oracle Enterprise R in Practice Wrap up
Oracle R Enterprise
- The data you can process with standard R is limited by the amount of memory available on the server running R.
- To work around this problem, people implement their own solutions to store data outside memory, or fall back on data sampling techniques.
- ORE is an extension to standard R that adds Oracle steroids to it.
- The basic idea is to off-load R commands seamlessly to Oracle Database or Oracle Big Data Appliance.
This session
This session is not an R tutorial but rather a fly-over of some possible solutions to real-life scenarios using R. If you need an R tutorial, please refer to:
- Rob Kabacoff. R in Action. Manning, 2010
- Oracle R Enterprise Training 2 - Introduction to R
- R Studio
Data Visualization
- Advanced data analysis usually starts and ends with data visualization.
- Before modeling anything, data scientists use graphs and charts to understand the behaviour of the data.
- After modeling, they turn to charts again to report the results.
- R supports tens of different charting and graphing packages. To mention just two of them:
  - lattice is used to generate conditioned graphs (a.k.a. trellis graphs).
  - ggplot2 is used to make graph generation more consistent in R.
Histogram
- Do you see any significant pattern in the distribution?
- Do you like the way the histogram is represented?
    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    dataset = generateCustomer()
    h = hist(dataset$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")
Remove the Outliers
Do you see any significant pattern in the distribution?

    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    dataset = generateCustomer()

    # Keep only the rows whose value in `column` does not exceed the q-th quantile.
    # With inc=TRUE rows equal to the cutoff are kept, otherwise they are dropped.
    nooutlier = function(data, column, q=0.99, inc=TRUE) {
      cutoff = quantile(data[, column], na.rm=TRUE,
                        probs=q, names=FALSE)
      if (inc) {
        pruned = subset(data, data[, column] <= cutoff)
      } else {
        pruned = subset(data, data[, column] < cutoff)
      }
      pruned
    }

    pruned = nooutlier(dataset, "BillperPeriod", 0.99)
    h = hist(pruned$BillperPeriod, freq=TRUE,
             ylab="Number of Customers",
             xlab="Bill Amount",
             main="Bill Amount Distribution")
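The effect of the pruning step can be reproduced without the mydata.r file. The sketch below is only an illustration: the synthetic `bills` data frame and its values are assumptions, and only the 0.99 quantile cutoff comes from the slide.

```r
set.seed(1)

# Synthetic stand-in for the customer data (an assumption, not the real data):
# 990 modest bills plus 10 extreme ones that stretch the histogram's x axis.
bills <- data.frame(
  BillperPeriod = c(rlnorm(990, meanlog = 3, sdlog = 0.5),
                    runif(10, min = 5000, max = 20000))
)

# Cut at the 99th percentile, keeping values equal to the cutoff.
cutoff <- quantile(bills$BillperPeriod, probs = 0.99, na.rm = TRUE, names = FALSE)
pruned <- bills[bills$BillperPeriod <= cutoff, , drop = FALSE]

# Build both histograms without drawing them, then compare the value ranges.
h_all    <- hist(bills$BillperPeriod, plot = FALSE)
h_pruned <- hist(pruned$BillperPeriod, plot = FALSE)
range(bills$BillperPeriod)
range(pruned$BillperPeriod)
```

Before pruning, the few extreme bills dominate the axis and nearly all customers collapse into the first bin; after pruning, the bins spread over the range where most customers actually sit.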
Conditional Histograms
    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)
    dataset = generateCustomer()
    pruned = nooutlier(dataset, "BillperPeriod", 0.99)
    library(lattice)
    histogram(~ BillperPeriod | UsingServiceX,
              data=pruned)
Too Many Columns to Visualize
    source("~/r-snipplets/oow2012/mydata.r", local=TRUE)
    source("~/r-snipplets/oow2012/commons.r", local=TRUE)
    dataset = generateCustomer()
    head(dataset)
    pruned = nooutlier(dataset, "BillperPeriod", 0.99)
    library(lattice)
    histogram(~ BillperPeriod | CarBrand,
              data=pruned)
A Bit of Probability and Information Theory
Comparing Histograms
- We need a way to calculate the similarity between those histograms.
- A strong tool from information theory, the Kullback-Leibler divergence, allows us to define a distance measure between two distributions (strictly speaking it is not a metric, since it is asymmetric).
    equiwidth = function(data, col, n=10, sf=1e-6) {
      qlist = quantile(data[, col], na.rm=TRUE,
      # [the rest of the function is cut off in the source]

    ddf = NULL
    baseline = equiwidth(pruned, "BillperPeriod")
    for (brand in dataset[!duplicated(dataset[, c('CarBrand')]), 1]) {
      brandDist = equiwidth(subset(pruned, pruned[, 'CarBrand'] == brand),
                            "BillperPeriod")
      ddf = rbind(ddf,
                  data.frame(carbrand=brand,
                             kl=kldistance(baseline, brandDist)))
    }
    head(ddf[order(ddf$kl, decreasing=TRUE), ])
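The slides do not show `kldistance`, so what follows is only a minimal sketch of how a Kullback-Leibler divergence between two binned distributions could be computed. The names `bin_probs` and `kl_divergence`, the smoothing constant `eps` and the synthetic data are all hypothetical and are not the author's helpers.

```r
# Hypothetical helper: bin a numeric vector over fixed break points and return
# a probability vector. A small eps is added to every bin so that empty bins
# do not produce log(0) or a division by zero.
bin_probs <- function(x, breaks, eps = 1e-6) {
  counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
  p <- as.numeric(counts) + eps
  p / sum(p)
}

# Kullback-Leibler divergence D(p || q) in nats.
kl_divergence <- function(p, q) sum(p * log(p / q))

set.seed(42)
# Synthetic stand-ins (assumptions): all customers versus one car brand.
all_bills   <- rexp(5000, rate = 1 / 50)
brand_bills <- rexp(500, rate = 1 / 80)

# Both distributions must share the same bins to be comparable.
breaks <- seq(0, max(all_bills, brand_bills), length.out = 11)
p <- bin_probs(all_bills, breaks)
q <- bin_probs(brand_bills, breaks)

kl_divergence(p, q)
kl_divergence(q, p)  # generally a different value: the divergence is asymmetric
```

A divergence of zero means the two binned distributions are identical; larger values flag the brands whose bill distribution departs most from the baseline, which is what the descending sort in the slide code surfaces.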
Optimization
Problem Definition
- We have a terrain covered by several stations, and each point on the terrain has one of the following statuses:
  GREEN: the region is in the line of sight (LoS) of at least one station.
  YELLOW: the region is in the LoS of at least one station but far away.
  RED: the region is out of LoS.
- For a fixed number of stations, we need to cover as much of the terrain as we can.
Model Sketch Up [1]
1. Define a function to calculate the ratio of green zones on the terrain.
2. Give this function to one of R's optimization routines (the Nelder-Mead technique), which can handle non-smooth target functions.
3. Get the optimal station distribution.
    targetfunc = function(observer) {
      # One row per station: (x, y)
      m = matrix(data=observer, ncol=2, byrow=TRUE)
      # Compute merged status of all observers
      mergedstatus <- rep("red", length(terr$height))
      for (i in seq_len(nrow(m))) {
        terr$dist2observer = distance(terr, c(m[i, ], 7))
        status = LoS(terr, c(m[i, ], 7), maxDist)
        mergedstatus = updatestatus(mergedstatus, status)
      }
      sum(mergedstatus == "green")
    }

    # Nelder-Mead is optim's default method; fnscale=-1 turns it into a maximization
    result <- optim(observers, targetfunc,
                    control=list(fnscale=-1, trace=5,
                                 REPORT=1))
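Since `terr`, `LoS` and `updatestatus` are not shown in the slides, the same pattern can be tried on a toy problem. This is a sketch under stated assumptions: the flat grid `terrain`, the fixed `radius` and the `coverage` function are hypothetical simplifications with no real line-of-sight computation; only the use of `optim` with `fnscale=-1` mirrors the slide.

```r
# Toy terrain: a 21 x 21 grid of points (hypothetical stand-in for `terr`).
terrain <- expand.grid(x = 0:20, y = 0:20)
radius  <- 5  # assumed coverage radius of a station

# Number of terrain points covered by at least one station.
# `stations` is a flat vector (x1, y1, x2, y2, ...), the shape optim() works with.
coverage <- function(stations) {
  m <- matrix(stations, ncol = 2, byrow = TRUE)
  covered <- rep(FALSE, nrow(terrain))
  for (i in seq_len(nrow(m))) {
    d <- sqrt((terrain$x - m[i, 1])^2 + (terrain$y - m[i, 2])^2)
    covered <- covered | (d <= radius)
  }
  sum(covered)
}

# Two stations, deliberately placed in a corner and overlapping.
start <- c(2, 2, 3, 3)

# fnscale = -1 makes optim() maximize the target instead of minimizing it.
res <- optim(start, coverage, method = "Nelder-Mead",
             control = list(fnscale = -1))

coverage(start)    # covered points at the starting layout
coverage(res$par)  # covered points at the layout found by Nelder-Mead
```

The target is a step function of the station coordinates, so gradient-based methods have nothing to work with; Nelder-Mead only compares function values, which is why it suits this kind of problem. It can still stall on flat regions, so trying several starting layouts is a reasonable precaution.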
[1] Refer to LoS Analysis (Part 4)
1 Station (54%)
3 Stations (83%)
6 Stations (99%)
Text Analysis & Decision Trees
Problem Definition
- Given a string that subscribers have written incorrectly, either intentionally or by mistake, how can we build a model that deduces the most probable string among 3 possibilities (or chooses not to make a decision)?
More Feature Engineering using Jaro-Winkler Algorithm
Jaro-Winkler distance is a measure of closeness between strings that can be used for fuzzy string matching resilient to typos.
    library(RecordLinkage)
    enhanced = data.frame(df,
                          momScore = jarowinkler("mom", df$orginal),
                          dadScore = jarowinkler("dad", df$orginal),
                          brotherScore = jarowinkler("brother", df$orginal))
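To see what such a score measures without installing RecordLinkage, the textbook Jaro-Winkler similarity can be sketched in base R. The functions `jaro` and `jaro_winkler` below are illustrative re-implementations, not the package's code, and the sample inputs are made up; treat the sketch as an assumption about the usual definition (prefix weight 0.1, prefix capped at 4 characters).

```r
# Jaro similarity between two single strings (1 = identical, 0 = nothing in common).
jaro <- function(a, b) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1); n2 <- length(s2)
  if (n1 == 0 || n2 == 0) return(as.numeric(n1 == n2))
  # Characters match only if they are equal and not too far apart.
  window <- max(floor(max(n1, n2) / 2) - 1, 0)
  matched1 <- rep(FALSE, n1)
  matched2 <- rep(FALSE, n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(0)
  # Transpositions: matched characters that appear in a different order, halved.
  t <- sum(s1[matched1] != s2[matched2]) / 2
  (m / n1 + m / n2 + (m - t) / m) / 3
}

# Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
jaro_winkler <- function(a, b, p = 0.1) {
  sim <- jaro(a, b)
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  prefix <- 0
  for (i in seq_len(min(4, length(s1), length(s2)))) {
    if (s1[i] != s2[i]) break
    prefix <- prefix + 1
  }
  sim + prefix * p * (1 - sim)
}

# Made-up misspellings scored against the three candidate strings.
typed <- c("mum", "daad", "brothr")
scores <- data.frame(
  typed        = typed,
  momScore     = sapply(typed, function(s) jaro_winkler("mom", s)),
  dadScore     = sapply(typed, function(s) jaro_winkler("dad", s)),
  brotherScore = sapply(typed, function(s) jaro_winkler("brother", s)),
  row.names    = NULL
)
scores
```

Each row then carries three numeric features, one per candidate, which is the shape a decision tree needs in order to learn thresholds such as "pick the candidate only when its score is clearly the highest".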